Torn over the C++ question

December 2nd, 2009 Josh 5 comments

I am having a very difficult time deciding whether to go through with the C++ port of upb or to stay in C.

I’ve ported about one third of upb to C++, on a branch, to see how it would turn out. It was a ton of work. Here are my current observations:

  • The C++ is cleaner, more readable, less error-prone code. It’s just a fact. Compare for yourself (C: upb_def.h, upb_def.c; C++: upb_def.h, upb_def.cc). This is due to numerous factors:
    • type-safe containers mean fewer casts.
    • “public” and “private” keywords make it easy to separate the private parts of your interface, without having to specify in comments which is which.
    • namespaces and class scope mean that I don’t have to write out my identifiers like upb_fielddef_dothis(), I can just write DoThis().
    • real inheritance and member classes mean I don’t have to explicitly call all the right constructors/destructors, or write explicit casts for upcasts.
    • destructors that are guaranteed to run on scope exit mean I can use RAII patterns like mutexes that automatically unlock when the scope is exited.
  • The source got shorter; the portion I ported went from 1483 lines to 1133, a reduction of about 24%.
  • The binary got a LOT bigger. I had one function get literally 5x as big. I haven’t figured out why this happened yet. I used templates to make the table generic, but I was extremely careful to make sure that the template only generated a small amount of code — basically just the hash lookup routine, which is small (note: the hash function for strings was not templated or inlined). But another issue is that the C++ compiler appears to emit multiple copies of the same function in the same object file! For example, I found some virtual destructors emitted literally three times in the same file. Why is this?
  • I just heard back from a security guru from the Google security team, who said that C is often easier to audit than C++ because it’s easier to figure out what is actually going on, without having to dig through layers of abstraction. This surprised me (maybe it shouldn’t have, since Sam Quigley said the same thing in a comment on my last entry), but I was also a little bit relieved.

I’m leaning towards sticking with C, for the following reasons:

  • C++ compilers aren’t very good at keeping things small, even when you are judicious with your use of templates.
  • C++ compilers are much more complicated than C compilers, and therefore not as ubiquitous or as easy to trust generally.
  • C isn’t harder to audit for security than C++, and may actually be easier.

I’ll try to take some of the lessons I learned from my partial C++ port to make the C more readable.
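One lesson that could plausibly carry back to C is the cast reduction that type-safe containers gave the C++ version. The following is a minimal sketch, not code from upb: it assumes a hypothetical generic `table` that stores `void*` values, and a made-up macro `DEFINE_TYPED_TABLE` that generates thin typed wrappers so the pointer conversions live in one place instead of at every call site.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic table: fixed capacity, integer keys, untyped values. */
enum { TABLE_CAP = 16 };

typedef struct {
  int keys[TABLE_CAP];
  void *vals[TABLE_CAP];
  size_t count;
} table;

static void table_init(table *t) { t->count = 0; }

static void table_insert(table *t, int key, void *val) {
  assert(t->count < TABLE_CAP); /* Sketch only: no growth, no duplicate check. */
  t->keys[t->count] = key;
  t->vals[t->count] = val;
  t->count++;
}

static void *table_lookup(const table *t, int key) {
  for (size_t i = 0; i < t->count; i++) {
    if (t->keys[i] == key) return t->vals[i];
  }
  return NULL;
}

/* Generates a distinct struct type plus typed init/insert/lookup wrappers.
 * Passing the wrong value type to the wrappers is flagged by the compiler. */
#define DEFINE_TYPED_TABLE(name, type)                        \
  typedef struct { table t; } name;                           \
  static void name##_init(name *tt) { table_init(&tt->t); }   \
  static void name##_insert(name *tt, int key, type *val) {   \
    table_insert(&tt->t, key, val);                           \
  }                                                           \
  static type *name##_lookup(const name *tt, int key) {       \
    return table_lookup(&tt->t, key);                         \
  }

/* Hypothetical value type, for illustration only. */
typedef struct {
  int number;
  const char *name;
} fielddef;

DEFINE_TYPED_TABLE(fielddef_table, fielddef)
```

A caller then writes `fielddef_table_lookup(&tbl, 1)->number` with no cast. The cost is one macro expansion per value type, which is the same trade the template made in the C++ port.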

Categories: Uncategorized Tags:

Gazelle/upb status and plans (aka: On Releasing)

November 28th, 2009 Josh 1 comment

This summer my friends Ben and Mike gave me grief about never releasing anything. Their criticism is definitely valid to some degree. I’ve been working on Gazelle for about two years now, and upb for almost one. Gazelle has had four releases in that time, but they have mostly focused on moving Gazelle to where I think it ought to be, as opposed to releasing something hacky that people can actually use now. There is a class of problems that Gazelle is useful for now, but it is pretty small in comparison to the amount of work I’ve put in.

I haven’t released upb at all yet, and my last message indicating I’m thinking of porting it to C++ will probably make skeptical readers think I’m moving farther away from a release rather than closer to one.

Since I agree that my progress doesn’t look too promising to someone observing from the outside, let me say where I think these projects currently are, where they’re going, and when they’re likely to release.

First of all, Gazelle is currently pushed on the stack until I have upb released. The reason is that I realized that Protocol Buffers are the answer to two big problems I was facing with Gazelle:

  1. byte-code format: right now the Gazelle byte-code format is LLVM’s BitCode, which is the format LLVM uses for storing its byte-code internally. I invested a lot into BitCode (you’ll notice my name is on the linked document), including writing a standalone encoder and decoder (230 lines of Lua and 856 lines of C, respectively). But this was before I worked at Google or knew about Protocol Buffers. Protocol Buffers are much easier to use because they have a formal schema (the .proto file) that can generate nice APIs and help you out with backward compatibility. Without a formal schema, BitCode makes you resort to things like an ad hoc text file that describes the schema. This approach was showing its limits.
  2. parse tree format: I always knew I wanted Gazelle to be capable of generating parse trees in some kind of standard format. Protocol Buffers end up being a match made in heaven, since they are isomorphic to parse trees in a very deep way. Indeed, the ast system for the Elkhound Parser is very much like Protocol Buffers in that you define your parse tree format and it generates classes for representing your AST.

Since Gazelle is gated on upb, the question then is: when will upb release? Why hasn’t it released already?

A few months ago I was working on upb for 100% of my time at work. I had banked 20% time for a while, and I was also a bit burned out on my 80% project, so my manager very graciously gave me the liberty to work on upb for all of my working hours.

During that time upb made progress in several areas. It got some better benchmarks and tests, and I fleshed out the upb compiler so that it wasn’t dependent on the official Protocol Buffers compiler for bootstrapping. Maybe most importantly, I worked a lot on the in-memory message format to figure out how to make it work well with dynamic languages.

My goal during that time was to write a Python extension that a few initial internal-to-Google customers could use. The value proposition is that it would be API-compatible with what they were already using, but many times faster. I wrote said extension, which was incomplete (supported decoding only, not encoding), but looked complete enough to use for this case.

By this time I was approaching the amount of time I could reasonably ask from my manager at work, so I had to tie up the loose ends and get it into my initial customer’s hands. I put all the pieces together and tried it out, but then ran into a problem; I hadn’t realized that this initial customer was using an old deprecated feature of Protocol Buffers called MessageSet. There was no way I could support MessageSet without significant changes. I was defeated for the moment. I had to take a break for a few months and re-devote my time to my 80% project.

I mention this all just to illustrate that I do have actual customers that I am targeting, and I have had aggressive pushes to deliver something to those customers, but unfortunately my work wasn’t complete enough for them yet.

This brings us up to now. In the last week or two, I have made several strides, including executing on part of a design that will get me MessageSet support. I have also developed an interface for a “pick parser”, which lets you pull only a small subset of fields out of a protobuf. This will be a big win for use cases that only need a few fields from a very large proto, and I have a customer internal-to-Google who is very interested in this interface.
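To make the idea concrete, here is a minimal sketch of pulling one field out of an encoded message while skipping everything else at the wire-format level. This is not the pick parser interface itself, which this post does not show. The function names (`read_varint`, `skip_bytes`, `pick_varint`) are hypothetical, and the sketch only looks at top-level varint fields and gives up on the deprecated group wire types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a base-128 varint from [*p, end). Returns false on truncated or
 * over-long input; on success advances *p past the varint. */
static bool read_varint(const uint8_t **p, const uint8_t *end, uint64_t *out) {
  uint64_t val = 0;
  for (int shift = 0; shift < 64 && *p < end; shift += 7) {
    uint8_t byte = *(*p)++;
    val |= (uint64_t)(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = val;
      return true;
    }
  }
  return false;
}

/* Skips n bytes, failing if that would run past the end of the buffer. */
static bool skip_bytes(const uint8_t **p, const uint8_t *end, uint64_t n) {
  if (n > (uint64_t)(end - *p)) return false;
  *p += n;
  return true;
}

/* Scans a serialized message for the first varint-typed occurrence of field
 * number `wanted` and stores its value in *out. Every other field is skipped
 * using only its wire type, without decoding its contents. */
static bool pick_varint(const uint8_t *buf, size_t len, uint32_t wanted,
                        uint64_t *out) {
  assert(buf != NULL && out != NULL);
  const uint8_t *p = buf, *end = buf + len;
  while (p < end) {
    uint64_t tag, v;
    if (!read_varint(&p, end, &tag)) return false;
    uint32_t field = (uint32_t)(tag >> 3); /* tag = (field << 3) | wire type */
    switch (tag & 7) {
      case 0: /* varint */
        if (!read_varint(&p, end, &v)) return false;
        if (field == wanted) {
          *out = v;
          return true;
        }
        break;
      case 1: /* fixed 64-bit */
        if (!skip_bytes(&p, end, 8)) return false;
        break;
      case 2: /* length-delimited: strings, bytes, submessages */
        if (!read_varint(&p, end, &v) || !skip_bytes(&p, end, v)) return false;
        break;
      case 5: /* fixed 32-bit */
        if (!skip_bytes(&p, end, 4)) return false;
        break;
      default: /* groups (3, 4) and unknown wire types: not handled here */
        return false;
    }
  }
  return false; /* field not present */
}
```

The win for large protos comes from the length-delimited case: a big submessage or string is stepped over in one pointer addition rather than being parsed into memory.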

Meanwhile I’m very interested in trying to get the upb Python extension into AppEngine, because I think it could be a huge win there since users aren’t allowed to load custom Python extensions. This means that currently, people trying to use protocol buffers on AppEngine are limited to pure-Python implementations that are much slower than a C extension can be. But to get into AppEngine I will need to get a security audit, which is part of the reason I am leaning towards C++ at this point. I think C++ will make the code shorter and less gnarly (fewer casts), which should lead to easier verifiability. I have converted one header file so far, and it got 38% smaller and much easier to read.

I hesitate to make schedule estimates, but my main purpose is to impress on my possibly-impatient audience that:

  • I do have motivation to release.
  • I do have initial customers and initial use cases.
  • I am making progress.
  • I am currently focused on delivering (1) a pick parser, (2) a Python extension, (3) an easily-auditable code-base.
  • I look forward to being able to announce my first release!

Yours,
Josh


Porting upb to C++?

November 28th, 2009 Josh 7 comments

I am on the verge of trying something I never thought I’d do. I’m considering porting upb to C++.

My reasons aren’t ideological; they are highly practical. Basically I am realizing that while object-oriented C is OK for a while, it’s very weak at inheritance. Inheritance in C involves a lot of casting, duplicated code and/or macros, and careful discipline. The main problems with this are:

  • the code gets longer and less readable
  • the code involves more possibly-unsafe operations like casts

Both of these problems make the code ultimately more difficult to audit for security. And getting upb audited for security is something I plan to do very soon.
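For readers who have not written object-oriented C, here is a minimal sketch of the pattern and of where the casts and the discipline come in. The types `def` and `msgdef` are hypothetical stand-ins chosen for illustration, not upb’s actual definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { DEF_MSG, DEF_ENUM } deftype;

/* "Base class": every kind of def starts with this struct. */
typedef struct {
  deftype type;
  char name[32];
} def;

/* "Derived class": the base must be the first member so that a msgdef* and
 * a pointer to its base refer to the same address. */
typedef struct {
  def base;
  int num_fields;
} msgdef;

/* The base "constructor" has to be called by hand from every derived one. */
static void def_init(def *d, deftype type, const char *name) {
  d->type = type;
  strncpy(d->name, name, sizeof(d->name) - 1);
  d->name[sizeof(d->name) - 1] = '\0';
}

static msgdef *msgdef_new(const char *name, int num_fields) {
  msgdef *m = malloc(sizeof(*m));
  if (!m) return NULL;
  def_init(&m->base, DEF_MSG, name);
  m->num_fields = num_fields;
  return m;
}

/* Upcast: always safe, and expressible without a cast. */
static def *msgdef_upcast(msgdef *m) { return &m->base; }

/* Downcast: an explicit cast the compiler cannot check, so the type tag is
 * verified by hand. Forgetting this check is the kind of slip an auditor
 * has to look for. */
static msgdef *def_downcast_msgdef(def *d) {
  assert(d->type == DEF_MSG);
  return (msgdef *)d;
}
```

Each additional derived type repeats the constructor chaining and the checked downcast, which is where the duplicated code or macros mentioned above come from.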

I am coming to believe that porting to C++ would make upb smaller (in lines of code) and easier to verify for security. However, there are a few major disadvantages that are giving me pause:

  • there are still some contexts in which C++ is a no-go, like the Linux kernel, embedded systems that only have a C compiler (but no C++), or projects that want to stay C-only. Doing this port would make upb unsuitable for these contexts.
  • projects that are currently C-only would need to create C++ source files to call upb APIs, and would have to link in the C++ runtime.
  • (possible) C++ could result in a larger binary.

When I look at the downsides though, they don’t seem to pertain to my initial goals of making upb useful for Python, Lua, Ruby, etc. extensions, and for use inside Google. Being useful for really restricted embedded systems is a far-off use case. So it’s sounding like porting to C++ is the right thing to do.

I hope it significantly reduces the line count, as I expect it will. That will make me feel better about giving up the minimalism of C. I will definitely be compiling with -fno-exceptions -fno-rtti -fvisibility-inlines-hidden on gcc. I also won’t be using any of the C++ standard library (not even <string>).

Categories: upb Tags:

Wanted: a portable mutex and atomic refcount

August 14th, 2009 Josh 1 comment

upb needs to have some lightweight thread-aware behavior. I’m leaving most synchronization up to users (individual messages will not be thread-safe), but there are a few central structures I need to make thread-safe and reference-counted.

I need only the tiniest bit of functionality:

  • a portable mutex.
  • a portable atomic_t that lets me atomic_inc() and atomic_dec().

We’re talking “lives in one single header” small. The mutex would just be a wrapper around existing mutex implementations (pthreads, Windows, etc.), and since those routines typically take care of any memory barriers you need to safely read/mutate the shared state, I wouldn’t have to worry about that.

The atomic type would have to be hand-coded and architecture-specific, since most threading libraries don’t provide one. The reason for providing this would be reference-counting. If you are reference-counting an immutable structure, then you don’t need to worry about memory barriers to ensure the consistency of that structure; if you’re reference-counting a mutable structure, then you’ll need to protect the mutable state with mutexes and acquire the mutex before freeing anything.
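A minimal sketch of such a reference count follows. It assumes a C11 compiler with `<stdatomic.h>`, which was not available when this was written (the post anticipates hand-coded, architecture-specific code instead). The names `refcount`, `refcount_ref`, and `refcount_unref` are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
  atomic_int count;
} refcount;

/* A new object starts with one reference, owned by its creator. */
static void refcount_init(refcount *r) { atomic_init(&r->count, 1); }

/* Taking another reference needs only atomicity, not ordering: the caller
 * already holds a reference, so the object cannot disappear underneath it. */
static void refcount_ref(refcount *r) {
  atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/* Drops one reference. Returns true if this was the last one, in which case
 * the caller is responsible for freeing the object. Acquire-release ordering
 * is the conservative choice here: earlier accesses by other threads are
 * ordered before the free that follows a true return. */
static bool refcount_unref(refcount *r) {
  int old = atomic_fetch_sub_explicit(&r->count, 1, memory_order_acq_rel);
  assert(old > 0); /* Catches unref without a matching ref. */
  return old == 1;
}
```

For a mutable structure, the caller would still acquire the structure’s mutex before freeing anything when `refcount_unref` returns true, as described above.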

The library (er, header file) should also support compiling everything to nothing if NO_THREAD_SAFETY is defined as a preprocessor symbol.
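The compile-to-nothing behavior could look something like the sketch below. It covers only a pthreads backend (a Windows branch would wrap `CRITICAL_SECTION` in the same shape), and the `pmutex_*` names and the `shared_counter` example are hypothetical.

```c
#include <assert.h>

#ifdef NO_THREAD_SAFETY

/* Single-threaded build: the type is a placeholder and every operation is an
 * empty inline function that the compiler can remove entirely. */
typedef struct { char unused; } pmutex;
static inline void pmutex_init(pmutex *m) { (void)m; }
static inline void pmutex_lock(pmutex *m) { (void)m; }
static inline void pmutex_unlock(pmutex *m) { (void)m; }
static inline void pmutex_destroy(pmutex *m) { (void)m; }

#else

#include <pthread.h>

/* Thread-safe build: thin wrappers over the platform mutex. Errors are
 * treated as programming bugs and checked with assert. */
typedef struct { pthread_mutex_t impl; } pmutex;

static inline void pmutex_init(pmutex *m) {
  int err = pthread_mutex_init(&m->impl, NULL);
  assert(err == 0);
  (void)err;
}
static inline void pmutex_lock(pmutex *m) {
  int err = pthread_mutex_lock(&m->impl);
  assert(err == 0);
  (void)err;
}
static inline void pmutex_unlock(pmutex *m) {
  int err = pthread_mutex_unlock(&m->impl);
  assert(err == 0);
  (void)err;
}
static inline void pmutex_destroy(pmutex *m) {
  int err = pthread_mutex_destroy(&m->impl);
  assert(err == 0);
  (void)err;
}

#endif

/* Example of a central structure guarded by the wrapper. The calling code is
 * identical in both build modes. */
typedef struct {
  pmutex lock;
  int value;
} shared_counter;

static int shared_counter_incr(shared_counter *c) {
  pmutex_lock(&c->lock);
  int result = ++c->value;
  pmutex_unlock(&c->lock);
  return result;
}
```

Building with `-DNO_THREAD_SAFETY` selects the empty variants, so code such as `shared_counter_incr` pays nothing for locking in single-threaded programs.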

Yes, that all sounds good. Tiny yet functional. I’ll be writing this very soon unless something exactly like what I’ve described happens to already exist.


Wanted: a mailing list reader website

August 10th, 2009 Josh No comments

There is tons of interesting discussion that happens on technical mailing lists. Mailing lists are the best snapshot of the state of a software project; they capture what current users are trying to do, where they’re succeeding, where they’re running into trouble, and what the current plans are for making things better.

Unfortunately, there is still no good way (AFAICS) to lurk on high-volume mailing lists. Your current options are:

  • Subscribe your personal email address to the mailing list. Pros: threads well, tracks “read” status well, easy to reply. Cons: too much overhead to subscribe to a new list (subscribe, confirm, set up mail filter), mail builds up as unread if you don’t read it for a while. Overwhelming for high volume lists. Not convenient when your level of interest in a list varies.
  • Read on gmane.org. Pros: threads somewhat well (new messages on old threads get lost, because the whole thread is sorted based on when its first message arrived), tracks your “read” status (but not across browsers or computers), easy to track a single list and only read what interests you. Cons: not easy to track multiple lists. There’s no top-level “what’s new on lists I care about” view. Not easy to reply (the built-in editor makes you wrap lines yourself).
  • RSS feed from gmane.org. Pros: easy to track multiple lists, old mail doesn’t build up if I don’t read it for a while, lets me read mailing list posts alongside my other feeds. Cons: RSS is a terrible match for mailing lists. It doesn’t understand threads at all. No easy way to reply. Also, the gmane RSS feeds link you to the blog interface, which is equally terrible at threading.
  • The Lurker email archiver. I’ve been a fan of this project for a while. Pros: GREAT interface (check out this demo site), top-level page that summarizes threads and their activity, thread view that shows you replies according to both time and threading. Cons: you have to run it yourself (it’s a project, not a service), doesn’t remember what you’ve read, and no easy way to reply.

What I would love is a website that would let me easily lurk on mailing lists. I’d love an interface like Lurker, but that I log into so that it knows what I’ve read. I’d want a top-level view that shows popular threads across ALL mailing lists I’m lurking on, not just one mailing list at a time. And I’d want the ability to easily reply to mailing list threads.

If anyone knows a better way to get what I want, please let me know!
