It has been a long time since I have posted any updates about upb, my implementation of Protocol Buffers. In 2011 I posted the article upb status and preliminary performance numbers, showing that upb’s JIT compiler yielded impressive performance results parsing Protocol Buffers, particularly when used as a stream parser. Since then there has been a lot of activity on GitHub, but still no releases, and I haven’t posted much about upb or where it’s going. It’s time for a status update.

On a personal note, I also have exciting news to report: as of last August, I have joined the Protocol Buffers team at Google officially. After five years of being very interested in Protocol Buffers and working on upb as my 20% project, I now work on Protocol Buffers full time (and still work on upb 20%).

As has always been the case, on this blog I speak only for myself: not for my team, or for Google!

But on to the upb status updates. A lot has happened since I last wrote. There have been hundreds of commits, and lots of changes, big and small. Obviously I won’t cover everything, but I’ll start with big-picture news, then get a little more specific, then talk vision for a bit.

haberman/upb is now google/upb

Up until this week, upb’s main repository lived on GitHub under my own username. As of two days ago, its home is under the Google organization. As a 20% project, upb has always been under Google copyright. Living under the Google organization will make it easier for Google to ensure that it follows policies surrounding CLAs (contributor license agreements) and such.

This also makes upb a bit more “official” as an open-source offering of Google. This will enable Google to incorporate upb into other projects/products, which brings me to my next bit of news.

Google is using upb for the MRI Ruby Protocol Buffers implementation

Though the code has been public on GitHub for months now, I’m very happy to formally announce that the official Google-supported Ruby library for protobufs (on MRI) uses upb for parsing and serialization! The Ruby-specific code was written by my excellent teammate Chris Fallin. The data representation (i.e. the Ruby classes representing protobufs) is all Ruby-specific, but under the hood it uses upb for all parsing and serialization.

One of upb’s goals from the beginning was to be an ideal implementation for dynamic languages like Ruby, Python, PHP, etc., so it is extremely gratifying to finally see this in action.

When we were evaluating how to implement the Ruby Protocol Buffers library, one of the major questions was whether to implement it in pure Ruby or to use a C extension. Staying pure Ruby has notable portability benefits, but at the end of the day the performance benefit of a C extension approach was too great to pass up. The micro-benchmark numbers of our extension vs. a couple of pure-Ruby implementations make this clear:

Test payload is a 45-byte protobuf

Ruby Beefcake    (pure-Ruby):                0.7MB/s
Ruby Protobuf    (pure-Ruby):                0.8MB/s
Google Protobuf  (C accelerated with upb):  11.9MB/s

Representative microbenchmarking is notoriously difficult, but the point of this is just to show that C acceleration when it comes to parsing makes a big difference. That is a big part of why I created upb to begin with.

For a lot of use cases, the difference doesn’t matter. Lots of people who make remote API calls just need to parse a modestly-sized payload, and it’s not in a tight loop or anything. These people probably won’t care about the perf difference between Beefcake and Google Protobuf.

But there always end up being people who care. When the payload gets big enough, or the loop gets tight enough, a 10x difference is huge. We see this at Google all the time. And just witness the amount of effort that people have put into optimizing JSON parsing – for example, the oj (Optimized JSON) library for Ruby. We wanted to be sure that the protobuf library would be at least speed-competitive with the (pretty fast) built-in JSON library that comes with Ruby. And in that we have succeeded – in my one silly test, we are roughly speed-competitive with the JSON library (a little faster or a little slower, depending on how you measure).

This doesn’t help JRuby users unfortunately, but I am very happy that open-source contributors have come forward to implement an API-compatible version of the library for JRuby by wrapping the Protocol Buffers Java implementation. So both MRI and JRuby can get fast parsers, and code should be portable between the two. The cost is having to maintain two implementations under the hood.

By the way, the Ruby Protocol Buffers library uses upb, but it doesn’t use upb’s JIT at all for parsing. The numbers above reflect upb’s pure-C, bytecode-based decoder. There is no point in enabling the JIT, because the actual parsing represents such a small portion of the overall CPU cost of parsing from Ruby – upb by itself can parse at more than 10x the speed of the Ruby extension, even with the pure C parser. The vast majority of the time is spent in the Ruby interpreter creating/destroying objects, and other Ruby-imposed overhead. Enabling the JIT would barely make a measurable difference. This is nice because avoiding the JIT here means we avoid the portability constraints of the JIT.

JSON support

The Ruby implementation of Protocol Buffers is part of a larger effort known as “proto3” (see here and here). I won’t go too deeply into what proto3 is about because that would be its own blog entry. But one important element of the proto3 story is that JSON support is becoming first-class. Protobuf binary format and JSON format are both options for sending payloads. If you want a human-readable payload, send JSON – proto3 libraries will support both.

Since Ruby-Protobuf uses upb for all of its parsing/serialization, that must mean that upb now supports JSON, and indeed it does!

Everything in upb is stream-based; this means that upb supports JSON<->protobuf transcodes in a totally streaming fashion. Internally the protobuf encoder has to buffer submessages, since Protocol Buffer binary format specifies that submessages on the wire are prefixed by their length. But from an API perspective, it is all streaming. (It would be possible to write a version of the encoder that avoids the internal buffering by doing two passes, but that is an exercise for a different day).
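To make the buffering point concrete, here is a minimal Python sketch of the protobuf wire format for a length-prefixed submessage. It is not upb code; the helper names (encode_varint, encode_submessage, and so on) are made up for illustration. The only thing it tries to show is that the length has to be written before the body, so a single-pass encoder needs the whole submessage body in hand first.

```python
WIRE_VARINT = 0  # wire type for varint-encoded scalars
WIRE_LEN = 2     # wire type for length-delimited data (strings, bytes, submessages)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field tag is the field number and wire type packed into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_varint_field(field_number: int, value: int) -> bytes:
    return encode_tag(field_number, WIRE_VARINT) + encode_varint(value)


def encode_submessage(field_number: int, body: bytes) -> bytes:
    # The length prefix precedes the body on the wire, so the body must be
    # fully encoded (buffered) before anything for this field can be emitted.
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(body)) + body


inner = encode_varint_field(1, 150)  # submessage body: field 1 = 150
outer = encode_submessage(3, inner)  # field 3 holds the submessage
print(inner.hex(), outer.hex())
```

A two-pass encoder, as mentioned above, would compute len(body) in a first pass and stream the bytes in a second, trading the buffer for extra work.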

Protocol Buffers and JSON, next the world?

I had an epiphany at some point, which was that upb could be a more generalized parsing framework, not limited to just Protocol Buffers.

upb was inspired by SAX, the Simple API for XML. The basic idea of SAX is that you have a parser that calls callbacks instead of building a data structure directly. This model makes for fast and flexible parsers, and has been imitated by other parsers such as YAJL (for JSON).

So upb supports parsing protobufs by letting you register handler functions for every protobuf field. Parsing field A calls handler A, parsing field B calls handler B, and so on.
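As a rough illustration of the handler model (and emphatically not upb’s real API), here is a small Python sketch. The Handlers class, set_handler, and parse are hypothetical names, and the parser only understands varint and length-delimited fields. The point is just that the parser builds no data structure of its own: it calls whatever was registered for each field number.

```python
from typing import Callable, Dict, Optional, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, new position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


class Handlers:
    """Hypothetical registry mapping field numbers to callbacks."""

    def __init__(self) -> None:
        self._by_field: Dict[int, Callable[[object], None]] = {}

    def set_handler(self, field_number: int, fn: Callable[[object], None]) -> None:
        self._by_field[field_number] = fn

    def get(self, field_number: int) -> Optional[Callable[[object], None]]:
        return self._by_field.get(field_number)


def parse(buf: bytes, handlers: Handlers) -> None:
    """Walk the wire format and dispatch each field to its handler."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        handler = handlers.get(field_number)
        if handler is not None:  # fields without a handler are skipped
            handler(value)


events = []
h = Handlers()
h.set_handler(1, lambda v: events.append(("id", v)))
h.set_handler(2, lambda v: events.append(("name", v.decode("utf-8"))))
parse(bytes.fromhex("08960112026869"), h)  # field 1 = 150, field 2 = "hi"
print(events)
```

What the handlers do with the values is entirely up to the caller: build an object, print text, or feed another encoder.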

But what if we could apply this basic idea to the use case of SAX? What if upb could parse XML, and call handlers that are specific to the XML data model? We could use .proto files to define an XML-specific schema. For SAX this might look like:

// Protocol buffer schema representing SAX
message StartElement {
  string element_name = 1;
  map<string, string> attributes = 2;
}

message EndElement {
  string element_name = 1;
}

message ProcessingInstruction {
  string target = 1;
  string data = 2;
}

message XMLContent {
  oneof content {
    StartElement start = 1;
    EndElement end = 2;
    bytes characters = 3;
    bytes ignorable_whitespace = 4;
    string skipped_entity = 5;
    ProcessingInstruction processing_instruction = 6;
  }
}

message XMLDocument {
  repeated XMLContent content = 1;
}

Now we can offer basically the same API as SAX, except the set of handlers is defined as a protobuf schema, instead of being hard-coded into the library.

What if we try the same idea with JSON?

message JsonArray {
  repeated JsonValue value = 1;
}

message JsonObject {
  map<string, JsonValue> properties = 1;
}

message JsonValue {
  oneof value {
    bool is_null = 1;
    bool boolean_value = 2;
    string string_value = 3;

    // Multiple options for numbers, since JSON doesn't specify
    // range/precision for numbers.
    double double_value = 4;
    int64  int64_value = 5;
    string number_value = 6;

    JsonObject object_value = 7;
    JsonArray array_value = 8;
  }
}

Now we can offer the same API as YAJL.
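Here is a hedged Python sketch of what events keyed by that schema could look like. It is not streaming (it leans on Python’s json module to parse first) and the event names for array/object boundaries and keys are my own invention for illustration; only the scalar names come from the JsonValue fields above.

```python
import json
from typing import Callable, List, Tuple

INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def emit(value, handler: Callable[[str, object], None]) -> None:
    """Walk a parsed JSON value, calling handler(field_name, payload)."""
    if value is None:
        handler("is_null", True)
    elif isinstance(value, bool):  # must be checked before int: bool is an int subclass
        handler("boolean_value", value)
    elif isinstance(value, int):
        if INT64_MIN <= value <= INT64_MAX:
            handler("int64_value", value)
        else:
            handler("number_value", str(value))  # too big for int64: pass as text
    elif isinstance(value, float):
        handler("double_value", value)
    elif isinstance(value, str):
        handler("string_value", value)
    elif isinstance(value, list):
        handler("start_array_value", None)  # hypothetical event name
        for item in value:
            emit(item, handler)
        handler("end_array_value", None)  # hypothetical event name
    elif isinstance(value, dict):
        handler("start_object_value", None)  # hypothetical event name
        for key, item in value.items():
            handler("property_key", key)  # hypothetical event name
            emit(item, handler)
        handler("end_object_value", None)  # hypothetical event name
    else:
        raise TypeError(f"not a JSON value: {type(value).__name__}")


events: List[Tuple[str, object]] = []
emit(json.loads('{"a": [1, 2.5, null, true, "x"]}'),
     lambda name, payload: events.append((name, payload)))
for event in events:
    print(event)
```

A real streaming parser would fire the same events as it scans the input, without materializing the whole document first.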

Or take HTTP, according to the handlers specified in http-parser:

message HTTPHead {
  uint32 http_major = 1;
  uint32 http_minor = 2;
  uint32 status_code = 3;

  enum Method {
    // proto3 requires the first enum value to be zero.
    DELETE = 0;
    GET = 1;
    HEAD = 2;
    PUT = 3;
    POST = 4;
    // ...
  }

  Method method = 4;
  string url = 5;

  map<string, string> headers = 6;
}

message HTTPBody {
  repeated bytes chunk = 1;
}

message HTTP {
  HTTPHead head = 1;
  HTTPBody body = 2;
}

Do you see where this is going? SAX, YAJL, and http-parser are great APIs. upb seeks to be a generalization of streaming parser APIs, by using a Protocol Buffer schema to define the set of handlers for each format.

You may be tempted to say “bloat.” But extending upb in this way doesn’t make the core any bigger than if it just supported Protocol Buffers. The core library already contains all the functionality for registering handlers to match a protobuf schema. We aren’t growing the core library at all to enable these other use cases.

And coming from the other direction, the core library of upb is only about 25kb of object code. It just contains data structures for representing schemas and handlers. For many libraries, like libxml or expat, 25kb is not a very noticeable overhead. For some exceptionally small libraries like YAJL or http-parser, 25kb would be a noticeable overhead, but in most cases the overhead is acceptable.

And what does this abstraction buy you? An easy way to expose your parser to multiple languages.

Take Ruby. Since the Google Protocol Buffers library for Ruby is implemented using upb, all of the handlers already exist to read and write Protocol Buffers objects. So if you had an HTTP parser that used upb handlers according to the handler schema above, it would be trivial to parse some HTTP into an HTTP Protocol Buffers object in Ruby. You can use the Protocol Buffer object as your parse tree / in-memory representation! Once you do that you can easily serialize it as a binary Protocol Buffer, or as JSON, if you want to preserve its parsed form. You could easily write a filter program in Ruby that iterates over a bunch of HTTP sessions and pulls out the URL. You have all these capabilities for how you can use the parser in Ruby, but you’ve had to do barely any Ruby-specific work to expose the parser to Ruby.

If you keep going down this path, you realize that this same idea could work for streaming string->string transformations (like gzip or SSL), and even for streaming aggregations like HyperLogLog. And the ultimate dream is that this all can compose nicely.

This is the vision which I am constantly working towards. But I’m also grounded by the need to deliver tangible things like something that Ruby and other language implementations can use right now.

Back to Planet Earth: Concrete Status

All of those lofty ideas are nice, but the devil is in the details. Here is the status of some of those details, in super-brief form:

protobuf encoder and decoder are fully complete (except for vending unknown fields, which I’d like to add at some point)
protobuf decoders (both pure-C and JIT) are completely rewritten. Pure-C decoder is now bytecode-based, and JIT is a code generator from that bytecode. Both decoders are fully-resumable, for when the input spans multiple buffers.
typed JSON parser and printer are pretty complete, following the JSON encoding conventions of proto3. “Typed JSON” means JSON that follows the structure of a protobuf schema.
upb now has a stable ABI; structure members and sizes are hidden in implementation files.
upb now supports fully-injectable memory allocation and error callbacks for parse-time structures. (Sharable, ahead-of-time structures like schemas and handlers still rely on a global malloc()).
there is now an “amalgamation” build where all upb sources are collapsed into a single .h/.c file pair, to make it easy to drop into another project’s build (the Ruby extension uses this).
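To illustrate what “fully-resumable” means in the list above, here is a tiny Python sketch of a varint decoder that can be suspended in the middle of a value when a buffer runs out and picked up again when the next buffer arrives. This is only a toy under my own naming (ResumableVarintDecoder is hypothetical) and says nothing about how upb’s bytecode decoder is actually structured.

```python
from typing import List


class ResumableVarintDecoder:
    """Decodes a stream of varints that may be split across input buffers."""

    def __init__(self) -> None:
        self._result = 0   # bits accumulated for the varint in progress
        self._shift = 0    # how many bits have been accumulated so far
        self.values: List[int] = []

    def feed(self, chunk: bytes) -> None:
        """Consume one buffer; state carries over to the next call."""
        for byte in chunk:
            self._result |= (byte & 0x7F) << self._shift
            if byte & 0x80:
                self._shift += 7  # continuation bit set: value not finished
            else:
                self.values.append(self._result)
                self._result = 0
                self._shift = 0

    def pending(self) -> bool:
        """True if the last buffer ended in the middle of a varint."""
        return self._shift != 0


decoder = ResumableVarintDecoder()
decoder.feed(b"\x96")          # first byte of 150; the buffer ends mid-value
mid_stream = decoder.pending()
decoder.feed(b"\x01\x05")      # rest of 150, then a complete 5
print(mid_stream, decoder.values)
```

The key property is that all parse state lives in the decoder object rather than on the call stack, so returning to the caller between buffers loses nothing.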

But there is still a lot to do:

need to split generic JSON encoder from typed JSON encoder. (Generic JSON encoder will use a schema like the above to represent arbitrary JSON, like YAJL, instead of being limited to a structured schema).
need to port to c89. This is a tough one for me. The title of this blog is “parsing, performance, minimalism with c99.” But practical experience is just making it clear that we’re still in a c89 world if you want maximum portability (and particularly the ability to integrate into other build systems).
need to tweak the handlers model to be even more composable. For example, it should be possible for the JSON encoder/decoder to internally use a generic upb base64 encoder/decoder to put binary data into string fields. It should also be easier to compose pipeline elements. We’re not quite to that level of composability yet.
need to implement a parser for .proto files, so we can parse them directly. Right now we can only load binary descriptors.
Expand to more formats! XML, HTTP, CSV, LevelDB log format, YAML, and more!
My most lofty vision: resurrect Gazelle, my parser generator, as an easy way of generating fast parsers that target upb natively. Think yacc/Bison/ANTLR, but more convenient, because you’ll be able to easily use Protocol Buffer in-memory objects as your parse tree / AST.
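The composability item in the list above is easier to picture with a toy. Here is a Python sketch of a streaming base64 stage that accepts byte chunks and forwards encoded text to whatever sink comes next in the pipeline. The class and method names (Base64Stage, put, end) are hypothetical, not anything in upb; the sketch only assumes that pipeline elements share a "push chunks downstream" shape.

```python
import base64
from typing import Callable, List


class Base64Stage:
    """Streaming base64 encoder that pushes text chunks to a downstream sink."""

    def __init__(self, sink: Callable[[str], None]) -> None:
        self._sink = sink
        self._leftover = b""  # at most 2 bytes held until a 3-byte group completes

    def put(self, chunk: bytes) -> None:
        data = self._leftover + chunk
        usable = len(data) - (len(data) % 3)  # encode only whole 3-byte groups
        if usable:
            self._sink(base64.b64encode(data[:usable]).decode("ascii"))
        self._leftover = data[usable:]

    def end(self) -> None:
        """Flush the final partial group (this is where '=' padding appears)."""
        if self._leftover:
            self._sink(base64.b64encode(self._leftover).decode("ascii"))
            self._leftover = b""


pieces: List[str] = []
stage = Base64Stage(pieces.append)  # the sink could be a JSON string printer
for chunk in (b"he", b"llo w", b"orld"):
    stage.put(chunk)
stage.end()
encoded = "".join(pieces)
print(encoded)
```

Because the stage only knows about its sink, the same element could sit in front of a JSON printer, a file writer, or another transform.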

So there is still a lot to do, but it’s exciting to see it really working for Ruby already.

The name

As you can see from what I described above, upb is outgrowing its name, since it’s no longer just about Protocol Buffers. I’ve been playing with calling it “Unleaded” instead, as a backronym for upb (get it?), but I can’t seem to get this to stick.