upb status and preliminary performance numbers
The last few weeks have been very exciting for upb. On April 1 I checked in a JIT compiler for parsing protobufs, which one might think was an April Fool’s joke, but it’s real and the performance numbers so far have exceeded my expectations.
Why JIT?
Before I get to the numbers, I should explain what it even means for upb to have a JIT. If you’re not interested in the technical details, feel free to skip this section to go straight to the impacts and what this means for upb.
So why a JIT? After all, protobufs are not an imperative language. You can’t write an algorithm in a .proto file. You can’t apply compiler techniques like SSA, strength reduction, tracing, or really any of the things you’d expect from a JIT on a platform like the JVM. There is no byte code or intermediate representation to speak of. So why would upb have a JIT and what on earth would it do?
To review, a .proto file defines a schema for some messages. In this sense, it is comparable to JSON Schema or XML Schema.
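For example, here is a small schema along the lines of the canonical Person example from the protobuf documentation (the exact fields here are just for illustration):

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
    }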
To support forwards and backwards compatibility, we don’t make any assumptions about which fields we’re going to see or in what order. Instead, on the wire each field is preceded by its tag number. Logically, the serialized version is something like:
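(A loose sketch for the Person example above, glossing over the byte-level varint encoding of tags and values:)

    tag: field 1 (name), wire type = length-delimited
    value: length 4, "Josh"
    tag: field 2 (id), wire type = varint
    value: 42
    tag: field 3 (email), wire type = length-delimited
    value: length 16, "josh@example.com"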
If you were going to write a parser for this, you might create a table of fields, keyed by field number, and a main parser loop that looks something like this:
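(The following is a rough, self-contained sketch in C, not upb’s actual code: it hard-codes the three Person fields, makes up its own helper names, and skips unknown-field handling and most error checking.)

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* The struct we parse into, matching the example Person schema. */
    typedef struct {
      char    name[64];
      int32_t id;
      char    email[64];
    } Person;

    enum { TYPE_INT32, TYPE_STRING };

    /* One table entry per field, keyed by field number. */
    typedef struct {
      uint32_t number;   /* field number from the .proto file */
      int      type;     /* declared type, used to interpret the value */
      size_t   offset;   /* where the value lives inside Person */
    } FieldDef;

    static const FieldDef person_fields[] = {
      {1, TYPE_STRING, offsetof(Person, name)},
      {2, TYPE_INT32,  offsetof(Person, id)},
      {3, TYPE_STRING, offsetof(Person, email)},
    };

    /* Decode one protobuf varint (7 bits per byte, high bit = "more"). */
    static const uint8_t *decode_varint(const uint8_t *p, const uint8_t *end,
                                        uint64_t *val) {
      uint64_t v = 0;
      int shift = 0;
      while (p < end && shift < 64) {
        uint8_t b = *p++;
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) { *val = v; return p; }
        shift += 7;
      }
      return NULL;  /* truncated or overlong varint */
    }

    /* Main loop: read a tag, look the field up by number, branch on the
     * wire type to decode the value, then store it per the field's type. */
    int parse_person(const uint8_t *p, const uint8_t *end, Person *msg) {
      while (p < end) {
        uint64_t tag, val;
        if (!(p = decode_varint(p, end, &tag))) return 0;
        uint32_t field_number = (uint32_t)(tag >> 3);
        uint32_t wire_type    = (uint32_t)(tag & 7);

        const FieldDef *f = NULL;
        for (size_t i = 0; i < sizeof(person_fields) / sizeof(*person_fields); i++)
          if (person_fields[i].number == field_number) f = &person_fields[i];
        if (!f) return 0;  /* a real parser would skip unknown fields */

        switch (wire_type) {
          case 0:  /* varint */
            if (!(p = decode_varint(p, end, &val))) return 0;
            if (f->type == TYPE_INT32)
              *(int32_t *)((char *)msg + f->offset) = (int32_t)val;
            break;
          case 2:  /* length-delimited: length prefix, then that many bytes */
            if (!(p = decode_varint(p, end, &val))) return 0;
            if (val >= 64 || (uint64_t)(end - p) < val) return 0;
            if (f->type == TYPE_STRING) {
              char *dst = (char *)msg + f->offset;
              memcpy(dst, p, (size_t)val);
              dst[val] = '\0';
            }
            p += (size_t)val;
            break;
          default:  /* fixed32/fixed64/groups omitted from this sketch */
            return 0;
        }
      }
      return 1;
    }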
We call this a “table-based parser.” We look up the type in a table of fields, and use that to branch to the correct value parsing function. There are minor variations on this, like using a function pointer instead of a switch statement, but it’s the same general idea. To see an actual program that implements this style of parser, check out my bare-bones 100-line protobuf parser that I wrote a few years ago: pb.c.
In a way this resembles a byte-code interpreter, and the serialized protobuf resembles byte-code. For example, see the main loop of the Lua interpreter. It has a similar pattern where it extracts an opcode and uses that to branch into a giant switch statement.
It turns out that if we generate code that is specific to one message type, and takes advantage of the fact that fields are usually encoded in order, we can beat this table-based parser significantly. Google’s main “protobuf” software has done this for a long time: their strategy is to generate C++ classes that are specific to a proto message type (like Person in the above example). This code has been highly tuned and optimized, and the generated parsing code looks something like this:
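(Below is a simplified sketch of the shape of that generated code for the Person message above. The real generated code has more error handling and wire-type checks, but the overall structure, a switch on field numbers with a fast-path tag check at the end of each case, is the same.)

    bool Person::MergePartialFromCodedStream(
        ::google::protobuf::io::CodedInputStream* input) {
      ::google::protobuf::uint32 tag;
      while ((tag = input->ReadTag()) != 0) {
        switch (::google::protobuf::internal::WireFormatLite::GetTagFieldNumber(tag)) {
          case 1: {  // required string name = 1;
            // ... read a length-delimited string into name_ ...
            if (input->ExpectTag(16)) goto parse_id;     // 16 == tag of field 2 (id)
            break;
          }
          case 2: {  // required int32 id = 2;
           parse_id:
            // ... read a varint into id_ ...
            if (input->ExpectTag(26)) goto parse_email;  // 26 == tag of field 3 (email)
            break;
          }
          case 3: {  // required string email = 3;
           parse_email:
            // ... read a length-delimited string into email_ ...
            if (input->ExpectAtEnd()) return true;
            break;
          }
          default: {
            // ... handle unknown fields ...
            break;
          }
        }
      }
      return true;
    }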
This code also has a big switch() statement at the top level, but the targets are field numbers instead of wire types. More importantly, each field’s case has a line like this:
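(Using the Person schema above, the case for the id field would end with something like this:)

    if (input->ExpectTag(26)) goto parse_email;  // 26 == tag of the next field in the schema (email)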
This is key, because if the fields occur in order (which they usually do) then this branch will always be taken and it will be predicted correctly by the CPU. This can make a huge difference.
While this kind of C++ code generation has benefited Google’s protobuf software in speed, I always found it inconvenient from a dynamic languages perspective. What dynamic language user wants to have to generate C++ code and link that into their interpreter? We’re not talking about a single C++ extension that you compile once: every single message you define in a .proto file generates different C++ code. So you can’t just apt-get install libprotobuf-python and be done with it; you have to generate C++ for every message that your specific app wants to use.
This is where the JIT comes in. If we can generate machine code at runtime, we can get the speed benefits of the generated code without having to generate, compile, and link C++ into our interpreters or VMs. You compile the library once; after that you can dynamically load message definitions and still get the fastest possible parsing.
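If you haven’t seen runtime code generation before, here is a tiny, self-contained toy example of the basic mechanism on x86-64 Linux. It has nothing to do with upb’s actual JIT; it just shows what “generating machine code at runtime” means:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
      /* x86-64 machine code for a function that does: mov eax, 42; ret */
      unsigned char code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

      /* Ask the OS for memory we are allowed to execute. */
      void *mem = mmap(NULL, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED) return 1;

      memcpy(mem, code, sizeof(code));       /* emit the code at runtime */
      int (*fn)(void) = (int (*)(void))mem;  /* treat the bytes as a function */
      printf("%d\n", fn());                  /* prints 42 */

      munmap(mem, sizeof(code));
      return 0;
    }

A parsing JIT does the same kind of thing, except the bytes it emits implement a parser specialized to a particular message definition.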
Preliminary results
I knew all of the theory I just described before I started writing a JIT. But theory is just that – theory. What would actually happen when I implemented a JIT?
The results exceeded my expectations. I am still being somewhat cautious, because there are so many dimensions to any performance equation, and because benchmarks are not guaranteed to correspond to real-world performance. That said, here are my preliminary results. You can reproduce them by obtaining upb from GitHub and running make benchmarks followed by make benchmark (you have to define -DUPB_USE_JIT_X64 to get the JIT).
[Ed (May 2015). In retrospect, the numbers above had more to do with upb’s data structure than the speed of the parsing code. The proto2 implementation of Clear() recurses over the entire message tree, whereas upb was using a data structure that avoided this. upb no longer includes any data structures at all, preferring to be just a stream parser. upb’s JIT, though, is speed-competitive with protobuf’s highly-optimized generated C++ parsing code.]
The results above are designed to be as apples-to-apples as possible. In other words, I disabled upb’s optimization that avoids memcpy(), because Google’s protobuf doesn’t support it in its open-source release. I think the reason that even my table-based parser beats Google’s generated code here is that proto2 implements Clear() in a sub-optimal way that requires an extra pass over the message tree; see my comment in upb_msg.h for more info.
Things get even better if we allow ourselves to drop the constraint of being apples-to-apples with proto2. upb is capable of stream-based parsing like SAX, whereas proto2 only supports a DOM-based approach where you parse into a data structure. If we include the performance numbers for stream parsing, and for DOM-based parsing that avoids memcpy, we get:
[Ed (May 2015). These numbers do accurately reflect how fast upb’s JIT parser can be with streams. However, protobuf is benchmarking faster these days; the speed advantage of upb’s stream parser over protobuf is now more in the realm of 2x.]
If you’re like me when I first saw these numbers, your jaw is on the floor at seeing almost 1.5GB/s doing stream parsing of protocol buffers. At this point we are 5x proto2’s highly optimized generated code.
upb’s JIT isn’t complete and can’t handle all cases, but these performance numbers should still be valid because these benchmarks only used the parts of the format that are currently supported.
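To make the stream-vs-DOM distinction above concrete, here is a rough sketch of what a SAX-style callback interface for protobuf parsing could look like. This is purely illustrative and is not upb’s actual API:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical SAX-style handlers: the parser calls these as it scans
     * the input and never builds a message object at all. */
    typedef struct {
      void (*on_int32)(void *closure, uint32_t field_number, int32_t value);
      void (*on_string)(void *closure, uint32_t field_number,
                        const char *data, size_t len);
      void (*on_submessage_start)(void *closure, uint32_t field_number);
      void (*on_submessage_end)(void *closure);
    } StreamHandlers;

A DOM-style API, by contrast, parses the whole input into an in-memory message first and lets you inspect it afterwards; with a stream interface, an application that only cares about a couple of fields never pays for building (or clearing) a whole message tree.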
So when can I USE it? And can I help?
One reason I’ve avoided posting these extremely positive results so far is that I hate to get people excited about something that isn’t ready for users yet.
Adding support for the JIT required making extremely large and intrusive changes to the core interfaces, like this 3000-line refactoring that had to be completed before I could write even the first line of the JIT. I still need the flexibility to change core interfaces like this, because the design is still converging. So as much as I wish I could say it’s ready to go, I still need to hold off on a real release.
The good news is that the design is still making huge strides. In the last few weeks I’ve been refining my scheme for how upb will integrate into different VMs and language runtimes, and I feel more confident than ever that the language bindings for Lua, Python, etc. will be some of the fastest extensions ever offered for this kind of functionality.
As a preview of where this is going, I think that upb will even be usable as a JSON library, offering speed as good as or better than any existing JSON library. JSON can be directly mapped onto a subset of protobufs (JSON only has double-precision numbers), and the protobuf text format is already extremely similar to JSON. So all the work I’ve done to optimize memory layout and dynamic language object access should apply.
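As a rough illustration of that mapping (reusing the Person fields from the earlier example), a JSON document like this:

    {"name": "Josh", "id": 42, "email": "josh@example.com"}

carries the same information as a Person message with its name, id, and email fields set.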
And while I’m really happy to get offers to help out, it’s still at a stage of design where I need to be doing most of the work. Working on upb generally involves pacing around my apartment deep in thought about all the requirements and use cases I want to satisfy and brainstorming a million different approaches until I converge on the one that is the smallest, fastest, and most flexible. It’s hard work but I love it, and the more time that passes the more convinced I am that this is going to be big.