upb status and preliminary performance numbers
The last few weeks have been very exciting for upb. On April 1 I checked in a JIT compiler for parsing protobufs, which one might think was an April Fool’s joke, but it’s real and the performance numbers so far have exceeded my expectations.
Why JIT?
Before I get to the numbers, I should explain what it even means for upb to have a JIT. If you’re not interested in the technical details, feel free to skip this section to go straight to the impacts and what this means for upb.
So why a JIT? After all, protobufs are not an imperative language. You can’t write an algorithm in a .proto file. You can’t apply compiler techniques like SSA, strength reduction, tracing, or really any of the things you’d expect from a JIT on a platform like the JVM. There is no byte code or intermediate representation to speak of. So why would upb have a JIT and what on earth would it do?
To review, a .proto file defines a schema for some messages. In this sense, it is comparable to JSON Schema or XML Schema.
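For example, here is a small schema along the lines of the canonical Person example from the protobuf documentation (the exact fields here are just for illustration):

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
    }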
To support forwards and backwards compatibility, we don’t make any assumptions about which fields we’re going to see or in what order. Instead, on the wire each field is preceded by its tag number. Logically, the serialized version is something like:
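(A loose sketch for the Person example above, glossing over the byte-level varint encoding of tags and values:)

    tag: field 1 (name), wire type = length-delimited
    value: length 4, "Josh"
    tag: field 2 (id), wire type = varint
    value: 42
    tag: field 3 (email), wire type = length-delimited
    value: length 16, "josh@example.com"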
If you were going to write a parser for this, you might create a table of fields, keyed by field number, and a main parser loop that looks something like this:
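(The following is a rough, self-contained sketch in C, not upb’s actual code: it hard-codes the three Person fields, makes up its own helper names, and skips unknown-field handling and most error checking.)

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* The struct we parse into, matching the example Person schema. */
    typedef struct {
      char    name[64];
      int32_t id;
      char    email[64];
    } Person;

    enum { TYPE_INT32, TYPE_STRING };

    /* One table entry per field, keyed by field number. */
    typedef struct {
      uint32_t number;   /* field number from the .proto file */
      int      type;     /* declared type, used to interpret the value */
      size_t   offset;   /* where the value lives inside Person */
    } FieldDef;

    static const FieldDef person_fields[] = {
      {1, TYPE_STRING, offsetof(Person, name)},
      {2, TYPE_INT32,  offsetof(Person, id)},
      {3, TYPE_STRING, offsetof(Person, email)},
    };

    /* Decode one protobuf varint (7 bits per byte, high bit = "more"). */
    static const uint8_t *decode_varint(const uint8_t *p, const uint8_t *end,
                                        uint64_t *val) {
      uint64_t v = 0;
      int shift = 0;
      while (p < end && shift < 64) {
        uint8_t b = *p++;
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) { *val = v; return p; }
        shift += 7;
      }
      return NULL;  /* truncated or overlong varint */
    }

    /* Main loop: read a tag, look the field up by number, branch on the
     * wire type to decode the value, then store it per the field's type. */
    int parse_person(const uint8_t *p, const uint8_t *end, Person *msg) {
      while (p < end) {
        uint64_t tag, val;
        if (!(p = decode_varint(p, end, &tag))) return 0;
        uint32_t field_number = (uint32_t)(tag >> 3);
        uint32_t wire_type    = (uint32_t)(tag & 7);

        const FieldDef *f = NULL;
        for (size_t i = 0; i < sizeof(person_fields) / sizeof(*person_fields); i++)
          if (person_fields[i].number == field_number) f = &person_fields[i];
        if (!f) return 0;  /* a real parser would skip unknown fields */

        switch (wire_type) {
          case 0:  /* varint */
            if (!(p = decode_varint(p, end, &val))) return 0;
            if (f->type == TYPE_INT32)
              *(int32_t *)((char *)msg + f->offset) = (int32_t)val;
            break;
          case 2:  /* length-delimited: length prefix, then that many bytes */
            if (!(p = decode_varint(p, end, &val))) return 0;
            if (val >= 64 || (uint64_t)(end - p) < val) return 0;
            if (f->type == TYPE_STRING) {
              char *dst = (char *)msg + f->offset;
              memcpy(dst, p, (size_t)val);
              dst[val] = '\0';
            }
            p += (size_t)val;
            break;
          default:  /* fixed32/fixed64/groups omitted from this sketch */
            return 0;
        }
      }
      return 1;
    }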
We call this a “table-based parser.” We look up the type in a table of fields, and use that to branch to the correct value parsing function. There are minor variations on this, like using a function pointer instead of a switch statement, but it’s the same general idea. To see an actual program that implements this style of parser, check out my bare-bones 100-line protobuf parser that I wrote a few years ago: pb.c.
In a way this resembles a byte-code interpreter, and the serialized protobuf resembles byte-code. For example, see the main loop of the Lua interpreter. It has a similar pattern where it extracts an opcode and uses that to branch into a giant switch statement.
It turns out that if we generate code that is specific to one message type, and takes advantage of the fact that fields are usually encoded in order, we can beat this table-based parser significantly. Google’s main “protobuf” software has done this for a long time: their strategy is to generate C++ classes that are specific to a proto message type (like Person in the above example). This code has been highly tuned and optimized, and the generated parsing code looks something like this:
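(Below is a simplified sketch of the shape of that generated code for the Person message above. The real generated code has more error handling and wire-type checks, but the overall structure, a switch on field numbers with a fast-path tag check at the end of each case, is the same.)

    bool Person::MergePartialFromCodedStream(
        ::google::protobuf::io::CodedInputStream* input) {
      ::google::protobuf::uint32 tag;
      while ((tag = input->ReadTag()) != 0) {
        switch (::google::protobuf::internal::WireFormatLite::GetTagFieldNumber(tag)) {
          case 1: {  // required string name = 1;
            // ... read a length-delimited string into name_ ...
            if (input->ExpectTag(16)) goto parse_id;     // 16 == tag of field 2 (id)
            break;
          }
          case 2: {  // required int32 id = 2;
           parse_id:
            // ... read a varint into id_ ...
            if (input->ExpectTag(26)) goto parse_email;  // 26 == tag of field 3 (email)
            break;
          }
          case 3: {  // required string email = 3;
           parse_email:
            // ... read a length-delimited string into email_ ...
            if (input->ExpectAtEnd()) return true;
            break;
          }
          default: {
            // ... handle unknown fields ...
            break;
          }
        }
      }
      return true;
    }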
This code also has a big switch() statement at the top level, but the targets are field numbers instead of wire types. More importantly, each field’s case has a line like this:
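(Using the Person schema above, the case for the id field would end with something like this:)

    if (input->ExpectTag(26)) goto parse_email;  // 26 == tag of the next field in the schema (email)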
This is key, because if the fields occur in order (which they usually do) then this branch will always be taken and it will be predicted correctly by the CPU. This can make a huge difference.
While this kind of C++ code generation has benefited Google’s protobuf software in speed, I always found it inconvenient from a dynamic languages perspective. What dynamic language user wants to have to generate C++ code and link that into their interpreter? We’re not talking about a single C++ extension that you compile once: every single message you define in a .proto file generates different C++ code. So you can’t just apt-get install libprotobuf-python and be done with it; you have to generate C++ for every message that your specific app wants to use.
This is where the JIT comes in. If we can generate machine code at runtime, we can get the speed benefits of the generated code without having to generate, compile, and link C++ into our interpreters or VMs. You compile the library once; after that you can dynamically load message definitions and still get the fastest possible parsing.
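If you haven’t seen runtime code generation before, here is a tiny, self-contained toy example of the basic mechanism on x86-64 Linux. It has nothing to do with upb’s actual JIT; it just shows what “generating machine code at runtime” means:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
      /* x86-64 machine code for a function that does: mov eax, 42; ret */
      unsigned char code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

      /* Ask the OS for memory we are allowed to execute. */
      void *mem = mmap(NULL, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED) return 1;

      memcpy(mem, code, sizeof(code));       /* emit the code at runtime */
      int (*fn)(void) = (int (*)(void))mem;  /* treat the bytes as a function */
      printf("%d\n", fn());                  /* prints 42 */

      munmap(mem, sizeof(code));
      return 0;
    }

A parsing JIT does the same kind of thing, except the bytes it emits implement a parser specialized to a particular message definition.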
Preliminary results
I knew all of the theory I just described before I started writing a JIT. But theory is just that – theory. What would actually happen when I implemented a JIT?
The results exceeded my expectations. I am still being somewhat cautious, because there are so many dimensions to any performance equation, and because benchmarks are not guaranteed to correspond to real-world performance. That said, here are my preliminary results. You can reproduce them by obtaining upb from GitHub and running make benchmarks followed by make benchmark (you have to define -DUPB_USE_JIT_X64 to get the JIT).
[Ed (May 2015). In retrospect, the numbers above had more to do with upb’s data structure than the speed of the parsing code. The proto2 implementation of Clear() recurses over the entire message tree, whereas upb was using a data structure that avoided this. upb no longer includes any data structures at all, preferring to be just a stream parser. upb’s JIT, though, is speed-competitive with protobuf’s highly-optimized generated C++ parsing code.]
The results above are designed to be as apples-to-apples as possible. In other words, I disabled upb’s optimization that avoids memcpy(), because Google’s protobuf doesn’t support it in its open-source release. I think the reason that even my table-based parser beats Google’s generated code here is that proto2 implements Clear() in a sub-optimal way that requires an extra pass over the message tree; see my comment in upb_msg.h for more info.
Things get even better if we allow ourselves to drop the constraint of being apples-to-apples with proto2. upb is capable of stream-based parsing like SAX, whereas proto2 only supports a DOM-based approach where you parse into a data structure. If we include the performance numbers for stream parsing, and for DOM-based parsing that avoids memcpy, we get:
[Ed (May 2015). These numbers do accurately reflect how fast upb’s JIT parser can be with streams. However, protobuf is benchmarking faster these days; the speed advantage of upb’s stream parser over protobuf is now more in the realm of 2x.]
If you’re like me when I first saw these numbers, your jaw is on the floor at seeing almost 1.5GB/s doing stream parsing of protocol buffers. At this point we are 5x proto2’s highly optimized generated code.
upb’s JIT isn’t complete and can’t handle all cases, but these performance numbers should still be valid because these benchmarks only used the parts of the format that are currently supported.
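To make the stream-vs-DOM distinction above concrete, here is a rough sketch of what a SAX-style callback interface for protobuf parsing could look like. This is purely illustrative and is not upb’s actual API:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical SAX-style handlers: the parser calls these as it scans
     * the input and never builds a message object at all. */
    typedef struct {
      void (*on_int32)(void *closure, uint32_t field_number, int32_t value);
      void (*on_string)(void *closure, uint32_t field_number,
                        const char *data, size_t len);
      void (*on_submessage_start)(void *closure, uint32_t field_number);
      void (*on_submessage_end)(void *closure);
    } StreamHandlers;

A DOM-style API, by contrast, parses the whole input into an in-memory message first and lets you inspect it afterwards; with a stream interface, an application that only cares about a couple of fields never pays for building (or clearing) a whole message tree.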
So when can I USE it? And can I help?
One reason I’ve avoided posting these extremely positive results so far is that I hate to get people excited about something that isn’t ready for users yet.
Adding support for the JIT required making extremely large and intrusive changes to the core interfaces, like this 3000-line refactoring that had to be completed before I could write even the first line of the JIT. I still need the flexibility to change core interfaces like this, because the design is still converging. So as much as I wish I could say it’s ready to go, I still need to hold off on a real release.
The good news is that the design is still making huge strides. In the last few weeks I’ve been refining my scheme for how upb will integrate into different VMs and language runtimes, and I feel more confident than ever that the language bindings for Lua, Python, etc. will be some of the fastest extensions ever offered for this kind of functionality.
As a preview of where this is going, I think that upb will even be usable as a JSON library, offering speed as good as or better than any existing JSON library. JSON can be directly mapped onto a subset of protobufs (JSON only has double-precision numbers), and the protobuf text format is already extremely similar to JSON. So all the work I’ve done to optimize memory layout and dynamic language object access should apply.
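As a rough illustration of that mapping (reusing the Person fields from the earlier example), a JSON document like this:

    {"name": "Josh", "id": 42, "email": "josh@example.com"}

carries the same information as a Person message with its name, id, and email fields set.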
And while I’m really happy to get offers to help out, it’s still at a stage of design where I need to be doing most of the work. Working on upb generally involves pacing around my apartment deep in thought about all the requirements and use cases I want to satisfy and brainstorming a million different approaches until I converge on the one that is the smallest, fastest, and most flexible. It’s hard work but I love it, and the more time that passes the more convinced I am that this is going to be big.