upb status and preliminary performance numbers

The last few weeks have been very exciting for upb. On April 1 I checked in a JIT compiler for parsing protobufs, which one might think was an April Fool’s joke, but it’s real and the performance numbers so far have exceeded my expectations.

Why JIT?

Before I get to the numbers, I should explain what it even means for upb to have a JIT. If you’re not interested in the technical details, feel free to skip this section to go straight to the impacts and what this means for upb.

So why a JIT? After all, protobufs are not an imperative language. You can’t write an algorithm in a .proto file. You can’t apply compiler techniques like SSA, strength reduction, tracing, or really any of the things you’d expect from a JIT on a platform like the JVM. There is no byte code or intermediate representation to speak of. So why would upb have a JIT and what on earth would it do?

To review, a .proto file defines a schema for some messages. In this sense, it is comparable to JSON Schema or XML Schema.

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

To support forwards and backwards compatibility, we don’t make any assumptions about which fields we’re going to see or in what order. Instead, on the wire each field is preceded by its tag number. Logically, the serialized version is something like:

- field number: 1, wire type: varint, value: 1
- field number 2, wire type: delimited, value: "Josh Haberman"
- field number 3, wire type: delimited, value: "jhaberman@gmail.com"

If you were going to write a parser for this, you might create a table of fields, keyed by field number, and a main parser loop that looks something like this:

while (!done) {
  int type = fields[get_field_number()];
  switch (type) {
    case WIRE_TYPE_VARINT:
      parse_varint();
      break;
    case WIRE_TYPE_DELIMITED:
      parse_delimited();
      break;
    // etc.
  }
}

We call this a “table-based parser.” We look up the type in a table of fields, and use that to branch to the correct value parsing function. There are minor variations on this, like using a function pointer instead of a switch statement, but it’s the same general idea. To see an actual program that implements this style of parser, check out my bare-bones 100-line protobuf parser that I wrote a few years ago: pb.c.

In a way this resembles a byte-code interpreter, and the serialized protobuf resembles byte-code. For example, see the main loop of the Lua interpreter. It has a similar pattern where it extracts an opcode and uses that to branch into a giant switch statement.

It turns out that if we generate code that is specific to one message type, and takes advantage of the fact that fields are usually encoded in order, we can beat this table-based parser significantly. Google’s main “protobuf” software has done this for a long time: their strategy is to generate C++ classes that are specific to a proto message type (like Person in the above example). This code has been highly tuned and optimized, and the generated parsing code looks something like this:

while((tag = input->ReadTag()) != 0) {
  field_number = GetFieldNumber(tag);
  wire_type = GetWireType(tag);
  switch(tag) {
    // optional float field1 = 1;
    case 1:
      if (wire_type != FLOAT_WIRE_TYPE) goto error;
      input->ParseFloat();
      // Saves the switch() and type check branch if fields are in order.
      if (input->ExpectTag(2, DOUBLE_WIRE_TYPE)) goto parse_field2;
      break;

    // optional double field2 = 2;
    case 2:
      if (wire_type != DOUBLE_WIRE_TYPE) goto error;
parse_field2:
      input->ParseDouble();
      // saves the switch() and type check branch if fields are in order.
      if (input->ExpectTag(3, INT32_WIRE_TYPE)) goto parse_field3
      break;

    // optional int32 field3 = 3;
    case 3:
    // [...]
  }
}

This code also has a big switch() statement at the top level, but the targets are field numbers instead of wire types. More importantly, each field’s case has a line like this:

if (input->ExpectTag(3, INT32_WIRE_TYPE)) goto parse_field3

This is key, because if the fields occur in order (which they usually do) then this branch will always be taken and it will be predicted correctly by the CPU. This can make a huge difference.

While this kind of C++ code generation has benefited Google’s protobuf software in speed, I always found it inconvenient from a dynamic languages perspective. What dynamic language user wants to have to generate C++ code and link that into their interpreter? We’re not talking about a single C++ extension that you compile once: every single message you define in a .proto file generates different C++ code. So you can’t just apt-get install libprotobuf-python and be done with it, you have to generate C++ for every message that your specific app wants to use.

This is where the JIT comes in. If we can generate machine code at runtime, we can get the speed benefits of the generated code without having to generate, compile, and link C++ into our interpreters or VMs. You could compile the library once, and after that you can dynamically load message definitions but still get the fastest possible parsing.

Preliminary results

I knew all of the theory I just described before I started writing a JIT. But theory is just that — theory. What would actually happen when I implemented a JIT?

The results exceeded my expectations. I still am being somewhat cautious, because there are so many dimensions to any performance equation, and because benchmarks are not guaranteed to correspond to real-world performance. That said, here are my preliminary results. You can reproduce these by obtaining upb from GitHub and running make benchmarks followed by make benchmark (you have to define -DUPB_USE_JIT_X64 to get the JIT).

Parsing an 80k protobuf into a data structure repeatedly,
calling Clear() between each parse.  (proto2 == Google protobuf)

proto2 table-based parser      38 MB/s
proto2 generated code parser   265 MB/s
upb table-based parser         340 MB/s
upb JIT parser                 741 MB/s

The results above are designed to be as apples-to-apples as possible. In other words, I disabled upb’s optimization that avoids memcpy() because Google’s protobuf doesn’t support it in its open-source release. I think the reason that even my table-based parser beats Google’s generated code here is because proto2 implements Clear() in a sub-optimal way that requires an extra pass over the message tree; see my comment in upb_msg.h for more info.

Things get even better if we allow ourselves to drop the constraint of being apples-to-apples with proto2. upb is capable of stream-based parsing like SAX, whereas proto2 only supports a DOM-based approach where you parse into a data structure. If we include the performance numbers for stream parsing, and for DOM-based parsing that avoids memcpy, we get:

upb table-based stream parser    420 MB/s
upb JIT no-memcpy() DOM parser   870 MB/s
upb JIT stream parser           1430 MB/s

If you’re like me when I first saw these numbers, your jaw is on the floor at seeing almost 1.5GB/s doing stream parsing of protocol buffers. At this point we are 5x proto2′s highly optimized generated code.

upb’s JIT isn’t complete and can’t handle all cases, but these performance numbers should still be valid because these benchmarks only used the parts of the format that are currently supported.

So when can I USE it? And can I help?

One reason I’ve avoided posting these extremely positive results so far is that I hate to get people excited about something that’s still not ready for users yet.

Adding support for the JIT required making extremely large and intrusive changes to the core interfaces, like this 3000 line refactoring that had to be completed before I could write even the first line of the JIT. I need to have the flexibility to change core interfaces like this still, because the design is still converging. So as much as I wish I could say it’s ready to go, I still need to hold off on a real release.

The good news is that the design is still making huge strides. In the last few weeks I’ve been refining my scheme for how upb will integrate into different VM’s and language runtimes, and I feel more confident than ever that the language bindings for Lua, Python, etc. will be some of the fastest extensions ever offered for this kind of functionality.

As a preview for where this is going, I think that upb will even be usable as a JSON library, offering speed that is as good or greater than any existing JSON libraries. JSON can be directly mapped onto a subset of protobufs (JSON only uses double-precision numbers) and the protobuf text format is already extremely similar to JSON. So all the work I’ve done to optimize memory layout and dynamic language object access should apply.

And while I’m really happy to get offers to help out, it’s still at a stage of design where I need to be doing most of the work. Working on upb generally involves pacing around my apartment deep in thought about all the requirements and use cases I want to satisfy and brainstorming a million different approaches until I converge on the the one that is the smallest, fastest, and most flexible. It’s hard work but I love it, and the more time that passes the more convinced I am that this is going to be big.

Posted in upb | 7 Comments

ARM architecture is mired in oppressive legalese

My Efika MX Smarttop came and I’ve had some fun compiling upb for it and trying it out. I don’t have a pressing ARM use case yet, but I want to work ahead a bit and get familiar with the architecture so that when I do want to program for it I’m not starting from scratch.

I went to the ARM website to get my hands on some documentation. I was prepared to buy physical books if that is their authoritative documentation, but it appears that they have PDFs for the architecture’s reference manuals.

So far so good — this is on par with Intel which has a nice and easily accessible website where you can download any of their manuals in like 5 seconds. I’ve downloaded them and referenced them a countless number of times.

But I get no such love from ARM. Strike 1: the ARM website won’t let you download the manuals unless you have registered first. Not cool ARM, not cool. I grudgingly give them my name, email address, company name, country, and state. And the website warns:

Note: We recommend using your business email address to ensure you can access all your relevant services

This is a vague but ominous warning that if you use a personal email address the registration might not work correctly.

But fine, I go through with the registration. Now I can download the manual, right? Turns out no: first I have to accept a EULA! It begins:

USER AGREEMENT FOR THE ARM ARCHITECTURE REFERENCE MANUAL

THIS AGREEMENT (” AGREEMENT “) IS A LEGAL AGREEMENT BETWEEN YOU (EITHER A SINGLE INDIVIDUAL, OR SINGLE LEGAL ENTITY) AND ARM LIMITED (“ARM”) FOR THE USE OF THE ARM ARCHITECTURE REFERENCE MANUAL. ARM IS ONLY WILLING TO PROVIDE ACCESS TO THE ARM ARCHITECTURE REFERENCE MANUAL TO YOU ON CONDITION THAT YOU ACCEPT ALL OF THE TERMS IN THIS AGREEMENT. BY CLICKING “I AGREE” OR BY DOWNLOADING OR OTHERWISE COPYING THE DELIVERABLES YOU INDICATE THAT YOU AGREE TO BE BOUND BY ALL THE TERMS OF THIS LICENCE.

This is all just to read some documentation. Not impressed. ARM, We’re off to a rocky start.

Posted in Uncategorized | Leave a comment

EINTR and PC loser-ing (The “Worse Is Better” case study)

Richard Gabriel’s 1989 essay Worse Is Better is a famous comparison between LISP and Unix/C that pops up from time to time and is guaranteed to spark a spirited discussion. The philosophical argument itself is not something I want to get into right now; I am interested in the technical content of the essay. What always bothered me about this paper is that I never fully understood Gabriel’s primary example of a dirty hack vs. “the right thing.”

His example is “the PC loser-ing problem,” which he describes thus:

Two famous people, one from MIT and another from Berkeley (but working on Unix) once met to discuss operating system issues. The person from MIT was knowledgeable about ITS (the MIT AI Lab operating system) and had been reading the Unix sources. He was interested in how Unix solved the PC loser-ing problem. The PC loser-ing problem occurs when a user program invokes a system routine to perform a lengthy operation that might have significant state, such as IO buffers. If an interrupt occurs during the operation, the state of the user program must be saved. Because the invocation of the system routine is usually a single instruction, the PC of the user program does not adequately capture the state of the process. The system routine must either back out or press forward. The right thing is to back out and restore the user program PC to the instruction that invoked the system routine so that resumption of the user program after the interrupt, for example, re-enters the system routine. It is called “PC loser-ing” because the PC is being coerced into “loser mode,” where “loser” is the affectionate name for “user” at MIT.

The MIT guy did not see any code that handled this case and asked the New Jersey guy how the problem was handled. The New Jersey guy said that the Unix folks were aware of the problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again. The MIT guy did not like this solution because it was not the right thing.

The New Jersey guy said that the Unix solution was right because the design philosophy of Unix was simplicity and that the right thing was too complex.

When I read this I always had a burning desire to know: how did the story end? How do modern operating systems resolve this problem — the “dirty hack” way or the “right way?” What part of our modern POSIX interfaces are affected by this question?

There are several things that never made sense to me about this example. First of all, why would you need to abort a system call just because an interrupt occurred? I investigated the Linux source and it seems quite clear that interrupt handlers can return to either the kernel or userspace — whichever was running when the interrupt fired. So I don’t see why you’d need to “coerce” the system into “loser mode” at all.

But let’s suppose you accept this as a given — we will assume that when a hardware interrupt occurs, you must exit to user mode. I still don’t see the difficulty in automatically re-invoking the system call. It’s true that invoking the system routine is a single instruction, but why is it that “the PC of the user program does not adequately capture the state of the process,” as Gabriel’s essay states? What other process state do we need to capture? The registers must already be saved when the syscall is entered, because they must be restored even with a completely normal syscall return. So if we want to re-invoke the system routine, it should be as easy as simply re-executing the instruction that made the system call. Right?

The whole example confused me quite a lot until I had the idea to replace “interrupt” in the above description with “signal.” This is not such a stretch, since signals are essentially user-space software interrupts. With this small change, everything started to make a lot more sense. If a signal was delivered to a process that was currently inside a system call, that signal handler could invoke a system call itself, which would cause us to re-enter the kernel. I could easily see how the complexity of dealing with this could have led early UNIX implementors to simply abort the original system call before delivering the signal.

But this is only speculation about what UNIX was like in the mid to late 80s when “Worse is Better” was written. I could be completely off the mark in this analysis — maybe returning to the kernel from a hardware interrupt handler really wasn’t implemented at that time. Or maybe saving user state really was difficult for some reason. I’d love to hear from anyone who has more historical context about this. But the essay contains an important clue that seems to reinforce my speculation that it’s actually about signals.

UNIX and EINTR

If we look closely at the “Worse is Better” essay, we get a strong clue about what the Unix guy in the story might have been talking about:

The New Jersey guy said that the Unix folks were aware of the problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again.

As someone who has done a lot of Unix system-level programming, this sounds to me like it must be describing EINTR, the error code in Unix that means “Interrupted system call.” To give a quick description of EINTR I’ll enlist the help of my trusty copy of “Advanced Programming in the Unix Environment” by W. Richard Stevens:

A characteristic of earlier UNIX systems is that if a process caught a signal while the process was blocked in a “slow” system call, the system call was interrupted. The system call returned an error and errno was set to EINTR. This was done under the assumption that since a signal occurred and the process caught it, there is a good chance that something has happened that should wake up the blocked system call.

[...]

The problem with interrupted system calls is that we now have to handle the error return explicitly. The typical code sequence (assuming a read operation and assuming that we want to restart the read even if it’s interrupted) would be

again:
  if ((n = read(fd, buf, BUFFSIZE)) < 0) {
    if (errno == EINTR)
      goto again;  /* just an interrupted system call */
    /* handle other errors */
  }

This sounds an awful lot like the the New Jersey guy's approach from the story, which required a correct program "to check the error code to determine whether to simply try the system routine again." And there's nothing else in Unix that I've ever heard of that's anything like this. This must be what the New Jersey guy from the story was talking about!

But note that in W. Richard Stevens' explanation this isn't some dirty hack! It's not a case of cutting corners that is justified by favoring implementation simplicity over interface simplicity. Stevens describes it as a deliberate design decision that gives users the capability to abort a long-running system call if you catch a signal in the meantime. Now you could easily see this as a rationalization of a dirty hack ("it's not a bug, it's a feature!"), but it certainly seems plausible that if you catch a signal while you're blocked on a long system call, the signal might make you decide that you don't want to wait for the long system call any more. Indeed, Ulrich Drepper claimed in 2000 that "Returning EINTR is necessary for many applications," though it would have been helpful if he had expanded on this point by giving some examples.

Of course, the price we have paid for this capability is that we have to wrap all of our potentially long system calls in a loop like the example above. If we don't, our system calls can start failing and causing program errors whenever we catch a signal. You may think that you don't use any signals yourself, but are you sure that none of your libraries do? On the flip side, if you're implementing a library you can never know if the main application will use signals or not, so any library that wants to be robust will have to wrap these system calls in a retry loop.

Since the vast majority of programs will always want their system calls to continue even when a signal is received, 4.2BSD (released in 1983) implemented support for automatically retrying most system calls that could previously fail with EINTR. To me this sounds exactly like what the MIT guy in Richard Gabriel's story was saying is "the right thing." In other words, Berkeley UNIX was already doing "the right thing" five years before "Worse is Better" was written!

Modern POSIX APIs allow both behaviors (either restarting the system call automatically or returning EINTR) -- this is controlled by the SA_RESTART flag. The following program illustrates both behaviors:


#include 
#include 
#include 
#include 
#include 

void doread() {
  char buf[128];
  printf("doing read() into buf %p\n", buf);
  ssize_t ret = read(STDIN_FILENO, buf, sizeof(buf));
  if (ret < 0) {
    printf("read() for buf %p returned error: %s\n", buf, strerror(errno));
  } else {
    printf("read() for buf %p returned data: %.*s", buf, (int)ret, buf);
  }
}

void sighandler(int signo) {
  printf("received signal %d\n", signo);
  doread();
}

int main(int argc, char *argv[]) {
  // Register SIGHUP handler.  Pass any argument to get SA_RESTART.
  struct sigaction action;
  action.sa_handler = &sighandler;
  sigemptyset(&action.sa_mask);
  action.sa_flags = (argc > 1) ? SA_RESTART : 0;
  sigaction(SIGHUP, &action, NULL);

  doread();
  return 0;
}

Here are the results of running the program three different times. I've bolded the parts where I typed to give the program input on stdin. You can also see where I sent the program a SIGHUP.

$ ./test
doing read() into buf 0x7ec7959c
INPUT FROM TERMINAL
read() for buf 0x7ec7959c returned data: INPUT FROM TERMINAL
$ ./test
doing read() into buf 0x7ef6659c
received signal 1
doing read() into buf 0x7ef66204
INPUT FROM TERMINAL
read() for buf 0x7ef66204 returned data: INPUT FROM TERMINAL
read() for buf 0x7ef6659c returned error: Interrupted system call
$ ./test give_me_sa_restart
doing read() into buf 0x7eb7657c
received signal 1
doing read() into buf 0x7eb761e4
INPUT FROM TERMINAL
read() for buf 0x7eb761e4 returned data: INPUT FROM TERMINAL
INPUT FROM TERMINAL AGAIN
read() for buf 0x7eb7657c returned data: INPUT FROM TERMINAL AGAIN

Conclusion

You might ask "why all the fuss over a little example?" As I mentioned, my primary motivation in researching all of this was to get to the bottom of this issue and understand how it plays out in modern operating systems.

But if we were going to take all of this information and reflect on the "Worse is Better" argument, my personal observations/conclusions would be:

  • The "worse" system (Unix) did indeed do "the right thing" eventually, even if it didn't at first. "Worse is better" systems incrementally improve by responding to user needs. Since users got tired of checking for EINTR, the "worse" system added the functionality for addressing this pain point.
  • The whole thing did leave a rather large wart, though -- all Unix programs have to wrap these system calls in an EINTR retry loop unless they can be absolutely sure the process will never catch signals that don't have SA_RESTART set. So there is a price to pay for this incremental evolution.
Posted in Uncategorized | 4 Comments

Addicted to hardware: new toy on the way

I’m addicted to hardware. I can’t stop thinking about all of the CPUs that currently exist, how they compare to each other, and how to write the fastest possible code on them. (Actually I want to learn FPGA programming too, in case they ever start bundling FPGAs with computers).

A year and a half ago I bought a SheevaPlug which is a little ARM computer that’s hardly bigger than a wall wart AC adapter. It has USB and Ethernet and that’s about it (no display). Unfortunately when I tried to start playing with it again tonight I discovered it was bricked thanks to a faulty power supply, which is apparently a very common problem with SheevaPlugs.

I was a bit sad about this, and I started looking for alternatives. I wanted something that runs on a non-x86 architecture and that I could stick in a closet and SSH to. I discovered that there is a new form factor of computers known as a nettop. I found a nettop that appears to fulfill my wishes perfectly: the EFIKA MX Smarttop made by a company called Genesi. It’s surprisingly capable for $129: it’s powered by an ARM Cortex-A8 800MHz CPU, has 512MB RAM, 8GB of flash, Ethernet, WiFi, Bluetooth, USB, and 720p video output through HDMI (with hardware-accelerated video decoding). Not bad for $129! And it has a maximum power consumption of 15W.

So I ordered one — will look forward to seeing if it is really all it’s stacked up to be!

Posted in Uncategorized | 4 Comments

When a compiler’s slow code actually bites you

A few days ago I posted GCC: the impressive and the disappointing where I looked at some cases where GCC produces not-quite-optimal code. One of the comments on that post was (emphasis mine):

So, it seems like there is a much better way to give the compiler a shot at doing the right thing: [snip suggestion]. I think you will find the compiler will generate quite efficient code in this case, particularly if you look at the real execution overhead, rather than what the assembler looks like.

This is a common attitude I encounter when I am discussing my attempts to optimize my protocol buffer decoding library upb. Programmers love to tell other programmers that they are prematurely optimizing, and most of the time they’re right. I’m sure to some people it seems ludicrous that I would be looking at assembly language output to determine whether it is efficient enough. For 99.99% of programs, it would be. But I’m working in one of those rare domains where it actually matters. And today I encountered pretty convincing evidence that the compiler’s bad code is actually affecting me.

The compiler’s bad code in this case is an example of a bug I previously filed on GCC: struct returned by value generates useless stores. Though I had previously observed that bug only by inspecting assembly language output, today I had it show up on an actual profile as clear as day. Here is a screenshot from Shark (click to get full-size):

Screenshot from Apple Shark showing the bad code.

To summarize, the compiler took the code:

typedef struct {
  upb_flow_t flow;  // An enum defined elsewhere.
  void *closure;
} upb_sflow_t;

upb_flow_t upb_dispatch_startsubmsg([...]) {
  // [...]
  upb_sflow_t sflow = f->cb.startsubmsg([...]);
  if (sflow.flow != UPB_CONTINUE) {
    // [...]
  }

…and turned that function call/test into this awful machine code (here in its Intel-syntax form):

  call   QWORD PTR [r12 + 16]
  mov    DWORD PTR [rbp - 64], eax
  mov    QWORD PTR [rbp - 56], rdx
  mov    rax, QWORD PTR [rbp - 64]    ; loads rax with data it already has.
  mov    QWORD PTR [rbp - 48], rax    ; stores rax into the stack a second time.
  mov    QWORD PTR [rbp - 40], rdx    ; stores rdx into the stack a second time.
  mov    edx, DWORD PTR [rbp - 48]    ; loads edx with data already in rax.
  testl  edx, edx

..and then (this is the important part) in an actual profile it shows up as being 43.4% of the execution time of a hot function in my program.

This is not a slam against the GCC developers. GCC is a big and complex piece of software, and they have to prioritize all sorts of different bugs, feature requests, new hardware, etc.

This is just a reminder to those who jump to dare-I-say “premature” conclusions about what is premature optimization: some of us really are working in domains where things like virtual function overhead, branch predictability, and the efficiency of the compiler’s code make a difference.

Posted in Uncategorized | 1 Comment