Refcounting immutable cyclic graphs

Cycles are a thorn in the side of refcounting. But yesterday I discovered (or perhaps rediscovered) a simple, efficient scheme for refcounting cyclic graphs if the defs are immutable: find strongly-connected components and make all nodes in each SCC share a refcount. Beautiful.

This relies on the graph theory result that if each SCC in a graph is replaced with a single node, the resulting graph forms a directed acyclic graph.

I’m planning to use this scheme to refcount defs in upb (garbage-collection is a non-starter because I don’t want to have to track a global list of roots or force the client to periodically call a GC function).

Posted in Uncategorized | 4 Comments

Making Knuth’s wish come true: the x32 ABI

Several years ago (though I can’t say exactly how many since it’s not dated) Knuth made the following complaint:

A Flame About 64-bit Pointers

It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.

The gcc manpage advertises an option “-mlong32″ that sounds like what I want. Namely, I think it would compile code for my x86-64 architecture, taking advantage of the extra registers etc., but it would also know that my program is going to live inside a 32-bit virtual address space.

Unfortunately, the -mlong32 option was introduced only for MIPS computers, years ago. Nobody has yet adopted such conventions for today’s most popular architecture. Probably that happens because programs compiled with this convention will need to be loaded with a special version of libc.

Please, somebody, make that possible.

I always thought this made a lot of sense. People have asked distro-makers for this before without a lot of success, but it looks like this is now being worked on by high-profile people in the Linux community. It is called The x32 ABI (see the LWN coverage for a more digestible description). It’s exciting because in some benchmarks this can outperform the x86-64 ABI by 10% or more. It’s a tradeoff — if you don’t need to address more than 4GB of memory, you can get faster programs because smaller pointers have better cache utilization. You’ll use less memory too.

This could have been done in a way that operated nearly the same as “compatibility mode” (ie. running 32-bit binaries on a 64-bit CPU/OS), which would have required only minimal changes to the kernel/toolchain. But it looks like their plans are more ambitious: they want to be able to use the optimized SYSCALL64 instruction (which is “much faster” than int 0x80 according to H. Peter Anvin), and they’re looking at fixing other problems like 32-bit time_t. So it’s a more substantial effort, but it looks like there’s significant interest and momentum behind this.

Thinking about how this would affect upb, my impression is that I could use my x86-64 JIT unmodified with x32, since it appears to have all of the same calling conventions. It has the same set of callee-save registers and the same set of registers for parameter transfer, and I think these are the main things upb’s JIT-ted code depends on.

Posted in upb | Leave a comment

Beating the compiler

It’s been a while since I’ve posted about upb, but I’ve been busy improving it! I think the biggest achievement I can mention is that the core upb APIs (upb_handlers, upb_def, and upb_bytestream/upb_bytesink) are converging to the point where I’m comfortable with people starting to experiment with them. I’m not promising they won’t change at all, but I’m a lot more confident in their overall structure and semantics than I have been previously.

Notably, I think upb’s deserialization is ready for casual and experimental use. Definitely don’t trust any data to it until it’s better tested, though. I won’t be releasing until things have converged a bit more (and are better tested).

I gave a talk about upb at Google yesterday that was well-received. One question that comes up is “how are you beating the generated code from the protobuf compiler?” For the record, my JIT appears to be about 25% faster than Google’s protobuf release on a completely apples-to-apples test. It is a bit surprising, since my code has basically the same structure as protobuf’s generated C++ — I don’t invent any new optimizations or anything like that. I think it really comes down to generating better assembly than gcc is.

We live in a day and age where common wisdom is that you can’t beat a good C++ compiler, or at least not by much, and I think this is probably true for 99% of use cases. But I was first inspired to think that I could beat the C++ compiler in this case by reading this mailing list post from Mike Pall where he explains why you can still beat the compiler for interpreter main loops. A protobuf parser is surprisingly similar to a byte-code interpreter main loop, so I thought I’d give it a shot.

Below is just the simplest example I could dig up of a side-by-side comparison of my code vs. the compiler’s. What follows is the code to parse a single fixed64 value:

  ; upb JIT assembly:
  mov    rdx,QWORD PTR [rbx+0x2]    ; load fixed64 val out of buffer
  add    rbx,0xa                    ; advance buffer by 10 (2 for tag)
  mov    QWORD PTR [r12+0x40],rdx   ; store fixed64 value in message
  or     BYTE PTR [r12+0x1],0x4     ; set hasbit
  cmp    rbx,QWORD PTR [r15+0xaf8]  ; check for end-of-buffer
  jae    
  mov    rcx,QWORD PTR [rbx]        ; load next tag
  cmp    cx,0x1b0                   ; next field+wt in order?
  je     

There’s not a lot left to cut away here. Compare with protobuf/gcc-generated code:

  ; protobuf/gcc code:
  mov    ecx,DWORD PTR [rbx+0x10]   ; load buffer end
  mov    rax,QWORD PTR [rbx+0x8]    ; load buffer
  sub    ecx,eax                    ; if (buffer_end_ - buffer_ <= 7)
  cmp    ecx,0x7
  jle    
  mov    rax,QWORD PTR [rax]        ; load fixed64 val
  mov    rdx,QWORD PTR [rbp-0x48]   ; load this
  mov    QWORD PTR [rdx],rax        ; store fixed64 val in this
  add    QWORD PTR [rbx+0x8],0x8    ; advance buffer
  or     DWORD PTR [r12+0x74],0x800 ; set hasbit
  mov    rdx,QWORD PTR [rbx+0x10]   ; load buffer end
  mov    rax,QWORD PTR [rbx+0x8]    ; load buffer
  mov    ecx,edx
  sub    ecx,eax                    ; if (buffer_end_ - buffer_ <= 1)
  cmp    ecx,0x1
  jle    
  cmp    BYTE PTR [rax],0xb0        ; check first byte of next tag
  jne    
  cmp    BYTE PTR [rax+0x1],0x1     ; check second byte of next tag
  jne    

There is some poor register allocation going on here — gcc is repeatedly loading buffer_ and buffer_end_ even though it has plenty of registers to play with. All in all, gcc is generating over twice the number of instructions, over twice the number of loads, and more stores too. Note that this is taken from the middle of a very large C++ function with a big switch statement and a bunch of gotos, and I think it’s just difficult for a compiler to do good register allocation under these constraints.

If the C++ compiler could know the difference between fast paths and slow paths and do the register allocation solely for the fast paths (spilling everything for the slow paths) it might have a better shot. But still I think it’s just a hard problem.

Posted in upb | Leave a comment

Using dtrace on OS X to debug a performance problem

I recently ported upb’s table-based decoder to use setjmp/longjmp-based error handling. I did this largely for code simplicity and readability, so that the non-error code-paths didn’t have to check for errors all the time. But unfortunately I noticed a dramatic 75% performance decrease. What was going on?

A profile in Shark showed the majority of my time being spent in system calls, but didn’t make it clear which system calls were involved. It sounded like a job for dtrace.

I used the following script to dump all the system calls that were being issued by my process:

syscall:::entry
/pid == $target/
{
     @[probefunc] = count();
} 

I then ran this script on both my old (fast) benchmark and the new (unexpectedly slow) one:

$ sudo dtrace -c ./old-benchmark -s trace.d

  exit                                             1
  fstat64                                          1
  ioctl                                            1
  write_nocancel                                   1
  munmap                                          93
  mmap                                            94
  getrusage                                     7728    

$ sudo dtrace -c ./new-benchmark -s trace.d

  exit                                             1
  fstat64                                          1
  ioctl                                            1
  munmap                                           1
  write_nocancel                                   1
  getrusage                                      438
  sigaltstack                                 111773
  sigreturn                                   111774
  sigprocmask                                 223546

Sure enough, the new version is making a ton of system calls that weren’t happening before, and they appear to be signal related. I investigated the manpage of setjmp/longjmp and found:

The setjmp()/longjmp() pairs save and restore the signal mask while _setjmp()/_longjmp() pairs save and restore only the register set and the stack. (See sigprocmask(2).)

The sigsetjmp()/siglongjmp() function pairs save and restore the signal mask if the argument savemask is non-zero; otherwise, only the register set and the stack are saved.

I replaced my calls to setjmp/longjmp with sigsetjmp/siglongjmp, passing 0 for my signal mask, and the performance problems went away. Score one for dtrace.

Posted in Uncategorized | Leave a comment

upb status and preliminary performance numbers

The last few weeks have been very exciting for upb. On April 1 I checked in a JIT compiler for parsing protobufs, which one might think was an April Fool’s joke, but it’s real and the performance numbers so far have exceeded my expectations.

Why JIT?

Before I get to the numbers, I should explain what it even means for upb to have a JIT. If you’re not interested in the technical details, feel free to skip this section to go straight to the impacts and what this means for upb.

So why a JIT? After all, protobufs are not an imperative language. You can’t write an algorithm in a .proto file. You can’t apply compiler techniques like SSA, strength reduction, tracing, or really any of the things you’d expect from a JIT on a platform like the JVM. There is no byte code or intermediate representation to speak of. So why would upb have a JIT and what on earth would it do?

To review, a .proto file defines a schema for some messages. In this sense, it is comparable to JSON Schema or XML Schema.

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

To support forwards and backwards compatibility, we don’t make any assumptions about which fields we’re going to see or in what order. Instead, on the wire each field is preceded by its tag number. Logically, the serialized version is something like:

- field number: 1, wire type: varint, value: 1
- field number 2, wire type: delimited, value: "Josh Haberman"
- field number 3, wire type: delimited, value: "jhaberman@gmail.com"

If you were going to write a parser for this, you might create a table of fields, keyed by field number, and a main parser loop that looks something like this:

while (!done) {
  int type = fields[get_field_number()];
  switch (type) {
    case WIRE_TYPE_VARINT:
      parse_varint();
      break;
    case WIRE_TYPE_DELIMITED:
      parse_delimited();
      break;
    // etc.
  }
}

We call this a “table-based parser.” We look up the type in a table of fields, and use that to branch to the correct value parsing function. There are minor variations on this, like using a function pointer instead of a switch statement, but it’s the same general idea. To see an actual program that implements this style of parser, check out my bare-bones 100-line protobuf parser that I wrote a few years ago: pb.c.

In a way this resembles a byte-code interpreter, and the serialized protobuf resembles byte-code. For example, see the main loop of the Lua interpreter. It has a similar pattern where it extracts an opcode and uses that to branch into a giant switch statement.

It turns out that if we generate code that is specific to one message type, and takes advantage of the fact that fields are usually encoded in order, we can beat this table-based parser significantly. Google’s main “protobuf” software has done this for a long time: their strategy is to generate C++ classes that are specific to a proto message type (like Person in the above example). This code has been highly tuned and optimized, and the generated parsing code looks something like this:

while((tag = input->ReadTag()) != 0) {
  field_number = GetFieldNumber(tag);
  wire_type = GetWireType(tag);
  switch(tag) {
    // optional float field1 = 1;
    case 1:
      if (wire_type != FLOAT_WIRE_TYPE) goto error;
      input->ParseFloat();
      // Saves the switch() and type check branch if fields are in order.
      if (input->ExpectTag(2, DOUBLE_WIRE_TYPE)) goto parse_field2;
      break;

    // optional double field2 = 2;
    case 2:
      if (wire_type != DOUBLE_WIRE_TYPE) goto error;
parse_field2:
      input->ParseDouble();
      // saves the switch() and type check branch if fields are in order.
      if (input->ExpectTag(3, INT32_WIRE_TYPE)) goto parse_field3
      break;

    // optional int32 field3 = 3;
    case 3:
    // [...]
  }
}

This code also has a big switch() statement at the top level, but the targets are field numbers instead of wire types. More importantly, each field’s case has a line like this:

if (input->ExpectTag(3, INT32_WIRE_TYPE)) goto parse_field3

This is key, because if the fields occur in order (which they usually do) then this branch will always be taken and it will be predicted correctly by the CPU. This can make a huge difference.

While this kind of C++ code generation has benefited Google’s protobuf software in speed, I always found it inconvenient from a dynamic languages perspective. What dynamic language user wants to have to generate C++ code and link that into their interpreter? We’re not talking about a single C++ extension that you compile once: every single message you define in a .proto file generates different C++ code. So you can’t just apt-get install libprotobuf-python and be done with it, you have to generate C++ for every message that your specific app wants to use.

This is where the JIT comes in. If we can generate machine code at runtime, we can get the speed benefits of the generated code without having to generate, compile, and link C++ into our interpreters or VMs. You could compile the library once, and after that you can dynamically load message definitions but still get the fastest possible parsing.

Preliminary results

I knew all of the theory I just described before I started writing a JIT. But theory is just that — theory. What would actually happen when I implemented a JIT?

The results exceeded my expectations. I still am being somewhat cautious, because there are so many dimensions to any performance equation, and because benchmarks are not guaranteed to correspond to real-world performance. That said, here are my preliminary results. You can reproduce these by obtaining upb from GitHub and running make benchmarks followed by make benchmark (you have to define -DUPB_USE_JIT_X64 to get the JIT).

Parsing an 80k protobuf into a data structure repeatedly,
calling Clear() between each parse.  (proto2 == Google protobuf)

proto2 table-based parser      38 MB/s
proto2 generated code parser   265 MB/s
upb table-based parser         340 MB/s
upb JIT parser                 741 MB/s

The results above are designed to be as apples-to-apples as possible. In other words, I disabled upb’s optimization that avoids memcpy() because Google’s protobuf doesn’t support it in its open-source release. I think the reason that even my table-based parser beats Google’s generated code here is because proto2 implements Clear() in a sub-optimal way that requires an extra pass over the message tree; see my comment in upb_msg.h for more info.

Things get even better if we allow ourselves to drop the constraint of being apples-to-apples with proto2. upb is capable of stream-based parsing like SAX, whereas proto2 only supports a DOM-based approach where you parse into a data structure. If we include the performance numbers for stream parsing, and for DOM-based parsing that avoids memcpy, we get:

upb table-based stream parser    420 MB/s
upb JIT no-memcpy() DOM parser   870 MB/s
upb JIT stream parser           1430 MB/s

If you’re like me when I first saw these numbers, your jaw is on the floor at seeing almost 1.5GB/s doing stream parsing of protocol buffers. At this point we are 5x proto2′s highly optimized generated code.

upb’s JIT isn’t complete and can’t handle all cases, but these performance numbers should still be valid because these benchmarks only used the parts of the format that are currently supported.

So when can I USE it? And can I help?

One reason I’ve avoided posting these extremely positive results so far is that I hate to get people excited about something that’s still not ready for users yet.

Adding support for the JIT required making extremely large and intrusive changes to the core interfaces, like this 3000 line refactoring that had to be completed before I could write even the first line of the JIT. I need to have the flexibility to change core interfaces like this still, because the design is still converging. So as much as I wish I could say it’s ready to go, I still need to hold off on a real release.

The good news is that the design is still making huge strides. In the last few weeks I’ve been refining my scheme for how upb will integrate into different VM’s and language runtimes, and I feel more confident than ever that the language bindings for Lua, Python, etc. will be some of the fastest extensions ever offered for this kind of functionality.

As a preview for where this is going, I think that upb will even be usable as a JSON library, offering speed that is as good or greater than any existing JSON libraries. JSON can be directly mapped onto a subset of protobufs (JSON only uses double-precision numbers) and the protobuf text format is already extremely similar to JSON. So all the work I’ve done to optimize memory layout and dynamic language object access should apply.

And while I’m really happy to get offers to help out, it’s still at a stage of design where I need to be doing most of the work. Working on upb generally involves pacing around my apartment deep in thought about all the requirements and use cases I want to satisfy and brainstorming a million different approaches until I converge on the the one that is the smallest, fastest, and most flexible. It’s hard work but I love it, and the more time that passes the more convinced I am that this is going to be big.

Posted in upb | 7 Comments