Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C
An exciting feature just landed in the main branch of the Clang compiler. Using the [[clang::musttail]] or __attribute__((musttail)) statement attributes, you can now get guaranteed tail calls in C, C++, and Objective-C.
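For example, here is a minimal sketch of what the attribute looks like in use, with a trivial f() that simply forwards to g() (using the GNU-style spelling):

int g(int);

int f(int x) {
  // With musttail, the compiler must emit a jump to g() rather than a call,
  // in every build mode.
  __attribute__((musttail)) return g(x);
}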
While tail calls are usually associated with a functional programming style, I am interested in them purely for performance reasons. It turns out that in some cases we can use tail calls to get better code out of the compiler than would otherwise be possible—at least given current compiler technology—without dropping to assembly.
Applying this technique to protobuf parsing has yielded amazing results: we have managed to demonstrate protobuf parsing at over 2GB/s, more than double the previous state of the art. There are multiple techniques that contributed to this speedup, so “tail calls == 2x speedup” is the wrong message to take away. But tail calls are a key part of what made that speedup possible.
In this blog entry I will describe why tail calls are such a powerful
technique, how we applied them to protobuf parsing, and how this technique
generalizes to interpreters. I think it’s likely that all of the major language
interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant
performance benefits by adopting this technique. The main downside is
portability: currently musttail
is a nonstandard compiler extension, and
while I hope it catches on it will be a while before it spreads widely enough
that your system’s C compiler is likely to support it. That said, at build
time you can compromise some efficiency for portability if you detect that
musttail
is not available.
Tail Call Basics
A tail call is any function call that is in tail position, the final action to be performed before a function returns. When tail call optimization occurs, the compiler emits a jmp instruction for the tail call instead of call. This skips over the bookkeeping that would normally allow the callee g() to return back to the caller f(), like creating a new stack frame or pushing the return address. Instead f() jumps directly to g() as if it were part of the same function, and g() returns directly to whatever function called f(). This optimization is safe because f()’s stack frame is no longer needed once the tail call has begun, since it is no longer possible to access any of f()’s local variables.
While this may seem like a run-of-the-mill optimization, it has two very important properties that unlock new possibilities in the kinds of algorithms we can write. First, it reduces the stack memory from O(n) to O(1) when making n consecutive tail calls, which is important because stack memory is limited and stack overflow will crash your program. This means that certain algorithms are not actually safe to write unless this optimization is performed. Second, jmp eliminates the performance overhead of call, such that a function call can be just as efficient as any other branch. These two properties enable us to use tail calls as an efficient alternative to normal iterative control structures like for or while.
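As a small sketch of that idea (illustrative only, not taken from any real codebase), a loop can be written as a chain of self tail calls that runs in constant stack space:

#include <stddef.h>

#define MUSTTAIL __attribute__((musttail))

// Sums an array by making one tail call per element.  With guaranteed tail
// calls this uses O(1) stack no matter how long the array is, just like an
// ordinary for loop.
static long sum_tail(const long *arr, size_t len, long acc) {
  if (len == 0) return acc;
  MUSTTAIL return sum_tail(arr + 1, len - 1, acc + arr[0]);
}

static long sum(const long *arr, size_t len) {
  return sum_tail(arr, len, 0);  // Ordinary call: this is just the entry point.
}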
This is by no means a new idea; indeed, it goes back to at least 1977, when Guy Steele wrote an entire paper arguing that procedure calls make for cleaner designs than GOTO, and that tail call optimization can make them just as fast. This was one of the “Lambda Papers” written between 1975 and 1980 that developed many of the ideas underlying Lisp and Scheme.
Tail call optimization is not even new to Clang: like GCC and many other compilers, Clang was already capable of optimizing tail calls. In fact, the musttail attribute in our first example above did not change the output of the compiler at all: Clang would already have optimized the tail call under -O2.
What is new is the guarantee. While compilers will often optimize tail calls successfully, this is best-effort, not something you can rely on. In particular, the optimization will most likely not happen in non-optimized builds:
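For instance, compiling the f()/g() example without the attribute and without optimization produces something along these lines (the exact output will vary by compiler version, but the shape is what matters):

f:                                      # @f
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     dword ptr [rbp - 4], edi
        mov     edi, dword ptr [rbp - 4]
        call    g                       # A real call, not a jump.
        add     rsp, 16
        pop     rbp
        ret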
Here the tail call was compiled to an actual call, so we are back to O(n) stack space. This is why we need musttail: unless we can get a guarantee from the compiler that our tail calls will always be optimized, in all build modes, it isn’t safe to write algorithms that use tail calls for iteration. It would be a pretty severe limitation to have code that only works when optimizations are enabled.
The Trouble With Interpreter Loops
Compilers are incredible pieces of technology, but they are not perfect. Mike Pall, author of LuaJIT, decided to write LuaJIT 2.x’s interpreter in assembly rather than C, and he cites this decision as a major factor that explains why LuaJIT’s interpreter is so fast. He later went into more detail about why C compilers struggle with interpreter main loops. His two most central points are:
- The larger a function is, and the more complex and connected its control flow, the harder it is for the compiler’s register allocator to keep the most important data in registers.
- When fast paths and slow paths are intermixed in the same function, the presence of the slow paths compromises the code quality of the fast paths.
These observations closely mirror our experiences optimizing protobuf parsing. The good news is that tail calls can help solve both of these problems.
It may seem odd to compare interpreter loops to protobuf parsers, but the nature of the protobuf wire format makes them more similar than you might expect. The protobuf wire format is a series of tag/value pairs, where the tag contains a field number and wire type. This tag acts similarly to an interpreter opcode: it tells us what operation we need to perform to parse this field’s data. Like interpreter opcodes, protobuf field numbers can come in any order, so we have to be prepared to dispatch to any part of the code at any time.
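Concretely, a protobuf tag is a varint whose low three bits are the wire type and whose remaining bits are the field number, so fields numbered 1 through 15 have a one-byte tag. A quick sketch of the decoding:

#include <stdint.h>

// Split a one-byte protobuf tag into its field number and wire type.
// (Field numbers 1-15 fit in a single tag byte.)
static inline void decode_tag(uint8_t tag, uint32_t *field_number,
                              uint32_t *wire_type) {
  *wire_type = tag & 0x7;    // Low 3 bits: 0=varint, 1=64-bit,
                             // 2=length-delimited, 5=32-bit, etc.
  *field_number = tag >> 3;  // Remaining bits: the field number.
}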
The natural way to write such a parser is to have a while loop surrounding a switch statement, and indeed this has been the state of the art in protobuf parsing for basically as long as protobufs have existed. For example, here is some parsing code from the current C++ version of protobuf. If we represent the control flow graphically, we get something like this:
But this is incomplete, because at almost every stage there are things that can go wrong. The wire type could be wrong, or we could see some corrupt data, or we could just hit the end of the current buffer. So the full control flow graph looks more like this.
We want to stay on the fast paths (in blue) as much as possible, but when we hit a hard case we have to execute some fallback code to handle it. These fallback paths are usually bigger and more complicated than the fast paths, touch more data, and often even make out-of-line calls to other functions to handle the more complex cases.
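In code, a stripped-down sketch of this while/switch shape (illustrative only; this is not the actual protobuf parsing code, and the message layout here is made up) looks something like:

#include <stdint.h>
#include <string.h>

typedef struct { char data[64]; } msg_t;  // Placeholder message type.

// Read tag/value pairs until the buffer is exhausted, dispatching on the
// wire type.  Fast path: 32-bit fixed fields.  Everything else falls to a
// (stubbed-out) slow path.
static const char *parse(const char *ptr, const char *end, msg_t *msg) {
  while (ptr < end) {
    uint8_t tag = (uint8_t)*ptr++;
    uint32_t field = tag >> 3;
    uint32_t wire_type = tag & 0x7;
    switch (wire_type) {
      case 5:  // 32-bit fixed field: the fast path.
        if (end - ptr < 4 || field == 0 || field > 15) return NULL;
        memcpy(msg->data + field * 4, ptr, 4);
        ptr += 4;
        break;
      default:  // Wrong wire type, corrupt data, etc.: the fallback path.
        return NULL;
    }
  }
  return ptr;
}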
Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate optimal code. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink-wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.
Improving Interpreter Loops With Tail Calls
The analysis above is mainly just a rehash of Mike’s observations about interpreter main loops. But instead of dropping to assembly, as Mike did with LuaJIT 2.x, we found that a tail call oriented design could give us the control we needed to get nearly optimal code from C. I worked on this together with my colleague Gerben Stavenga, who came up with much of the design. Our approach is similar to the design of the wasm3 WebAssembly interpreter, which describes this pattern as a “meta machine”.
The code for our 2+GB/s protobuf parser was submitted to
upb, a small protobuf library written
in C, in pull/310. While it
is fully working and passing all protobuf conformance tests, it is not rolled
out anywhere yet, and the design has not been implemented in the C++ version of
protobuf. But now that musttail
is available in Clang (and upb has been
updated to use it), one of
the biggest barriers to fully productionizing the fast parser has been removed.
Our design does away with a single big parse function and instead gives each operation its own small function. Each function tail calls the next operation in sequence. For example, here is a function to parse a single fixed-width field. (This code is simplified from the actual code in upb; there are many details of our design that I am leaving out of this article but hope to cover in future articles.)
#include <stdint.h>
#include <stddef.h>
#include <string.h>
typedef void *upb_msg;
struct upb_decstate;
typedef struct upb_decstate upb_decstate;
// The standard set of arguments passed to each parsing function.
// Thanks to x86-64 calling conventions, these will be passed in registers.
#define UPB_PARSE_PARAMS \
upb_decstate *d, const char *ptr, upb_msg *msg, intptr_t table, \
uint64_t hasbits, uint64_t data
#define UPB_PARSE_ARGS d, ptr, msg, table, hasbits, data
#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))
const char *fallback(UPB_PARSE_PARAMS);
const char *dispatch(UPB_PARSE_PARAMS);
// Code to parse a 4-byte fixed field that uses a 1-byte tag (field 1-15).
const char *upb_pf32_1bt(UPB_PARSE_PARAMS) {
// Decode "data", which contains information about this field.
uint8_t hasbit_index = data >> 24;
size_t ofs = data >> 48;
if (UNLIKELY(data & 0xff)) {
// Wire type mismatch (the dispatch function xor's the expected wire type
// with the actual wire type, so data & 0xff == 0 indicates a match).
MUSTTAIL return fallback(UPB_PARSE_ARGS);
}
ptr += 1; // Advance past tag.
// Store data to message.
hasbits |= 1ull << hasbit_index;
memcpy((char*)msg + ofs, ptr, 4);
ptr += 4; // Advance past data.
// Call dispatch function, which will read the next tag and branch to the
// correct field parser function.
MUSTTAIL return dispatch(UPB_PARSE_ARGS);
}
For a function this small and simple, Clang gives us code that is basically impossible to beat.
upb_pf32_1bt: # @upb_pf32_1bt
mov rax, r9
shr rax, 24
bts r8, rax
test r9b, r9b
jne .LBB0_1
mov r10, r9
shr r10, 48
mov eax, dword ptr [rsi + 1]
mov dword ptr [rdx + r10], eax
add rsi, 5
jmp dispatch # TAILCALL
.LBB0_1:
jmp fallback # TAILCALL
Note that there is no prologue or epilogue, no register spills, indeed there is no usage of the stack whatsoever. The only exits are jmps from the two tail calls, but no code is required to forward the parameters, because the arguments are already sitting in the correct registers. Pretty much the only improvement we could hope for is to get a conditional jump for the tail call, jne fallback, instead of jne followed by jmp.
If you were looking at a disassembly of this code without symbol information, you would have no reason to know that this was an entire function. It could just as easily be a basic block from a larger function. And that, in essence, is exactly what we are doing. We are taking an interpreter loop that is conceptually a big complicated function and programming it block by block, transferring control flow from one to the next via tail calls. We have full control of the register allocation at every block boundary (well, for six registers at least), and by using the same set of parameters for each function, we eliminate the need to shuffle any values around from one call to the next. As long as the function is simple enough to not spill those six registers, we’ve achieved our goal of keeping our most important state in registers throughout all of the fast paths.
We can optimize every instruction sequence independently, and crucially, the
compiler will treat each sequence as independent too because they are in
separate functions (we can prevent inlining with noinline
if necessary).
This solves the problem we described earlier where the code from fallback paths
would degrade the code quality for fast paths. If we put the slow paths in
entirely separate functions from the fast paths, we can be guaranteed that
the fast paths will not suffer. The nice assembly sequence we see above is
effectively frozen, unaffected by any changes we make to other parts of the
parser.
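To give a sense of how the blocks chain together, here is a hypothetical sketch of the dispatch step in this style: a table of field-parser functions indexed by the next tag byte, where each entry is tail called with the same six arguments. (The table layout and names here are invented for illustration; upb’s real dispatch packs considerably more information, including the expected-vs-actual wire type XOR, into the data argument.)

#include <stdint.h>

typedef void *upb_msg;
typedef struct upb_decstate upb_decstate;

#define UPB_PARSE_PARAMS \
  upb_decstate *d, const char *ptr, upb_msg *msg, intptr_t table, \
  uint64_t hasbits, uint64_t data
#define UPB_PARSE_ARGS d, ptr, msg, table, hasbits, data
#define MUSTTAIL __attribute__((musttail))

typedef const char *upb_parse_func(UPB_PARSE_PARAMS);

// Hypothetical dispatch table: one field-parser entry per possible tag byte.
struct upb_table { upb_parse_func *entries[256]; };

static const char *dispatch(UPB_PARSE_PARAMS) {
  const struct upb_table *t = (const struct upb_table *)table;
  uint8_t tag = (uint8_t)*ptr;  // Peek at the next tag byte.
  data = tag;                   // The real parser packs much more in here.
  // Every parser function takes the same six arguments, so nothing needs to
  // be shuffled between registers across this jump.
  MUSTTAIL return t->entries[tag](UPB_PARSE_ARGS);
}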
If we apply this pattern to Mike’s example from LuaJIT, we can more or less match his hand-written assembly with only minor code quality defects:
#define PARAMS unsigned RA, void *table, unsigned inst, \
int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
typedef void (*op_func)(PARAMS);
void fallback(PARAMS);
#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))
void ADDVN(PARAMS) {
op_func *op_table = table;
unsigned RC = inst & 0xff;
unsigned RB = (inst >> 8) & 0xff;
unsigned type;
memcpy(&type, (char*)&regs[RB] + 4, 4);
if (UNLIKELY(type > -13)) {
return fallback(ARGS);
}
regs[RA] += consts[RC];
inst = *op_p++;
unsigned op = inst & 0xff;
RA = (inst >> 8) & 0xff;
inst >>= 16;
MUSTTAIL return op_table[op](ARGS);
}
The resulting assembly is:
ADDVN: # @ADDVN
movzx eax, dh
cmp dword ptr [r9 + 8*rax + 4], -12
jae .LBB0_1
movzx eax, dl
movsd xmm0, qword ptr [r8 + 8*rax] # xmm0 = mem[0],zero
mov eax, edi
addsd xmm0, qword ptr [r9 + 8*rax]
movsd qword ptr [r9 + 8*rax], xmm0
mov edx, dword ptr [rcx]
add rcx, 4
movzx eax, dl
movzx edi, dh
shr edx, 16
mov rax, qword ptr [rsi + 8*rax]
jmp rax # TAILCALL
.LBB0_1:
jmp fallback
The only opportunity for improvement I see here, aside from the jne fallback issue mentioned before, is that for some reason the compiler doesn’t want to generate jmp qword ptr [rsi + 8*rax]. Instead it prefers to load into rax and then follow with jmp rax. These are minor code generation issues that could hopefully be fixed in Clang without too much work.
Limitations
One of the biggest caveats with this approach is that these beautiful assembly sequences get catastrophically pessimized if any non-tail calls are present. Any non-tail call forces a stack frame to be created, and a lot of data spills to the stack.
#define PARAMS unsigned RA, void *table, unsigned inst, \
int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
typedef void (*op_func)(PARAMS);
void fallback(PARAMS);
#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))
void ADDVN(PARAMS) {
op_func *op_table = table;
unsigned RC = inst & 0xff;
unsigned RB = (inst >> 8) & 0xff;
unsigned type;
memcpy(&type, (char*)&regs[RB] + 4, 4);
if (UNLIKELY(type > -13)) {
// When we leave off "return", things get real bad.
fallback(ARGS);
}
regs[RA] += consts[RC];
inst = *op_p++;
unsigned op = inst & 0xff;
RA = (inst >> 8) & 0xff;
inst >>= 16;
MUSTTAIL return op_table[op](ARGS);
}
This leads to the very unfortunate:
ADDVN: # @ADDVN
push rbp
push r15
push r14
push r13
push r12
push rbx
push rax
mov r15, r9
mov r14, r8
mov rbx, rcx
mov r12, rsi
mov ebp, edi
movzx eax, dh
cmp dword ptr [r9 + 8*rax + 4], -12
jae .LBB0_1
.LBB0_2:
movzx eax, dl
movsd xmm0, qword ptr [r14 + 8*rax] # xmm0 = mem[0],zero
mov eax, ebp
addsd xmm0, qword ptr [r15 + 8*rax]
movsd qword ptr [r15 + 8*rax], xmm0
mov edx, dword ptr [rbx]
add rbx, 4
movzx eax, dl
movzx edi, dh
shr edx, 16
mov rax, qword ptr [r12 + 8*rax]
mov rsi, r12
mov rcx, rbx
mov r8, r14
mov r9, r15
add rsp, 8
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
jmp rax # TAILCALL
.LBB0_1:
mov edi, ebp
mov rsi, r12
mov r13d, edx
mov rcx, rbx
mov r8, r14
mov r9, r15
call fallback
mov edx, r13d
jmp .LBB0_2
To avoid this, we tried to follow a discipline of only calling other functions via inlining or tail calls. This can get annoying if an operation has multiple points at which an unusual case can occur that is not an error. For example, when we are parsing protobufs, the fast and common case is that varints are only one byte long, but longer varints are not an error. Handling the unusual case inline can compromise the quality of the fast path if the fallback code is too complicated. But tail calling to a fallback function gives no way of easily resuming the operation once the unusual case is handled, so the fallback function must be capable of pushing forward and completing the operation. This leads to code duplication and complexity.
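To make the trade-off concrete, here is a sketch of what the varint case can look like (again with made-up names, not upb’s actual code): the fast path handles the one-byte case and tail calls a fallback that has to finish the whole field, including the store, on its own.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *upb_msg;
typedef struct upb_decstate upb_decstate;

#define UPB_PARSE_PARAMS \
  upb_decstate *d, const char *ptr, upb_msg *msg, intptr_t table, \
  uint64_t hasbits, uint64_t data
#define UPB_PARSE_ARGS d, ptr, msg, table, hasbits, data
#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

const char *dispatch(UPB_PARSE_PARAMS);
// Must decode the multi-byte varint *and* store it to the message, because
// there is no cheap way to resume the fast-path function where it left off.
const char *longvarint_fallback(UPB_PARSE_PARAMS);

// Hypothetical parser for a varint field with a 1-byte tag.
const char *upb_pv64_1bt_sketch(UPB_PARSE_PARAMS) {
  uint64_t val = (uint8_t)ptr[1];  // The byte after the 1-byte tag.
  if (UNLIKELY(val & 0x80)) {
    // Not an error, just uncommon: the varint continues into more bytes.
    MUSTTAIL return longvarint_fallback(UPB_PARSE_ARGS);
  }
  size_t ofs = data >> 48;  // Field offset, as in the fixed32 example above.
  memcpy((char*)msg + ofs, &val, sizeof(val));
  ptr += 2;  // Tag byte + one varint byte.
  MUSTTAIL return dispatch(UPB_PARSE_ARGS);
}

The storing logic ends up duplicated between the fast path and longvarint_fallback, which is exactly the duplication and complexity described above.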
Ideally this issue could be solved by adding __attribute__((preserve_most)) to the fallback functions and then calling them normally, without tail calls. The preserve_most attribute makes the callee responsible for preserving nearly all registers, which moves the cost of the register spills to the fallback functions, where we want it. We experimented some with this attribute but ran into some mysterious problems that we were not able to get to the bottom of. It may have been an error on our part; revisiting this is future work. [Update 2023-03-14: it turns out this was a bug in Clang, which was fixed two months ago.]
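Sketched with hypothetical names, the idea looks like this: the slow path is declared preserve_most, so an ordinary call to it no longer forces the fast path to spill its own state around the call site, and the fast path can resume where it left off.

#include <stdint.h>
#include <string.h>

// Hypothetical out-of-line slow path.  With preserve_most, the *callee* saves
// and restores nearly all general-purpose registers, so the register-spill
// cost lives here instead of in every caller.
__attribute__((preserve_most))
const char *decode_long_varint(const char *ptr, uint64_t *val);

static const char *parse_varint_field(const char *ptr, uint64_t *val,
                                      char *field_dst) {
  *val = (uint8_t)ptr[0];
  if (__builtin_expect(*val & 0x80, 0)) {
    // Ordinary (non-tail) call: we can resume right here afterwards, and the
    // call does not force this function to spill its registers.
    ptr = decode_long_varint(ptr, val);
  } else {
    ptr += 1;
  }
  memcpy(field_dst, val, sizeof(*val));  // Resume the fast path: store result.
  return ptr;
}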
The other major limitation is that musttail
is not portable. I very much
hope that the attribute will catch on, spreading to GCC, Visual C++, and other
popular compilers, and even get standardized someday. But that day is far off,
so what to do in the meantime?
When musttail is not available, we need to perform at least one true return, without a tail call, for every conceptual loop iteration. We have not yet implemented this fallback in upb, but I expect it will involve a macro that either tail calls to dispatch or just returns, based on the availability of musttail.
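Purely as a speculative sketch of what that macro could look like (none of this exists in upb yet, and the names are hypothetical):

// Detect the attribute; with it, jump straight to the next field parser.
#if defined(__has_attribute)
#if __has_attribute(musttail)
#define UPB_NEXT_FIELD() __attribute__((musttail)) return dispatch(UPB_PARSE_ARGS)
#endif
#endif

// Without it, return to an outer driver loop that calls dispatch() itself,
// so each field costs a real call/return instead of a jump, but stack usage
// stays bounded in every build mode.
#ifndef UPB_NEXT_FIELD
#define UPB_NEXT_FIELD() return ptr
#endif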