Tuesday, May 14, 2013

Deep Wizardry: Stack Unwinding

This weekend I budgeted "a few hours" to learn about stack unwinding, but oh what a journey down the rabbit hole it became. This esoteric topic lies at the intersection of compilers, linkers, loaders, debuggers, ABIs, and language runtimes. There is precious little documentation about it and, from what I can tell, a small handful of guys who answer all the questions about it. People speculate about what happens if Linus Torvalds gets hit by a bus; I think we're just as screwed if the fellowship of linker/loader wizards bite it (or decide to change careers). But I digress.

I feel compelled to write this article in Q&A style. Plato I'm not, but maybe it's because there are so many questions to answer.

What is stack unwinding?

In any language that supports recursive functions (ie. pretty much everything except Fortran 77 and Brainf*ck) the language runtime keeps a stack of what functions are currently executing. Stack unwinding is a way of inspecting, and possibly modifying, that stack.

Why would you want to do that?

The answer may seem obvious, but there are several related, yet subtly different, situations where unwinding is useful or necessary:
  1. As a runtime control-flow mechanism (C++ exceptions, C longjmp(), etc).
  2. In a debugger, to show the user the stack.
  3. In a profiler, to take a sample of the stack.
  4. From the program itself (like from a crash handler to show the stack).
These have subtly different requirements. Some of these are performance-critical, some are not. Some require the ability to reconstruct registers from outer frame, some do not. But we'll get into all that in a second.

Ok, lay it on me: how do you unwind the stack?

Let's start with the simplest case, which is the longjmp() function from C (see The Wikipedia entry for setjmp.h for the basics). longjmp() jumps down the stack to the corresponding setjmp(), across as many stack frames as there are in between. Unlike the other approaches we will see, longjmp() does not iterate over the list of frames, it just blasts through them all at once with wanton disregard for anything precious or beautiful that they might have contained.

The implementations of setjmp() and longjmp are surprisingly simple, though heavily architecture specific. For example, let's take a look at the x86-64 implementation of setjmp/longjmp from the FreeBSD kernel (you can look at the userspace version from libc too, but it is slightly more complicated because it also has to save/restore other status registers). I've converted this to Intel syntax:
_setjmp:
  mov    QWORD PTR [rdi],rbx        ; save rbx
  mov    QWORD PTR [rdi+0x8],rsp    ; save rsp
  mov    QWORD PTR [rdi+0x10],rbp   ; save rbp
  mov    QWORD PTR [rdi+0x18],r12   ; save r12
  mov    QWORD PTR [rdi+0x20],r13   ; save r13
  mov    QWORD PTR [rdi+0x28],r14   ; save r14
  mov    QWORD PTR [rdi+0x30],r15   ; save r15
  mov    rdx,QWORD PTR [rsp]        ; get return address
  mov    QWORD PTR [rdi+0x38],rdx   ; save return address
  xor    eax,eax                ; return 0
  ret

_longjmp:
  mov    rbx,QWORD PTR [rdi]        ; restore rbx
  mov    rsp,QWORD PTR [rdi+0x8]    ; restore rsp
  mov    rbp,QWORD PTR [rdi+0x10]   ; restore rbp
  mov    r12,QWORD PTR [rdi+0x18]   ; restore r12
  mov    r13,QWORD PTR [rdi+0x20]   ; restore r13
  mov    r14,QWORD PTR [rdi+0x28]   ; restore r14
  mov    r15,QWORD PTR [rdi+0x30]   ; restore r15
  mov    rdx,QWORD PTR [rdi+0x38]   ; get return address
  mov    QWORD PTR [rsp],rdx        ; restore return address
  xor    eax,eax
  inc    eax                    ; return 1
  ret
Maybe it's just me, but I expected these functions to look way more gnarly. All setjmp is doing is saving a bunch of registers including rsp (the stack pointer) and the return address into the jmp_buf array that was passed as a parameter. All longjmp is doing is restoring those registers and the return address of the original setjmp call. This tweaking of the return address is how the call to longjmp manages to "return from" the original setjmp call.

This technique is simple, but so primitive that you might not even call it "unwinding." We are just setting the stack pointer rsp to its old value, which instantly discards the entire contents of the stack between the longjmp and setjmp calls. Any resources that were stored there are leaked! Any other data that was in the middle of being modified might be corrupted, unless you took care to leave it in a consistent state before making the function call that ultimately called longjmp.

What about the other registers, like rax?

Good eye! The x86-64 architecture has 16 general-purpose registers, but we only saved 7 of them. What about the others? Well recall that setjmp and longjmp are implemented as functions and as such they follow the standard x86-64 calling convention. The x86-64 calling convention dictates that the 7 registers above are owned by the caller (also known as callee-save), which means that setjmp is responsible for restoring these registers before it returns. The other registers are owned by the callee function, the callee is allowed to clobber them however it likes before returning. So we're under no obligation to restore these registers before returning, and the caller can't make any assumptions about what they will contain.

Who decides these calling conventions?

Probably the same gurus who answer all my linker/loader questions. In my mind they all sit in a circle and meditate, reflecting on ancient texts from The Cult of the Bound Variable.

Ok, that makes sense, stack unwinding doesn't seem so bad.

That's because I started with the simplest case. Every other stack-unwinding mechanism has to walk the stack, frame-by-frame.

How do you walk the list of frames?

There are two main ways. I'll start with the simpler one. If you've ever looked at a function in assembly, you might have seen a prologue and epilogue that look like this:
  push  rbp
  mov   rbp, rsp
  ;  other stuff...
  pop   rbp
  ret
I remember when I was taking assembly language in college I could not understand why I had to do this. What's the point of having both a base pointer and a stack pointer, both pointing at the stack? As long as you adjust the stack pointer back to where it started before you call ret you should be fine, right?

It turns out this convention comes in really handy when you want to walk the stack (like to generate a backtrace), because it connects all the stack frames together into a linked list whose head pointer is rbp! This means that generating the back trace is really as simple as a linked list traversal.

Aha! So that's how debugger backtraces and backtrace() from libc work!

Actually... no. But that's how they used to work! On 32-bit x86, it used to be that everyone compiled with a frame pointer, because this was the only way of getting stack traces in the debugger. But this scheme has two significant overheads:
  1. One general-purpose register is used up at all times. On x86, this meant 1 out of only 8 registers was gone. On x86-64, it would be 1 out of 16.
  2. Every function has to have this prologue and epilogue I showed above.
To avoid these overheads, the people who were developing the ABI for x86-64 decided to no longer require a base pointer in every function. Here is some more info from someone who was not thrilled with that decision. (Apparently it caused problems for DTrace). And as of recently (around GCC 4.6, from what I'm told), omitting the frame pointer has become the default on 32-bit x86 too, which frees up a register and improves performance.

How can you walk up the stack without a base pointer?

I know, it seems hopeless. Without a base pointer stored in each stack frame, how do you find the previous frame?

What the tools do nowadays is store some information in a different part of the file that provides "unwind information." Unwind information tells you, for an arbitrary instruction:
  1. how much the stack pointer has been adjusted (or equivalently, how much the unwinder has to adjust it back to get to the base of the frame, which is necessary to find the return address).
  2. where in the frame the callee-save registers are stored, if they have been clobbered since the function began.
  3. (optionally) a pointer to a custom "personality function" for unwinding the stack
  4. lots of other things? I don't fully understand this yet.
While more complicated, this gives us a capability we didn't have with the simple linked-list approach: we can restore the callee-save registers of previous frames, so a debugger can show local variables that were saved in registers.

That sounds useful! Who defines the format of debug information?

Debug information on most modern UNIX-like OS's is stored in the DWARF format (a clever pun on the ELF object file format). DWARF specifies that this unwind information should go in a section called .debug_frame.

I see, so the modern way to specify unwind information is in the .debug_frame section.

That would be too easy. When the x86-64 ABI was being developed, they decided to specify that debug information would go in the .eh_frame section, in a slightly different format than is defined by DWARF. .eh_frame is what all the tools appear to use in practice. And there is a standard, portable library for unwinding based on the .eh_frame/.debug_frame mechanism called libunwind.

Gotcha, so gdb and g++'s exception support use libunwind?

No, they have their own unwinder implementations that are completely independent of libunwind (gcc's unwind-dw2.c and gdb's dwarf2-frame.c). Gcc can link against libunwind but uses its own unwinder by default. gdb has a whole bunch of unwinders, including one that scans the machine code to figure out how the frame was setup.

Ok, so what about C++ exceptions?

C++ exceptions use .eh_frame information also. They use this information to not only recreate the frame, but execute code in it (C++ destructors and catch blocks) and finally to dismantle it before transferring control to the frame with the matching catch block.

How on Earth does that work?

I don't really know, but when I find out I'll post another blog entry.

Friday, March 1, 2013

C/C++ Gripe #1: integer types

C is a great language. When someone asks "what would you change about C?" it is not easy for me to think of something that I think it just plain got wrong. And while C++ is large and complicated, I generally feel that, for what it is trying to be it is pretty well done too.

I wanted to preface this entry with that, because the word "gripe" could be taken to mean that I am a C or C++ hater. Not the case; they are my favorite languages that I use regularly. But with the benefit of hindsight, I think it's worth mentioning a few of their design choices here and there that make life difficult, where a genuinely better alternative exists. Which brings me to this entry.

C's integer types are well-explained in the Wikipedia entry C data types. The types char, short, int, long, long long and their unsigned equivalents are defined without specifying their exact size, but instead according to a loose set of rules which, in my experience, are rarely useful.

Until C99 there was no standard way of declaring integers of a specified size (ie. a 16-bit signed integer). You could declare a signed short; this is guaranteed to be at least 16 bits, but it could be larger. The lack of fixed-width types led to every project reinventing the same typedefs for these, over and over. You can have wxInt32 from wxWidgets, or qint32 from Qt or gint32 from glib. Almost every C or C++ library would eventually find itself defining these same typedefs. But thankfully in C99 we got stdint.h which gives us fixed width types like int32_t in the standard library! Problem solved, no?

Well, not quite unfortunately. Since int32_t is just a typedef, there can be multiple primitive types that are 32 bits wide. For example, in the ILP32 programming model, both int and long are 32 bits. So it's totally arbitrary which of these typedefs is in stdint.h:
// Both of these are equally valid to have in stdint.h:
typedef int int32_t;
typedef long int32_t;
And the really unfortunate part is that int and long are still distinct, incompatible types, which means that this program won't compile:
// From library A:
typedef int a_int32_t;

// Passes a pointer to a function that takes a 32-bit integer.
void regfunc(void (*f)(a_int32_t));

// From library B:
typedef long b_int32_t;

void my_callback(b_int32_t x) { /* ... */ }

int main() {
  regfunc(&my_callback);
}
This code will fail to compile because void mycallback(long) is not compatible with void f(int) even though both long and int are 32 bits! And in this case both were used through fixed-width typedefs, so the code looks like it should work.

What would have been better is if the primitive types had been defined in terms of the fixed-width types (int32_t, uint32_t, etc). Then, if desired, the more loosely-defined types like int, long, etc. could be the typedefs. If things were defined this way, then you would never run into this problem where two integer types are the same size, yet are incompatible.

A possible idea for improving the status quo would be to make primitive types compatible if they are the same size. That would make it legal to convert between the two function pointer types, and would probably require basically no work in real-world compilers to enable.

However that doesn't solve a similar problem in C++, which happens when you partially specialize on a_int32_t (for example) only to find that your partial specialization doesn't apply to b_int32_t. Fixing this is not as easy, because some users could have code that depends on these two types having different specializations, even though they are the same size.

In closing: if you invent a new language, please make the primitive integer types fixed-width. Your users will thank you.

Thursday, January 3, 2013

Reader Challenge: Optimize the BF JIT

In my previous entry Hello, JIT World: The Joy of Simple JITs, the final example was a JIT for BF. The performance of this JIT beat an optimizing interpreter, but lost to a translation to C (which I compiled with gcc). I speculated that the reason for this was that my JIT didn't have a register allocator (or perform any optimizations at all).

When I asked Mike Pall, author of LuaJIT and DynASM for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I've implemented a few of these already on my own, and already was able to increase performance by 2-3x. Implement all of them, and the JIT would likely beat the compiler.

The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.

Here are the optimizations Mike suggested (I'll try to give the idea without giving away what you'd optimize these patterns into):
  1. implement a peephole optimizer that optimizes runs of <<<, >>>, +++, and --- (each run can be optimized into a single instruction).
  2. fold PTR movements into the inc/dec/add/sub arguments themselves (use x86 addressing to avoid having to move the pointer between every +/- instruction).
  3. optimize patterns like [-] and [->>>+<<<], etc.
  4. take advantage of flag propagation from add/sub to optimize away cmp instructions when possible (but beware of inc/dec and partial flag stalls (see section 5.8, p.58 in The microarchitecture of Intel, AMD and VIA CPUs)).
These are just ideas -- not all of them are necessary for good performance and there are almost certainly other ideas I haven't listed here. Try it out -- it's fun!

Monday, December 31, 2012

Hello, JIT World: The Joy of Simple JITs

This is a demonstration of how simple and enjoyable small JITs (just-in-time compilers) can be. The word "JIT" tends to invoke an image of deepest wizardry, something that only teams of the most hard-core compiler guys would ever dream of creating. It makes you think of the JVM or .NET, very large runtimes with hundreds of thousands of lines of code. You never see "Hello, World!" sized programs for JITs that do something interesting in a small amount of code. This article is an attempt to change that.

If you think about it, a JIT is not that different from a program that calls printf(), a JIT just so happens to emit machine code rather than a message like "Hello, World!" Sure, JITs like the JVM are highly complicated beasts, but that's because they are implementing a complicated platform and performing aggressive optimizations. If we work with something simpler, our program can be much simpler too.

The most difficult part of writing a simple JIT is encoding the instructions so they can be understood by your target CPU. For example, on x86-64, the instruction push rbp is encoded as the byte 0x55. Implementing this encoding is boring and requires reading lots of CPU manuals, so we're going to skip that part. Instead we'll use Mike Pall's very nice library DynASM to handle the encoding. DynASM has a very novel approach that lets you intermix the assembly code you're generating with the C code of your JIT, which lets you write a JIT in a very natural and readable way. It supports many CPU architectures (x86, x86-64, PowerPC, MIPS, and ARM at the time of this writing) so you're unlikely to be limited by its hardware support. DynASM is also exceptionally small and unimposing; its entire runtime is contained in a 500-line header file.

I should briefly clarify my terminology. I am calling a "JIT" any program that executes machine code that was generated at runtime. Some authors use this term in a more specific way, and only consider a program a JIT if it is a hybrid interpreter/compiler that generates machine code in small fragments, on-demand. These authors would call the more general technique of run-time code generation dynamic compilation. But "JIT" is the more common and recognizable term, and is often applied to a variety of approaches that do not meet the most rigid definition of a JIT, like the Berkeley Packet Filter JIT.

Hello, JIT World!

Without further ado, let's jump into our first JIT. This and all the other programs are in my GitHub repository jitdemo. The code is Unix-specific since it uses mmap(), and we're generating x86-64 code so you'll need a processor and OS that support that. I've tested that it works on Ubuntu Linux and Mac OS X.

We won't even use DynASM for this first example, to keep it as bare-bones as possible. This program is called jit1.c.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
  // Machine code for:
  //   mov eax, 0
  //   ret
  unsigned char code[] = {0xb8, 0x00, 0x00, 0x00, 0x00, 0xc3};

  if (argc < 2) {
    fprintf(stderr, "Usage: jit1 <integer>\n");
    return 1;
  }

  // Overwrite immediate value "0" in the instruction
  // with the user's value.  This will make our code:
  //   mov eax, <user's value>
  //   ret
  int num = atoi(argv[1]);
  memcpy(&code[1], &num, 4);

  // Allocate writable/executable memory.
  // Note: real programs should not map memory both writable
  // and executable because it is a security risk.
  void *mem = mmap(NULL, sizeof(code), PROT_WRITE | PROT_EXEC,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
  memcpy(mem, code, sizeof(code));

  // The function will return the user's value.
  int (*func)() = mem;
  return func();
}
It may seem hard to believe at 33 lines, but this is an legit JIT (try saying that five times fast). It dynamically generates a function that returns a runtime-specified integer and then runs that function. You can verify that it's working:
$ ./jit1 42 ; echo $?
42
You'll notice that I have to use mmap() to allocate the memory instead of malloc(), the normal way of getting memory from the heap. This is necessary because we need the memory to be executable so we can jump to it without crashing the program. On most systems the stack and heap are configured not to allow execution because if you're jumping to the stack or heap it means something has gone very wrong. Worse, a hacker who is taking advantage of a buffer overflow can use an executable stack to more easily exploit the bug. So generally we want to avoid mapping any memory both writable and executable, and it's a good habit to follow this rule in your own programs too. I broke this rule above, but that was just to keep our first program as simple as possible.

I also cut corners by not releasing the memory I allocated. We'll remedy this soon enough; mmap() has a corresponding function munmap() that we can use to release memory back to the OS.

You might wonder why you can't call a function that just changes the permissions of the memory you get from malloc(). Having to allocate executable memory in a totally different way sounds like a drag. In fact there is a function that can change permissions on memory you already have; it's called mprotect(). But these permissions can only be set on page boundaries; malloc() will give you some memory from the middle of a page, a page that you do not own in its entirety. If you start changing permissions on that page you'll affect any other code that might be using memory in that page.

Hello, DynASM World!

DynASM is a part of the most impressive LuaJIT project, but is totally independent of the LuaJIT code and can be used separately. It consists of two parts: a preprocessor that converts a mixed C/assembly file (*.dasc) to straight C, and a tiny runtime that links against the C to do the work that must be deferred until runtime.


This design is nice because all of the hairy and complicated code to parse assembly language and encode machine code instructions can be written in a high-level, garbage collected language (Lua), but this is only needed at build time; the runtime has no Lua dependency. This is a case of having your cake and eating it too: the majority of DynASM can be written in Lua without the runtime having to pay for (or depend on) Lua.

For our first DynASM example, I'll write a program that generates exactly the same function as our last example. That way we can compare apples to apples and see the difference between the two approaches, and understand what DynASM is buying us.
// DynASM directives.
|.arch x64
|.actionlist actions

// This define affects "|" DynASM lines.  "Dst" must
// resolve to a dasm_State** that points to a dasm_State*.
#define Dst &state

int main(int argc, char *argv[]) {
  if (argc < 2) {
    fprintf(stderr, "Usage: jit1 <integer>\n");
    return 1;
  }

  int num = atoi(argv[1]);
  dasm_State *state;
  initjit(&state, actions);

  // Generate the code.  Each line appends to a buffer in
  // "state", but the code in this buffer is not fully linked
  // yet because labels can be referenced before they are
  // defined.
  //
  // The run-time value of C variable "num" is substituted
  // into the immediate value of the instruction.
  |  mov eax, num
  |  ret

  // Link the code and write it to executable memory.
  int (*fptr)() = jitcode(&state);

  // Call the JIT-ted function.
  int ret = fptr();
  assert(num == ret);

  // Free the machine code.
  free_jitcode(fptr);

  return ret;
}
This is not the full program; some helper functionality for initializing DynASM and allocating/freeing executable memory is defined in dynasm-driver.c. This shared helper code will be the same in all of our examples, so we omit it here; it is fairly straightforward and well-commented in the repository.

The key difference to observe is how we generate instructions. Our .dasc file can include assembly language, similar to how you would write in a .S file. Files that begin with a pipe (|) are interpreted by DynASM and can contain assembly language instructions or directives. This is a far more powerful approach than our first example. In particular, note how one of the arguments to our mov instruction refers to a C variable; DynASM knows how to substitute the value of this variable into the instruction when it is generated.

To see how this is accomplished, we can look at the output of the preprocessor in jit2.h (which was generated from jit2.dasc). I've excerpted the interesting parts; the rest of the file is just passed through unmodified.
//|.arch x64
//|.actionlist actions
static const unsigned char actions[4] = {
  184,237,195,255
};

// [...]

//|  mov eax, num
//|  ret
dasm_put(Dst, 0, num);
Here we see the source lines we wrote in the the .dasc file (now commented out) and the lines that resulted from them. The "action list" is the buffer of data that is generated by the DynASM preprocessor. It is byte-code that will be interpreted by the DynASM runtime; it intermixes a direct encoding of our assembly language instructions with actions that the DynASM runtime uses to link the code and insert our runtime values. In this case, the four bytes in our action list are interpreted as:
  • 184 -- the first byte of an x86 mov eax, [immediate] instruction.
  • 237 -- the DynASM bytecode instruction DASM_IMM_D, which indicates that the next argument to dasm_put() should be written as a four-byte value. This will complete the mov instruction.
  • 195 -- the x86 encoding of the ret instruction.
  • 255 -- the DynASM bytecode instruction DASM_STOP, which indicates that encoding should halt.
This action buffer is then referenced by the parts of the code that actually emit assembly instructions. These instruction-emitting lines are replaced with a call to dasm_put() that provides an offset into the action buffer and passes any runtime values that need to be substituted into the output (like our runtime value of num). dasm_put() will append these instructions (with our runtime value of num) into the buffer stored in state (see the #define Dst &state define above).

The result is that we get exactly the same effect as our first example, but this time we're using an approach lets us write assembly language symbolically. This is a much nicer way of programming a JIT.

A Simple JIT for Brainf*ck

The simplest Turing-complete language we could target would have to be the colorfully-named Brainf*ck (hereafter "BF"). BF manages to be Turing-complete (and even include I/O) in only eight commands. These commands can be thought of as a kind of byte code.

Without much more sophistication than our last example, we can have a fully-functional JIT for BF in under 100 lines of C (excluding our ~70-line shared driver file):
#include <stdint.h>

|.arch x64
|.actionlist actions
|
|// Use rbx as our cell pointer.
|// Since rbx is a callee-save register, it will be preserved
|// across our calls to getchar and putchar.
|.define PTR, rbx
|
|// Macro for calling a function.
|// In cases where our target is <=2**32 away we can use
|//   | call &addr
|// But since we don't know if it will be, we use this safe
|// sequence instead.
|.macro callp, addr
|  mov64  rax, (uintptr_t)addr
|  call   rax
|.endmacro

#define Dst &state
#define MAX_NESTING 256

void err(const char *msg) {
  fprintf(stderr, "%s\n", msg);
  exit(1);
}

int main(int argc, char *argv[]) {
  if (argc < 2) err("Usage: jit3 <bf program>");
  dasm_State *state;
  initjit(&state, actions);

  unsigned int maxpc = 0;
  int pcstack[MAX_NESTING];
  int *top = pcstack, *limit = pcstack + MAX_NESTING;

  // Function prologue.
  |  push PTR
  |  mov  PTR, rdi

  for (char *p = argv[1]; *p; p++) {
    switch (*p) {
      case '>':
        |  inc  PTR
        break;
      case '<':
        |  dec  PTR
        break;
      case '+':
        |  inc  byte [PTR]
        break;
      case '-':
        |  dec  byte [PTR]
        break;
      case '.':
        |  movzx edi, byte [PTR]
        |  callp putchar
        break;
      case ',':
        |  callp getchar
        |  mov   byte [PTR], al
        break;
      case '[':
        if (top == limit) err("Nesting too deep.");
        // Each loop gets two pclabels: at the beginning and end.
        // We store pclabel offsets in a stack to link the loop
        // begin and end together.
        maxpc += 2;
        *top++ = maxpc;
        dasm_growpc(&state, maxpc);
        |  cmp  byte [PTR], 0
        |  je   =>(maxpc-2)
        |=>(maxpc-1):
        break;
      case ']':
        if (top == pcstack) err("Unmatched ']'");
        top--;
        |  cmp  byte [PTR], 0
        |  jne  =>(*top-1)
        |=>(*top-2):
        break;
    }
  }

  // Function epilogue.
  |  pop  PTR
  |  ret

  void (*fptr)(char*) = jitcode(&state);
  char *mem = calloc(30000, 1);
  fptr(mem);
  free(mem);
  free_jitcode(fptr);
  return 0;
}
In this program we really see the DynASM approach shine. The way we can intermix C and assembly makes for a beautifully readable code generator.

Compare this with the code for the Berkeley Packet Filter JIT, which I mentioned earlier. Its code generator has a similar structure (a big switch() statement with byte-codes as cases), but without DynASM the code has to specify the instruction encodings manually. The symbolic instructions themselves are included only as comments, which the reader has to assume are correct. From arch/x86/net/bpf_jit_comp.c in the Linux kernel:
    switch (filter[i].code) {
    case BPF_S_ALU_ADD_X: /* A += X; */
            seen |= SEEN_XREG;  
            EMIT2(0x01, 0xd8);              /* add %ebx,%eax */ 
            break;
    case BPF_S_ALU_ADD_K: /* A += K; */
            if (!K)
                    break;              
            if (is_imm8(K))     
                    EMIT3(0x83, 0xc0, K);   /* add imm8,%eax */ 
            else
                    EMIT1_off32(0x05, K);   /* add imm32,%eax */
            break;
    case BPF_S_ALU_SUB_X: /* A -= X; */
            seen |= SEEN_XREG;  
            EMIT2(0x29, 0xd8);              /* sub    %ebx,%eax */ 
            break;
This JIT seems like it would benefit a lot from using DynASM, but there may be externalities that would prevent this. For example, the build-time dependency on Lua may be unacceptable to the Linux people. If the preprocessed DynASM file were checked into Linux's git repository, this would avoid the need for Lua unless the JIT were actually being modified, but perhaps even this is too much for Linux's build system standards. In any case, our approach compares very favorably to this.

There are a few things I should explain about our BF JIT, since it does use a few more features of DynASM than the previous example. First, you'll notice we've used a .define directive that aliases PTR to the register rbx. This is a nice bit of indirection that lets us specify our register allocation up-front and then refer to registers symbolically. This requires a bit of care though; any code that refers to both PTR and rbx will obscure the fact that both are the same register! In a JIT I've been working on I ran into a tricky bug like this at least once.

Secondly, you'll see that I have defined a DynASM macro with .macro. A macro is a set of DynASM lines that will be substituted into any code that invokes the macro.

The last new DynASM feature we see here is pclabels. DynASM supports three different kinds of labels that we can use as branch targets; pclabels are the most flexible because we can adjust how many there are at runtime. Every pclabel is identified by an unsigned int that is used both to define the label and to jump to it. Each label must be in the range [0, maxpc), but we can grow maxpc by calling dasm_growpc(). DynASM stores the pclabels as a dynamic array, but we don't have to worry about growing it too often because DynASM grows the allocation exponentially. DynASM pclabels are defined and referenced with the syntax =>labelnum, where labelnum can be an arbitrary C expression.

One final note about our BF JIT. Our generated code is very simple and elegant, and should be very efficient, but is not maximally efficient. In particular, since we don't have a register allocator, we always read and write cell values straight from memory instead of caching them in registers. If we needed to squeeze out even more performance, we would want an approach that does perform register allocation and other optimizations. To compare the relative performance gains of various approaches, I ran a quick and dirty benchmark across several different BF implementations:
  • brainf*ck.c, a simple, non-optimizing interpreter written in C.
  • bff, a "moderately optimizing brainf*ck interpreter"
  • bf2c.hs, a BF to C compiler, which I then compiled with gcc (which performs register allocation and other optimizations).

For my test program I used mandelbrot.bf, which prints a text rendering of the Mandelbrot set. The results I got were:

BF implementationTime
brainf*ck.c1m0.541s
bff6.166s
bf2c1.244s
jit3 (our JIT)3.745s


So while our JIT did beat the optimizing interpreter by about 65%, it was no match for the optimizing compiler. DynASM is still absolutely suitable for even the highest-performance JITs (like LuaJIT), but to be that fast you have to be more aggressive with the optimizations you perform prior to the code generation step.

Conclusion

I had originally intended to provide one more example: a JIT for the ICFP 2006 contest, which described a virtual machine specification called the Universal Machine that was supposedly used by a fictitious ancient society of programmers called "The Cult of the Bound Variable." This problem has been a favorite of mine for a while, and was an early influence that helped to pique my interest in virtual machines. It is such a fun problem that I really want to write a JIT for it someday.

Unfortunately I've already spent too long on this article, and have run into roadblocks (like the reference specification of the Universal Machine from the technical report is crashing on me, which would make performance comparisons difficult). This would also be a significantly more complicated undertaking primarily because of the fact that this virtual machine allows self-modifying code. BF was easy because code and data were separate and it is impossible to modify the program while it is executing. If self-modifying code is allowed, you have to re-JIT code when it changes, which can be particularly difficult if you're trying to patch new code into an existing code sequence. There are certainly ways of doing this, it's just a more complicated undertaking that will have to be a separate blog article someday.

So while I won't be bringing you a JIT for the Universal Machine today, you can check out an existing implementation that uses DynASM already. It's for 32-bit x86, not x86-64, and has other limitations as described in its README, but it can give you a sense for what the problem is like and some of the difficulties of self-modifying code.

There are also many more features of DynASM that we have not covered. One particularly novel feature is typemaps, which let you symbolically compute effective addresses of structure members (for example, if you had a struct timeval* in a register, you could compute the effective address of the member tv_usec by writing TIMEVAL->tv_usec). This makes it much easier to interoperate with C-based data structures from your generated assembly.

DynASM is a really beautiful piece of work, but doesn't have much documentation -- you have to be resourceful and learn by example. I hope this article will lessen the learning curve a bit, as well as demonstrate that JITs really can have "Hello, World" programs that do something interesting and useful in a very small amount of code. And for the right kind of person, they can be a lot of fun to write too.

Thursday, December 27, 2012

Testing for Integer Overflow in C and C++

Update: Thanks to Mihai Rusu who discovered that my "correct" functions had a bug in them; I have updated them to fix the bug.

Update2: Thanks to Kevin Bailey who discovered that my final functions had a redundant comparison in them. I have updated them to remove it.

Update2: Thanks to Ami Fischman who pointed me to this code in the Chromium tree that has implemented this functionality in a way that appears to avoid all of the compiler warnings I was getting: src/base/safe_numerics.h.

Integer overflow (and underflow -- I'll lump them together) is one of those pesky things that creeps up in the real world and makes low-level software a little less clean and elegant than what you might see in an algorithms textbook. Checking for overflow is one of those things that distinguishes production-quality software from toy code. If you program in C or C++, it's something you should always be aware of, especially since it can be a security issue.

If you're written much low-level software that deals with buffers or offsets, you've probably already come across this pattern for testing if an addition will overflow:
bool WillOverflow_BAD(uintmax_t x, int add) {
  // XXX -- doesn't work!
  // Overflow causes wraparound and always returns false
  return x + add > UINTMAX_MAX;
}

bool WillOverflow_GOOD(uintmax_t x, int add) {
  // This won't overflow because "x" can't be greater than UINTMAX_MAX
  return UINTMAX_MAX - x < add;
}
If you compile the first example in optimized mode and look at its output, you will see that the function has been optimized to simply return 0 without looking at its arguments at all. Our attempt to check for overflow has been completely defeated. But the second version works as intended. Here the overflow had an easy fix.

The fix is not so easy if we want to test whether a given value can be directly converted to a given data type without overflow. In fact this problem is surprisingly hard. But let's start with a problem statement. I'll write it in C++ because templates are convenient, but the issue applies equally to C:
#include <assert.h>
#include <limits.h>

// This is the function we want: it returns true if
// converting "val" to type "To" will overflow or underflow.
template<typename To, typename From>
bool WillOverflow(From val);

int main() {
  assert(!WillOverflow<int>(0));
  assert(!WillOverflow<unsigned>(0));
  assert(WillOverflow<unsigned>(-1));
  assert(WillOverflow<unsigned>(-1LL));
  assert(!WillOverflow<long>(-1));
  assert(WillOverflow<long>(ULONG_MAX));
  assert(!WillOverflow<unsigned long>(ULONG_MAX));
  assert(WillOverflow<unsigned long>(LONG_MIN));
  assert(!WillOverflow<long>(LONG_MAX));
  return 0;
}
You already know the naive solution isn't going to work, so let's just get it out of the way:
template<typename To, typename From>
bool WillOverflow(From val) {
  // DOESN'T WORK (see below).
  return (From)(To)val != val;
}
The theory here is that if we can round-trip to the destination type and back, then the destination type can represent our target value. But alas, our test program fails!
$ ./overflow 
overflow: overflow.cc:17: int main(): Assertion `WillOverflow<unsigned>(-1)' failed.
Aborted (core dumped)
What happened? Well it turns out that on my platform, -1 can round-trip to unsigned just fine. I'm on a two's complement system and converting the integer -1 to unsigned yields 0xFFFFFFFF (also known as UINT_MAX), which becomes -1 when converted back to int. But just because the value could round-trip to my destination type doesn't mean it has the correct value in my destination type. So our first attempt is unsuccessful.

(Note that although this example uses C++ and templates, the lesson equally applies to an attempt in C to test (int)(unsigned)x == x).

If round-tripping isn't a good test, perhaps we can explicitly test that our value is inside the range of our destination type.
template<typename To, typename From>
bool WillOverflow(From val) {
  // DOESN'T WORK (see below).
  return val < std::numeric_limits<To>::min() ||
         val > std::numeric_limits<To>::max();
}
But this does no better; it fails on exactly the same assertion, but this time for a different reason. In our failure case the comparisons are mixed-type comparisons, meaning that the operands are of different types (in this case int and unsigned). How does C++ handle mixed-type comparisons? The hardware can only really compare two values of the same type, so which type do both operands get converted to before the comparison happens?

This is decided by what the C and C++ standards call the usual arithmetic conversions. The usual arithmetic conversions are more complicated than I want to inaccurately summarize here, but in this case they say that -1 is converted to unsigned before being compared in the unsigned domain. So naturally the resulting value will be in range for the unsigned data type.

The "usual arithmetic conversions" are confounding our attempts to ask questions about the original value. So we need to avoid them by ensuring that both sides of our comparison are the same type. But there is no numeric type in C or C++ that can represent all integral values (no type can represent both LLONG_MIN and ULLONG_MAX), so what domain can we use for our range comparisons?

You might be tempted to reach for two's complement tricks like testing the sign bit of signed types, but resist the urge: the standard does not guarantee two's complement arithmetic, and besides, the real solution is more principled anyway:
#include <limits>
#include <cstdint>  // Requires c++11, needed for intmax_t/uintmax_t.

template<typename To, typename From>
bool WillOverflow(From val) {
  assert(std::numeric_limits<From>::is_integer);
  assert(std::numeric_limits<To>::is_integer);
  if (std::numeric_limits<To>::is_signed) {
    return (!std::numeric_limits<From>::is_signed &&
              (uintmax_t)val > (uintmax_t)INTMAX_MAX) ||
           (intmax_t)val < (intmax_t)std::numeric_limits<To>::min() ||
           (intmax_t)val > (intmax_t)std::numeric_limits<To>::max();
  } else {
    return val < 0 ||
           (uintmax_t)val > (uintmax_t)std::numeric_limits<To>::max();
  }
}
The types intmax_t and uintmax_t are defined in C99 and C++11 as types capable of representing any value of any signed/unsigned integer type, respectively. So if we do our comparisons using those types, we should be able to avoid getting overflow before the comparison is even performed. By casting both sides of the expression to the same type, we avoid the usual arithmetic conversions. But we still have to handle the case where we are converting a signed type to an unsigned one or vice-versa. If we are converting int x to an unsigned type, we need to check whether x < 0 first, before doing our uintmax_t conversions, because once we convert a negative number to uintmax_t it will become a large signed number and appear in range.

So finally we have a solution that works and passes our initial test cases. But the implementation still has one small wart; it throws warnings like the following:
$ clang -std=c++0x -o overflow overflow.cc
overflow.cc:14:16: warning: comparison of unsigned expression < 0 is always false
      [-Wtautological-compare]
    return val < 0 ||
           ~~~ ^ ~
This check exists for when we are converting from a signed type; we want and expect it to be a no-op when converting from an unsigned type. But clang doesn't know we know this so it warns us anyway. How can we make this warning go away? The warning is thrown even if the comparison is unreachable, so changing the expression to explicitly check whether the type is signed first does not help:
    // Doesn't help -- still throws a warning:
    return (std::numeric_limits<From>::is_signed && val < 0) ||
It seems that the only solution to this is to use template partial specialization to ensure that that the val < 0 comparison is only generated for signed types. And functions cannot be partially specialized (only classes) so we need to use a functor. Getting around this warning unfortunately bloats the code significantly, but the solution below does work and avoids the warning.
#include <limits>
#include <cstdint>
#include <assert.h>

template<bool is_signed, typename T>
class IsNegativeFunctor;

template<typename T>
class IsNegativeFunctor<true, T> {
 public:
  bool operator()(T x) {
    return x < 0;
  }
};

template<typename T>
class IsNegativeFunctor<false, T> {
 public:
  bool operator()(T x) {
    // Unsigned type is never negative.
    return false;
  }
};

template<typename T>
bool IsNegative(T x) {
  return IsNegativeFunctor<std::numeric_limits<T>::is_signed, T>()(x);
}

template<typename To, typename From>
bool WillOverflow(From val) {
  assert(std::numeric_limits<From>::is_integer);
  assert(std::numeric_limits<To>::is_integer);
  if (std::numeric_limits<To>::is_signed) {
    return (!std::numeric_limits<From>::is_signed &&
              (uintmax_t)val > (uintmax_t)INTMAX_MAX) ||
           (intmax_t)val < (intmax_t)std::numeric_limits<To>::min() ||
           (intmax_t)val > (intmax_t)std::numeric_limits<To>::max();
  } else {
    return IsNegative(val) ||
           (uintmax_t)val > (uintmax_t)std::numeric_limits<To>::max();
  }
}

Thursday, November 22, 2012

dumpfp: A Tool to Inspect Floating-Point Numbers

Floating-point math has a reputation for being very unpredictable and hard to understand. I think that a major reason for this is that floating-point values are hard to inspect. Take the following program:
#include <stdio.h>

int main() {
  double x = 0.1;
  printf("%f\n", x);
}
Here is the output I get on my system:
0.100000
You might be tempted to think that the variable x is indeed equal to the value 0.1, but don't be fooled! In fact 0.1 is a rounded version of x's true value, which will become apparent if we ask for more precision:
#include <stdio.h>

int main() {
  double x = 0.1;
  printf("%.30f\n", x);
}
0.100000000000000005551115123126
It's hard to understand what we can't easily inspect. To remedy this, I've just written a new tool called dumpfp. Thing of it as the floating-point toString() method you never had. It prints the precise value of the number, in both rational and decimal forms:
$ ./dumpfp 0.1
Single Precision (IEEE 32-bit):
           raw = 0x3dcccccd
          sign = 0x0
      exponent = 0x7b (-4)
   significand = 0x4ccccd

   VALUE CALCULATION =
       significand   (1 + 5033165/2^23  (1.60000002384185791015625))
     * 2^exponent    (2^-4)
     = VALUE         (13421773/2^27  (0.100000001490116119384765625))

Double Precision (IEEE 64-bit):
           raw = 0x3fb999999999999a
          sign = 0x0
      exponent = 0x3fb (-4)
   significand = 0x999999999999a

   VALUE CALCULATION =
       significand   (1 + 1351079888211149/2^51  (1.600000000000000088817841970012523233890533447265625))
     * 2^exponent    (2^-4)
     = VALUE         (3602879701896397/2^55  (0.1000000000000000055511151231257827021181583404541015625))
Notice that the actual value was not exactly 0.1, because binary floating point can only exactly represent rational values with a power-of-two denominator. You can see that the approximation is closer for double than it is for float

The tool first breaks down the raw bytes of the value into its constituent parts. IEEE floating-point values consist of a sign bit, some number of exponent bits, followed by a significand. These values are then combined together according to the expression significand * 2^exponent. The tool will show you all of the intermediate values and the final result

You'll notice that these numbers have many decimal digits. Not all of these are significant. We can calculate how many digits are significant but I couldn't think of an easy-to-read way of printing the exact value but also indicating which digits are significant. If anyone has a bright idea here, please do drop me a line (or fork me on GitHub).

The tool's output is less noisy for a value that can be represented exactly:
$ ./dumpfp 0.5
Single Precision (IEEE 32-bit):
           raw = 0x3f000000
          sign = 0x0
      exponent = 0x7e (-1)
   significand = 0x0

   VALUE CALCULATION =
       significand   (1 + 0/2^0  (1.0))
     * 2^exponent    (2^-1)
     = VALUE         (1/2^1  (0.5))

Double Precision (IEEE 64-bit):
           raw = 0x3fe0000000000000
          sign = 0x0
      exponent = 0x3fe (-1)
   significand = 0x0

   VALUE CALCULATION =
       significand   (1 + 0/2^0  (1.0))
     * 2^exponent    (2^-1)
     = VALUE         (1/2^1  (0.5))
You can also use it for special values like NaN or Infinity:
$ ./dumpfp nan
Single Precision (IEEE 32-bit):
           raw = 0x7fc00000
          sign = 0x0
      exponent = 0xff (NaN or Infinity)
   significand = 400000 (non-zero indicates NaN)


Double Precision (IEEE 64-bit):
           raw = 0x7ff8000000000000
          sign = 0x0
      exponent = 0x7ff (NaN or Infinity)
   significand = 8000000000000 (non-zero indicates NaN)

And for very large values it will print the full integer value:
$ ./dumpfp 1e30
Single Precision (IEEE 32-bit):
           raw = 0x7149f2ca
          sign = 0x0
      exponent = 0xe2 (99)
   significand = 0x49f2ca

   VALUE CALCULATION =
       significand   (1 + 2423141/2^22  (1.5777218341827392578125))
     * 2^exponent    (2^99)
     = VALUE         (1000000015047466219876688855040)

Double Precision (IEEE 64-bit):
           raw = 0x46293e5939a08cea
          sign = 0x0
      exponent = 0x462 (99)
   significand = 0x93e5939a08cea

   VALUE CALCULATION =
       significand   (1 + 1300913865115253/2^51  (1.577721810442023642195863430970348417758941650390625))
     * 2^exponent    (2^99)
     = VALUE         (1000000000000000019884624838656)
I learned a lot by writing this tool, and I hope it helps you understand floating-point better. Floating-point numbers don't have to be that mysterious: they have very specific values as we can see. The trickiness comes from the fact that values get rounded if they can't be represented exactly; hopefully this tool will make it clear when a value can be represented or not.

For example, one thing that can run you into trouble is trying to add two numbers where one is much larger than the other. Suppose you wanted to add 1 + 1e-16. The result should be 1.00000000000000001, but can this number be represented in floating-point?
$ ./dumpfp 1.00000000000000001
Single Precision (IEEE 32-bit):
           raw = 0x3f800000
          sign = 0x0
      exponent = 0x7f (0)
   significand = 0x0

   VALUE CALCULATION =
       significand   (1 + 0/2^0  (1.0))
     * 2^exponent    (2^0)
     = VALUE         (1 + 0/2^0  (1.0))

Double Precision (IEEE 64-bit):
           raw = 0x3ff0000000000000
          sign = 0x0
      exponent = 0x3ff (0)
   significand = 0x0

   VALUE CALCULATION =
       significand   (1 + 0/2^0  (1.0))
     * 2^exponent    (2^0)
     = VALUE         (1 + 0/2^0  (1.0))
We see here that the number 1.00000000000000001 can't be represented in floating-point, and the closest approximation we had available to us was 1.0. We lost the smaller number completely! It's not that operations like addition are imprecise, it's that the result of the operation might not be possible to store in a floating-point value, so it rounds to the nearest representable number.

Sunday, May 13, 2012

Envisioning a non-evil version of Trusted Computing

If you were reading tech blogs in the early 2000's, you probably heard a lot about Trusted Computing (and Microsoft's version of it called "Palladium"). It was the content industry's attempt to make DRM have real teeth on PCs. Normal DRM isn't too hard to circumvent on a traditional operating system, because there are a lot of ways for the user to inject their own code somewhere in the signal path that siphons the data off somewhere else. The code that gets raw access to the "protected" data is not too hard to tamper with.

The Trusted Computing Group, formed in 2003, set out to solve this problem. Through a hardware chip on the motherboard, it effectively allows a record company (for example) to send your computer some music and be guaranteed that only "trusted" software can have access to that data. "Trusted" in this context means "trusted by the record company," and for this to work, every bit of software from the bootloader up through the drivers and music-playing application has to be "trusted." The "untrusted" part of your software stack is denied access to the memory and sound hardware that is used by the trusted signal path. The same mechanisms can also result in lots of other unsavory outcomes like remote censorship.

Thankfully this vision of computer media distribution did not catch on in PCs. The GNU project called it "Treacherous Computing" (which might not be the most brilliant marketing, but certainly was on the right track). Apparently modern PCs do have Trusted Computing chips in them, which is what Microsoft's BitLocker encryption scheme uses to ensure that its software has not been tampered with (ie. for non-evil purposes). But we're thankfully not in a world where your only option for buying music and movies requires an RIAA-approved kernel and sound-card driver.

I was thinking about Trusted Computing recently as I was thinking about Ken Thompson's famous talk Reflections on Trusting Trust, whose central thesis is: "You can't trust code that you did not totally create yourself." Thompson was talking primarily about compilers, but the same applies to the environment (OS, shell, etc) that you are using to compile and/or run programs. A trojan in any one of those layers can compromise your entire system; even if you trust the source code you are running, a trojan in the OS or environment can propagate itself, and worse, hide its very existence. This is the essence of what a rootkit does.

The solution to this problem is presented in David A. Wheeler's very interesting thesis Fully Countering Trusting Trust through Diverse Double-Compiling. To vastly over-simplify, this thesis shows that if you have one trusted environment/compiler, you can use it to extend trust to another environment/compiler. The newly-trusted environment can be larger or more complex than the already-trusted environment, so you can use this to extend trust to more components until you Trusted Computing Base that provides the services that you actually need.

This idea sounded awfully similar to the evil version of Trusted Computing (which verifies all software up from the bootloader), so I set out to find out what made one good and the other evil. I haven't analyzed it exhaustively (and probably don't have the time to do so), but my intuition is that a non-evil form of Trusted Computing is completely possible, and something the industry should pursue.

Palladium-style trusted computing is evil because it is primarily designed to give a third party (namely a record/movie company) more control over a computer than that computer's owner. But what if this were flipped around: what if it were designed to give the computer's owner more control over that system than any of the software running on it?

Imagine if a TPM (Trusted Platform Module) could guarantee that you're not running a rootkit? Imagine if it could tell you exactly which processes in your system were trusted -- not by some third party, but by you as the system's owner? The user interface for this could be something like:
  • When you buy a new computer, the BIOS has a special boot mode that will reset the TPM's private key. It will also generate a new private key for the system's owner and write it to a USB device that can sign random numbers without the key leaving the device. So now the computer's TPM and this USB key mutually trust each other and can set up secure communication channels between them. Unless you mistrust your computer system's hardware, this trust path is secure.
  • Next you install your OS from a CD that you trust, without being connected to the network. None of the software is trusted when first installed (it runs fine, just doesn't have the "trusted" bit set according to the TPM). Once it's all installed you put your USB key in and tell the TPM to trust all of the software you just installed.
  • Once your reboot, your TPM and Trusted Computing-enabled OS cooperate so that your bootloader, OS, and programs all have a "trusted" bit on them that you can check. You can configure your OS to not load any privileged code (like driver os kernel modules) unless you trust it. If you want to upgrade any software in your system, you have to re-insert your USB device. But the upside is that a rootkit cannot load itself unless the USB key is in and you authorize the rootkit to be trusted.
There are obviously a lot of details that I've skimmed over, but if a thing like this could be achieved, it would be a very positive force for helping computer owners keep their systems secure. It could deny rootkits the right to exist. It could give you the integrity guarantees of a highly locked-down system like an iPad without sacrificing the flexibility of a desktop computer.

Right now I have a nagging worry that whoever hacked my Wordpress blog could possibly have gotten their hands on a private key that would have compromised other systems too (it was encrypted with a good password, but could they have had a keylogger installed on my Dreamhost account?). I'll never know for sure that I'm safe unless I completely reinstall the other systems from scratch, but that would be a colossal amount of effort for a worry that is almost certainly baseless. I wish hardware could help me out here, and give me confidence that my OS has not been tampered with.