Tuesday, May 14, 2013

Deep Wizardry: Stack Unwinding

This weekend I budgeted "a few hours" to learn about stack unwinding, but oh what a journey down the rabbit hole it became. This esoteric topic lies at the intersection of compilers, linkers, loaders, debuggers, ABIs, and language runtimes. There is precious little documentation about it and, from what I can tell, a small handful of guys who answer all the questions about it. People speculate about what happens if Linus Torvalds gets hit by a bus; I think we're just as screwed if the fellowship of linker/loader wizards bite it (or decide to change careers). But I digress.

I feel compelled to write this article in Q&A style. Plato I'm not, but maybe it's because there are so many questions to answer.

What is stack unwinding?

In any language that supports recursive functions (ie. pretty much everything except Fortran 77 and Brainf*ck) the language runtime keeps a stack of what functions are currently executing. Stack unwinding is a way of inspecting, and possibly modifying, that stack.

Why would you want to do that?

The answer may seem obvious, but there are several related, yet subtly different, situations where unwinding is useful or necessary:
1. As a runtime control-flow mechanism (C++ exceptions, C longjmp(), etc).
2. In a debugger, to show the user the stack.
3. In a profiler, to take a sample of the stack.
4. From the program itself (like from a crash handler to show the stack).
These have subtly different requirements. Some of these are performance-critical, some are not. Some require the ability to reconstruct registers from outer frame, some do not. But we'll get into all that in a second.

Ok, lay it on me: how do you unwind the stack?

Let's start with the simplest case, which is the longjmp() function from C (see The Wikipedia entry for setjmp.h for the basics). longjmp() jumps down the stack to the corresponding setjmp(), across as many stack frames as there are in between. Unlike the other approaches we will see, longjmp() does not iterate over the list of frames, it just blasts through them all at once with wanton disregard for anything precious or beautiful that they might have contained.

The implementations of setjmp() and longjmp are surprisingly simple, though heavily architecture specific. For example, let's take a look at the x86-64 implementation of setjmp/longjmp from the FreeBSD kernel (you can look at the userspace version from libc too, but it is slightly more complicated because it also has to save/restore other status registers). I've converted this to Intel syntax:
_setjmp:
mov    QWORD PTR [rdi],rbx        ; save rbx
mov    QWORD PTR [rdi+0x8],rsp    ; save rsp
mov    QWORD PTR [rdi+0x10],rbp   ; save rbp
mov    QWORD PTR [rdi+0x18],r12   ; save r12
mov    QWORD PTR [rdi+0x20],r13   ; save r13
mov    QWORD PTR [rdi+0x28],r14   ; save r14
mov    QWORD PTR [rdi+0x30],r15   ; save r15
mov    rdx,QWORD PTR [rsp]        ; get return address
mov    QWORD PTR [rdi+0x38],rdx   ; save return address
xor    eax,eax                ; return 0
ret

_longjmp:
mov    rbx,QWORD PTR [rdi]        ; restore rbx
mov    rsp,QWORD PTR [rdi+0x8]    ; restore rsp
mov    rbp,QWORD PTR [rdi+0x10]   ; restore rbp
mov    r12,QWORD PTR [rdi+0x18]   ; restore r12
mov    r13,QWORD PTR [rdi+0x20]   ; restore r13
mov    r14,QWORD PTR [rdi+0x28]   ; restore r14
mov    r15,QWORD PTR [rdi+0x30]   ; restore r15
mov    rdx,QWORD PTR [rdi+0x38]   ; get return address
mov    QWORD PTR [rsp],rdx        ; restore return address
xor    eax,eax
inc    eax                    ; return 1
ret

Maybe it's just me, but I expected these functions to look way more gnarly. All setjmp is doing is saving a bunch of registers including rsp (the stack pointer) and the return address into the jmp_buf array that was passed as a parameter. All longjmp is doing is restoring those registers and the return address of the original setjmp call. This tweaking of the return address is how the call to longjmp manages to "return from" the original setjmp call.

This technique is simple, but so primitive that you might not even call it "unwinding." We are just setting the stack pointer rsp to its old value, which instantly discards the entire contents of the stack between the longjmp and setjmp calls. Any resources that were stored there are leaked! Any other data that was in the middle of being modified might be corrupted, unless you took care to leave it in a consistent state before making the function call that ultimately called longjmp.

What about the other registers, like rax?

Good eye! The x86-64 architecture has 16 general-purpose registers, but we only saved 7 of them. What about the others? Well recall that setjmp and longjmp are implemented as functions and as such they follow the standard x86-64 calling convention. The x86-64 calling convention dictates that the 7 registers above are owned by the caller (also known as callee-save), which means that setjmp is responsible for restoring these registers before it returns. The other registers are owned by the callee function, the callee is allowed to clobber them however it likes before returning. So we're under no obligation to restore these registers before returning, and the caller can't make any assumptions about what they will contain.

Who decides these calling conventions?

Probably the same gurus who answer all my linker/loader questions. In my mind they all sit in a circle and meditate, reflecting on ancient texts from The Cult of the Bound Variable.

Ok, that makes sense, stack unwinding doesn't seem so bad.

That's because I started with the simplest case. Every other stack-unwinding mechanism has to walk the stack, frame-by-frame.

How do you walk the list of frames?

There are two main ways. I'll start with the simpler one. If you've ever looked at a function in assembly, you might have seen a prologue and epilogue that look like this:
  push  rbp
mov   rbp, rsp
;  other stuff...
pop   rbp
ret

I remember when I was taking assembly language in college I could not understand why I had to do this. What's the point of having both a base pointer and a stack pointer, both pointing at the stack? As long as you adjust the stack pointer back to where it started before you call ret you should be fine, right?

It turns out this convention comes in really handy when you want to walk the stack (like to generate a backtrace), because it connects all the stack frames together into a linked list whose head pointer is rbp! This means that generating the back trace is really as simple as a linked list traversal.

Aha! So that's how debugger backtraces and backtrace() from libc work!

Actually... no. But that's how they used to work! On 32-bit x86, it used to be that everyone compiled with a frame pointer, because this was the only way of getting stack traces in the debugger. But this scheme has two significant overheads:
1. One general-purpose register is used up at all times. On x86, this meant 1 out of only 8 registers was gone. On x86-64, it would be 1 out of 16.
2. Every function has to have this prologue and epilogue I showed above.
To avoid these overheads, the people who were developing the ABI for x86-64 decided to no longer require a base pointer in every function. Here is some more info from someone who was not thrilled with that decision. (Apparently it caused problems for DTrace). And as of recently (around GCC 4.6, from what I'm told), omitting the frame pointer has become the default on 32-bit x86 too, which frees up a register and improves performance.

How can you walk up the stack without a base pointer?

I know, it seems hopeless. Without a base pointer stored in each stack frame, how do you find the previous frame?

What the tools do nowadays is store some information in a different part of the file that provides "unwind information." Unwind information tells you, for an arbitrary instruction:
1. how much the stack pointer has been adjusted (or equivalently, how much the unwinder has to adjust it back to get to the base of the frame, which is necessary to find the return address).
2. where in the frame the callee-save registers are stored, if they have been clobbered since the function began.
3. (optionally) a pointer to a custom "personality function" for unwinding the stack
4. lots of other things? I don't fully understand this yet.
While more complicated, this gives us a capability we didn't have with the simple linked-list approach: we can restore the callee-save registers of previous frames, so a debugger can show local variables that were saved in registers.

That sounds useful! Who defines the format of debug information?

Debug information on most modern UNIX-like OS's is stored in the DWARF format (a clever pun on the ELF object file format). DWARF specifies that this unwind information should go in a section called .debug_frame.

I see, so the modern way to specify unwind information is in the .debug_frame section.

That would be too easy. When the x86-64 ABI was being developed, they decided to specify that debug information would go in the .eh_frame section, in a slightly different format than is defined by DWARF. .eh_frame is what all the tools appear to use in practice. And there is a standard, portable library for unwinding based on the .eh_frame/.debug_frame mechanism called libunwind.

Gotcha, so gdb and g++'s exception support use libunwind?

No, they have their own unwinder implementations that are completely independent of libunwind (gcc's unwind-dw2.c and gdb's dwarf2-frame.c). Gcc can link against libunwind but uses its own unwinder by default. gdb has a whole bunch of unwinders, including one that scans the machine code to figure out how the frame was setup.

Ok, so what about C++ exceptions?

C++ exceptions use .eh_frame information also. They use this information to not only recreate the frame, but execute code in it (C++ destructors and catch blocks) and finally to dismantle it before transferring control to the frame with the matching catch block.

How on Earth does that work?

I don't really know, but when I find out I'll post another blog entry.