Deep Wizardry: Stack Unwinding
This weekend I budgeted “a few hours” to learn about stack unwinding, but oh what a journey down the rabbit hole it became. This esoteric topic lies at the intersection of compilers, linkers, loaders, debuggers, ABIs, and language runtimes. There is precious little documentation about it and, from what I can tell, a small handful of guys who answer all the questions about it. People speculate about what happens if Linus Torvalds gets hit by a bus; I think we’re just as screwed if the fellowship of linker/loader wizards bite it (or decide to change careers). But I digress.
I feel compelled to write this article in Q&A style. Plato I’m not, but maybe it’s because there are so many questions to answer.
What is stack unwinding?
In any language that supports recursive functions (ie. pretty much everything except Fortran 77 and Brainf*ck) the language runtime keeps a stack of what functions are currently executing. Stack unwinding is a way of inspecting, and possibly modifying, that stack.
Why would you want to do that?
The answer may seem obvious, but there are several related, yet subtly different, situations where unwinding is useful or necessary:
- As a runtime control-flow mechanism (C++ exceptions, C `longjmp()`, etc).
- In a debugger, to show the user the stack.
- In a profiler, to take a sample of the stack.
- From the program itself (like from a crash handler to show the stack).
These have subtly different requirements. Some of these are performance-critical, some are not. Some require the ability to reconstruct registers from outer frame, some do not. But we’ll get into all that in a second.
Ok, lay it on me: how do you unwind the stack?
Let’s start with the simplest case, which is the longjmp()
function
from C (see The
Wikipedia entry for setjmp.h for the basics). longjmp()
jumps
down the stack to the corresponding setjmp()
, across as many stack
frames as there are in between. Unlike the other approaches we will
see, longjmp()
does not iterate over the list of frames, it just
blasts through them all at once with wanton disregard for anything
precious or beautiful that they might have contained.
The implementations of setjmp()
and longjmp
are surprisingly
simple, though heavily architecture specific. For example, let’s take
a look at the
x86-64 implementation of setjmp/longjmp
from the FreeBSD kernel
(you can look at the
userspace version from libc too, but it is slightly more
complicated because it also has to save/restore other status
registers). I’ve converted this to Intel syntax:
Maybe it’s just me, but I expected these functions to look way more
gnarly. All setjmp
is doing is saving a bunch of registers
including rsp
(the stack pointer) and the return address into the
jmp_buf
array that was passed as a parameter. All longjmp
is
doing is restoring those registers and the return address of the
original setjmp
call. This tweaking of the return address is how
the call to longjmp
manages to “return from” the original setjmp
call.
This technique is simple, but so primitive that you might not even
call it “unwinding.” We are just setting the stack pointer rsp
to
its old value, which instantly discards the entire contents of the
stack between the longjmp
and setjmp
calls. Any resources that
were stored there are leaked! Any other data that was in the middle
of being modified might be corrupted, unless you took care to leave it
in a consistent state before making the function call that ultimately
called longjmp
.
What about the other registers, like rax
?
Good eye! The x86-64 architecture has 16 general-purpose registers,
but we only saved 7 of them. What about the others? Well recall that
setjmp
and longjmp
are implemented as functions and as such they
follow the standard x86-64 calling convention. The x86-64 calling
convention dictates that the 7 registers above are owned by the caller
(also known as callee-save), which means that setjmp
is
responsible for restoring these registers before it returns. The
other registers are owned by the callee function, the callee is
allowed to clobber them however it likes before returning. So we’re
under no obligation to restore these registers before returning, and
the caller can’t make any assumptions about what they will contain.
Who decides these calling conventions?
Probably the same gurus who answer all my linker/loader questions. In my mind they all sit in a circle and meditate, reflecting on ancient texts from The Cult of the Bound Variable.
Ok, that makes sense, stack unwinding doesn’t seem so bad.
That’s because I started with the simplest case. Every other stack-unwinding mechanism has to walk the stack, frame-by-frame.
How do you walk the list of frames?
There are two main ways. I’ll start with the simpler one. If you’ve ever looked at a function in assembly, you might have seen a prologue and epilogue that look like this:
I remember when I was taking assembly language in college I could
not understand why I had to do this. What’s the point of having
both a base pointer and a stack pointer, both pointing at the stack?
As long as you adjust the stack pointer back to where it started
before you call ret
you should be fine, right?
It turns out this convention comes in really handy when you want to
walk the stack (like to generate a backtrace), because it connects all
the stack frames together into a linked list whose head pointer is
rbp
! This means that generating the back trace is really as simple
as a linked list traversal.
Aha! So that’s how debugger backtraces and backtrace()
from libc
work!
Actually… no. But that’s how they used to work! On 32-bit x86, it used to be that everyone compiled with a frame pointer, because this was the only way of getting stack traces in the debugger. But this scheme has two significant overheads:
- One general-purpose register is used up at all times. On x86, this meant 1 out of only 8 registers was gone. On x86-64, it would be 1 out of 16.
- Every function has to have this prologue and epilogue I showed above.
To avoid these overheads, the people who were developing the ABI for x86-64 decided to no longer require a base pointer in every function. Here is some more info from someone who was not thrilled with that decision. (Apparently it caused problems for DTrace). And as of recently (around GCC 4.6, from what I’m told), omitting the frame pointer has become the default on 32-bit x86 too, which frees up a register and improves performance.
How can you walk up the stack without a base pointer?
I know, it seems hopeless. Without a base pointer stored in each stack frame, how do you find the previous frame?
What the tools do nowadays is store some information in a different part of the file that provides “unwind information.” Unwind information tells you, for an arbitrary instruction:
- how much the stack pointer has been adjusted (or equivalently, how much the unwinder has to adjust it back to get to the base of the frame, which is necessary to find the return address).
- where in the frame the callee-save registers are stored, if they have been clobbered since the function began.
- (optionally) a pointer to a custom “personality function” for unwinding the stack
- lots of other things? I don’t fully understand this yet.
While more complicated, this gives us a capability we didn’t have with the simple linked-list approach: we can restore the callee-save registers of previous frames, so a debugger can show local variables that were saved in registers.
That sounds useful! Who defines the format of debug information?
Debug information on most modern UNIX-like OS’s is stored in the DWARF format (a clever
pun on the ELF
object file format). DWARF specifies that this unwind information
should go in a section called .debug_frame
.
I see, so the modern way to specify unwind information is in the
.debug_frame
section.
That would be too easy. When the x86-64 ABI was
being developed, they decided to specify that debug information would
go in the .eh_frame
section, in a slightly different format than is
defined by DWARF. .eh_frame
is what all the tools appear to use in
practice. And there is a standard, portable library for unwinding
based on the .eh_frame
/.debug_frame
mechanism called libunwind.
Gotcha, so gdb and g++’s exception support use libunwind?
No, they have their own unwinder implementations that are completely independent of libunwind (gcc’s unwind-dw2.c and gdb’s dwarf2-frame.c). Gcc can link against libunwind but uses its own unwinder by default. gdb has a whole bunch of unwinders, including one that scans the machine code to figure out how the frame was setup.
Ok, so what about C++ exceptions?
C++ exceptions use .eh_frame
information also. They use this
information to not only recreate the frame, but execute code in it
(C++ destructors and catch
blocks) and finally to dismantle it
before transferring control to the frame with the matching catch
block.
How on Earth does that work?
I don’t really know, but when I find out I’ll post another blog entry.