It’s been nearly four years since I published Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C. In that article, I presented a technique I co-developed for writing really fast interpreters using tail calls and the musttail attribute.

While the article focused on Protocol Buffer parsers, the technique applies to many kinds of parsers and VM interpreters. I published the article in the hopes that the technique would catch on and be adopted in other projects.

In the time since that article was published, there have been many exciting developments in this space. I want to take this opportunity to share some updates.

Tail Calling Interpreter for Python

I recently learned that a tail calling interpreter was going through PR review on GitHub. Authored by Ken Jin as part of his Bachelor’s thesis, it uses the techniques described in my article and claims a 9-15% geomean improvement on pyperformance, the official Python benchmark suite.

Last Friday that interpreter was merged, and it is officially slated to be released in Python 3.14 (release notes).

Note that the tail call interpreter is not the default yet. It has to be enabled at configuration time with --with-tail-call-interp. Hopefully it will be the default in a future release.
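If you want to try it once 3.14 is out, the build step is roughly the following (a sketch only; you will also need a compiler new enough to support musttail and preserve_none, so a recent Clang is the safest bet):

# Opt in to the tail calling interpreter when building CPython.
./configure --with-tail-call-interp
make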

I’m very excited to see this development. In my original article, I made the following prediction:

I think it’s likely that all of the major language interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant performance benefits by adopting this technique.

I’m happy to see that this seems to have come true for Python. Congratulations to Ken on this accomplishment.

Tail Calling Interpreter for LuaJIT Remake

In 2022-2024, Haoran Xu published a series of articles and papers about an ambitious and experimental effort to automatically generate interpreters and JIT compilers from a description of the language semantics. This project is called Deegen, and Haoran used it to build LuaJIT Remake, an experimental reimplementation of the famous LuaJIT.

Deegen uses the tail call pattern for its interpreters, arguing that this is the best way to get the compiler to generate good code. This conclusion aligns with my 2021 article, and his analysis cites my work.

However, he concludes that C and C++ cannot achieve optimal performance for tail calling interpreters because of limitations in the calling convention. The main problem he cites is around callee-save registers, and how the C calling convention forces all functions to preserve six registers, a steep cost given how small these functions tend to be.

He also mentions the restrictions on [[clang::musttail]], particularly on how the caller and callee must have matching function signatures, which can be inconvenient. As a result, Deegen targets LLVM IR instead of C or C++.
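As a concrete illustration of that restriction (my own sketch, not code from Deegen): every handler in a musttail dispatch chain has to share one prototype, even when a given opcode doesn’t need all of the parameters.

// All handlers must have exactly matching signatures for musttail to be
// accepted, so even a handler that ignores some state still has to declare
// and forward it. (Hypothetical handler names, for illustration only.)
typedef struct VM VM;

const char *op_nop(const char *ip, VM *vm, long acc);

const char *op_skip(const char *ip, VM *vm, long acc) {
  // This opcode doesn't care about `acc`, but it must accept and forward
  // it anyway so that the caller and callee signatures match.
  __attribute__((musttail)) return op_nop(ip + 1, vm, acc);
}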

The first issue has been largely addressed by the preserve_none attribute (discussed below). Promising developments are also occurring regarding the second complaint, as detailed below in the “Nonportable Musttail” section.

LuaJIT Remake’s interpreter claims a 31% performance advantage over LuaJIT 2.1, an impressive result given that LuaJIT already has one of the fastest interpreters of any mainstream programming language. It is still an experimental project (for example, garbage collection is unimplemented, so the performance comparison disables GC for LuaJIT too); it would be cool if Deegen or something like it was eventually productionized and used in a mainstream language interpreter. Congratulations to Haoran on this accomplishment.

GCC Support

When I published my original article, GCC had no support for musttail. I floated the idea of adding musttail to GCC in 2021, which generated some interesting discussion among the maintainers, but no clear plan.

I’m happy to report that this has received renewed attention since then. GCC added support for a musttail attribute in C and in C++ in July of 2024. Thanks to the GCC team for this work.

GCC appears to be more strict than Clang in analyzing whether any pointers to locals are referenced across the tail call. I posted some analysis of this issue in the Clang bug tracker.
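Here is a sketch of the kind of code in question (my own illustration, not an example from the bug report): once a local’s address has escaped, the two compilers can differ on whether the tail call is still accepted.

int consume(int *p);
int next(int x);

int step(int x) {
  int tmp = x * 2;
  int y = consume(&tmp);  // the address of a local escapes here
  // GCC may conservatively reject this musttail on the theory that `tmp`
  // could still be referenced across the call, while Clang may accept it.
  __attribute__((musttail)) return next(y);
}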

Standard C Proposal: “return goto”

I recently heard that there is a proposal to add guaranteed tail calls to the C standard as return goto. The proposal is given in n3266. The proposed syntax is:

int a (int, int);

int b (int x, int y) {
    return goto a (y, x);  // Proposed syntax.
}

It’s very exciting to think that guaranteed tail calls might eventually be added to standard C. It is unclear whether it will be included in C2Y, but I am hopeful.

The rationale given in n2920 does not mention the use case of tail call interpreters, even though this was the primary motivation for adding musttail to both Clang and GCC. The use case envisioned by the committee is that C is being used as a code generation target for a language like Scheme that guarantees tail call optimization. The document goes so far as to say they do not expect the feature to see significant use in user-written C code.

But even though they are not designing for the interpreter use case, the proposed feature looks like a perfect fit for tail calling interpreters, so I’m happy for that.
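To make that concrete, here is a rough sketch of what an interpreter handler could look like under the proposed syntax (my own guess at how the feature would be used; the proposal is not implemented anywhere yet, and details may change before standardization):

// Hypothetical handlers sharing one prototype, dispatching with the
// proposed `return goto` instead of [[clang::musttail]].
const char *op_next(const char *ip, long acc);

const char *op_incr(const char *ip, long acc) {
  acc += 1;
  return goto op_next(ip + 1, acc);  // proposed syntax
}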

Interestingly, the proposed C standard is more lax than [[clang::musttail]] about whether the caller and callee have to match in argument types. For various ABI reasons, some calls are impossible to tail call optimize. Clang enforces a set of rules that are intended to be “portable”, so that the tail call can be performed on “any” architecture:

int g();

// Valid according to the proposed standard C feature,
// but the implementation is allowed to fail if this
// cannot be tail-call optimized for an implementation-specific
// reason.
int f(int x) { return goto g(); }

// Always rejected by Clang currently, because the caller
// and callee do not perfectly match in function signature.
// This is rejected even if the implementation is capable of
// tail call optimizing the call, because Clang is trying to
// provide a "portable" guarantee:
int f(int x) { [[clang::musttail]] return g(); }

I think the standard is making a good call here. Clang attempts to ensure “portable” tail calls, but in retrospect, I believe this approach has limitations. We will discuss this further in the next section.

Clang: “Nonportable” Musttail?

When I implemented [[clang::musttail]] in Clang, it merely exposed to C and C++ some functionality that already existed in the LLVM backend. The musttail attribute for LLVM had been added years earlier, and it already had a set of semantics which dictated many things. In particular, it dictated that the caller and callee prototypes must match.

This disallowed code like the following, even though this can be successfully tail call optimized on many platforms:

int g();

// Always rejected by Clang currently, because the caller
// and callee do not perfectly match in function signature.
// This is rejected even if the implementation is capable of
// tail call optimizing the call, because Clang is trying to
// provide a "portable" guarantee:
int f(int x) { [[clang::musttail]] return g(); }

To follow LLVM’s rules, we had to make Clang reject this code.

In theory, we are trading off flexibility for predictability. Even though some platforms are able to tail-call-optimize code like the above, some are not, and it would be surprising if your program worked fine on some platforms but failed to compile on others. So we have a set of rules that, if followed, should guarantee that the code can be tail call optimized on all platforms.

However, once the feature was launched, we started receiving complaints about these rules. There were two main complaints. First, users were understandably frustrated that a tail call would fail to compile on a platform where the tail call was totally possible to optimize. It seems a bit unhelpful for the compiler to reject your program just because it might fail on a different platform.

Users also raised the point that this “guarantee” isn’t really a guarantee at all. Some platforms fundamentally do not support tail calls, like WASM without the Tail Call Extension.

What this means is that the predictability that Clang/LLVM are trying to provide is fundamentally impossible. We’ve traded away flexibility, but we didn’t even get predictability in return.

In response, users have proposed two possible resolutions:

  1. We could relax [[clang::musttail]] so that it only fails if the tail call is impossible on this platform.
  2. We could leave the [[clang::musttail]] attribute the way it is, and introduce a new [[clang::nonportable_musttail]] which has the desired semantics.

I believe option (1) is preferable, though I’m unable to implement it currently. Hopefully, someone will address this bug. Since it requires changes to both Clang and LLVM, it will be a more substantial undertaking than my initial [[clang::musttail]] implementation. However, as it primarily involves relaxing constraints, it should ultimately simplify the implementation.

Calling Convention Improvements

In the original article, I described some limitations where the tail calling technique could result in really bad code. In these scenarios, you could get tons of register spills that were bad for performance and code size. At the time, there weren’t great options to mitigate these problems except to contort the code to never make a non-tail call.

I am happy to report that this problem has been largely solved, at least in Clang. There are two calling conventions that can help. preserve_none optimizes the main interpreter functions that tail-call each other, while preserve_most benefits functions called in non-tail positions.

// preserve_none on tail calling functions optimizes register usage.
__attribute__((preserve_none)) const char *Next(PARAMS);

// preserve_most on non-tail-called functions can help optimize
// the caller.
//
// Caution: preserve_most adds overhead to the target function,
// and may not be worthwhile if the function is hot.
__attribute__((preserve_most)) const char *RegularFunc();

__attribute__((preserve_none)) const char *Parse(PARAMS) {
  if (!ptr) {
    ptr = RegularFunc();
  }
  ptr++;
  __attribute__((musttail)) return Next(ARGS);
}

preserve_most

If we use __attribute__((preserve_most)) on regular functions, it saves the caller from having to shuffle or spill any of its registers prior to the call, which generally will make our tail calling functions smaller and more efficient.

The preserve_most attribute existed when I wrote the original article, but I noted that it seemed to cause unexplainable crashes. This turned out to be a bug in Clang, which was fixed in 2023.

While preserve_most can theoretically solve the register shuffling/spilling problem, it has a major weakness. To get the desired effect, you have to apply it to all functions that you call in non-tail position. Whenever you are modifying your interpreter, if you start calling a new function, you have to add preserve_most to it.1

But what if you are calling a standard library function like strlen()? You can’t modify the calling convention of a function you don’t own. You would have to create a separate wrapper function like my_strlen() and put preserve_most on the wrapper. If you don’t – if you forget this discipline for even a single function – then you get these horrible pessimized prologues and epilogues in the fast path.
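For example, a wrapper along these lines keeps the fast path clean (a sketch; my_strlen is just the illustrative name from the paragraph above, not a real API):

#include <string.h>

// The wrapper absorbs the save/restore cost so the interpreter's fast path
// doesn't have to spill registers around the call. noinline keeps the
// wrapper (and its calling convention) from being inlined away.
__attribute__((preserve_most, noinline))
size_t my_strlen(const char *s) {
  return strlen(s);
}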

While helpful, preserve_most introduces overhead to the target function and should be used judiciously, particularly in hot code paths.

preserve_none

I had always felt that it was more promising to have a special calling convention for the tail calling functions. After all, they are all under our control and they are “special” already: they all use a consistent set of arguments, and we have to call them with the musttail attribute.

To pursue this opportunity, in 2021 I prototyped a new calling convention I called reverse_call. The calling convention was designed to be applied to the tail calling functions, allowing us to get rid of all those gratuitous prologues and epilogues.

The basic idea was to have no callee-saved registers, so our interpreter functions are freed of the burden of preserving any registers. And critically, we also want our argument registers to be allocated in the opposite order from a normal function (starting with callee-save registers, which are normally never used as arguments at all).2 The net effect is that when a reverse_call function calls a normal function, the argument registers for caller and callee do not overlap, and moreover the caller’s argument registers will be preserved by the callee. This allows the interpreter to keep its arguments in registers across the call, without any register shuffling or spilling required. The results from my prototype were good, and I was pretty excited about this idea.

While my initial reverse_call proposal didn’t gain traction, subsequent discussions on the Clang mailing list led to the preserve_none convention, which addresses the same issue. Implementations for x86 and for aarch64 landed in Clang 19.1.0, though I haven’t found a corresponding release note for them.

Shortly after, another LLVM contributor re-discovered my idea of assigning arguments to registers starting with the normally callee-save registers, and created a PR to implement it. So in the end, the preserve_none convention in Clang turned out to be almost exactly what I had envisioned with reverse_call. Thanks to all of the Clang contributors who made this happen.

// Uses preserve_none (landed in Clang 19.1.0).

#define CC __attribute__((preserve_none))

CC const char *Next(PARAMS);
const char *Fallback(PARAMS);

CC const char *Parse(PARAMS) {
  if (!ptr) {
    ptr = Fallback(ARGS);
  }
  ptr++;
  __attribute__((musttail)) return Next(ARGS);
}

The Python tail call interpreter uses preserve_none, and I think that all tail call interpreters would benefit from doing the same.

Conclusion

All of these developments have been exciting to see. I hope that in several more years, we’ll see more progress towards standardizing musttail into return goto, more support for preserve_none across compilers, and hopefully even more programming language interpreters will adopt the technique.

  1. It also pessimizes the target function by forcing its prologue and epilogue to be bigger and slower. This is fine if the function is a fallback function, but if it is called in a reasonably hot path, this could slow down the code. 

  2. I have seen it claimed that the GHC calling convention in LLVM does the same thing, but I was not able to independently confirm this.