<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Josh Haberman</title>
    <description>Parsing, performance, and low-level programming.</description>
    <link>https://blog.reverberate.org/</link>
    <atom:link href="https://blog.reverberate.org/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 10 Feb 2025 18:06:59 +0000</pubDate>
    <lastBuildDate>Mon, 10 Feb 2025 18:06:59 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
    
      <item>
        <title>A Tail Calling Interpreter For Python (And Other Updates)</title>
        <description>&lt;p&gt;It’s been nearly four years since I published &lt;a href=&quot;https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html&quot;&gt;Parsing Protobuf at 2+GB/s: How
I Learned To Love Tail Calls in
C&lt;/a&gt;.
In that article, I presented a technique I co-developed for how to write really
fast interpreters through the use of tail calls and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;While the article focused on Protocol Buffer parsers, the technique applies to
many kinds of parsers and VM interpreters.  I published the article in the
hopes that the technique would catch on and be adopted in other projects.&lt;/p&gt;

&lt;p&gt;In the time since that article was published, there have been many exciting
developments in this space.  I want to take this opportunity to share some
updates.&lt;/p&gt;

&lt;h1 id=&quot;tail-calling-interpreter-for-python&quot;&gt;Tail Calling Interpreter for Python&lt;/h1&gt;

&lt;p&gt;I recently learned that a tail calling interpreter was going through &lt;a href=&quot;https://github.com/python/cpython/issues/128563&quot;&gt;PR review
on GitHub&lt;/a&gt;.  Authored by &lt;a href=&quot;https://x.com/kenjin4096/status/1887935698906529903&quot;&gt;Ken
Jin as part of his Bachelor’s
thesis&lt;/a&gt;, it uses the
techniques described in my article and claims a 9-15% geomean
improvement on &lt;a href=&quot;https://github.com/python/pyperformance&quot;&gt;pyperformance&lt;/a&gt;, the
official Python
benchmark suite.&lt;/p&gt;

&lt;p&gt;Last Friday that interpreter &lt;a href=&quot;https://github.com/python/cpython/commit/cb640b659e14cb0a05767054f95a9d25787b472d&quot;&gt;was
merged&lt;/a&gt;,
and it is officially slated to be released in Python 3.14 (&lt;a href=&quot;https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-tail-call&quot;&gt;release
notes&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Note that the tail call interpreter is not the default yet.  It has to be enabled
at configuration time with &lt;a href=&quot;https://docs.python.org/3.14/using/configure.html#cmdoption-with-tail-call-interp&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--with-tail-call-interp&lt;/code&gt;&lt;/a&gt;.  Hopefully it will be the default in a future release.&lt;/p&gt;

&lt;p&gt;I’m very excited to see this development.  In my original article, I made the
following prediction:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I think it’s likely that all of the major language interpreters written in C
(Python, Ruby, PHP, Lua, etc.) could get significant performance benefits by
adopting this technique.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m happy to see that this seems to have come true for Python.  Congratulations
to Ken on this accomplishment.&lt;/p&gt;

&lt;h1 id=&quot;tail-calling-interpreter-for-luajit-remake&quot;&gt;Tail Calling Interpreter for LuaJIT Remake&lt;/h1&gt;

&lt;p&gt;In 2022-2024, Haoran Xu published a series of
&lt;a href=&quot;https://sillycross.github.io/2022/11/22/2022-11-22/&quot;&gt;articles&lt;/a&gt; and
&lt;a href=&quot;https://arxiv.org/abs/2411.11469&quot;&gt;papers&lt;/a&gt; about an ambitious and experimental
effort to automatically generate interpreters and JIT compilers from a
description of the language semantics.  This project is called
&lt;a href=&quot;https://arxiv.org/abs/2411.11469&quot;&gt;Deegen&lt;/a&gt;, and Haoran used it to build &lt;a href=&quot;https://github.com/luajit-remake/luajit-remake&quot;&gt;LuaJIT
Remake&lt;/a&gt;, an experimental
reimplementation of the famous &lt;a href=&quot;https://luajit.org/&quot;&gt;LuaJIT&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Deegen uses the tail call pattern for its interpreters, arguing that &lt;a href=&quot;https://sillycross.github.io/2022/11/22/2022-11-22/#Why-Assembly-After-All&quot;&gt;this is the
best way to get the compiler to generate good
code&lt;/a&gt;.
This conclusion aligns with my 2021 article, and his analysis cites my work.&lt;/p&gt;

&lt;p&gt;However, he concludes that C and C++ cannot achieve optimal performance for
tail calling interpreters because of limitations in the calling convention.
The main problem he cites is around callee-save registers, and how the C
calling convention forces all functions to preserve six registers, a steep cost
given how small these functions tend to be.&lt;/p&gt;

&lt;p&gt;He also mentions the restrictions on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt;, particularly on how
the caller and callee must have matching function signatures, which can be
inconvenient.  As a result, Deegen targets LLVM IR instead of C or C++.&lt;/p&gt;

&lt;p&gt;The first issue has been largely addressed by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt; attribute
(discussed below).  Promising developments are also occurring regarding the
second complaint, as detailed below in the “Nonportable Musttail” section.&lt;/p&gt;

&lt;p&gt;LuaJIT Remake’s interpreter claims a 31% performance advantage over LuaJIT 2.1,
an impressive result given that LuaJIT already has one of the fastest
interpreters of any mainstream programming language.  It is still an
experimental project (for example, garbage collection is unimplemented, so the
performance comparison disables GC for LuaJIT too); it would be cool if Deegen
or something like it was eventually productionized and used in a mainstream
language interpreter.  Congratulations to Haoran on this accomplishment.&lt;/p&gt;

&lt;h1 id=&quot;gcc-support&quot;&gt;GCC Support&lt;/h1&gt;

&lt;p&gt;When I published my original article, GCC had no support for musttail.  I
&lt;a href=&quot;https://gcc.gnu.org/pipermail/gcc/2021-April/235882.html&quot;&gt;floated the idea of adding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; to the GCC maintainers in
2021&lt;/a&gt;, which
generated some interesting discussion, but no clear plan.&lt;/p&gt;

&lt;p&gt;I’m happy to report that this has &lt;a href=&quot;https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83324#c4&quot;&gt;received renewed attention since
then&lt;/a&gt;.  GCC added
support for a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; attribute &lt;a href=&quot;https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=7db47f7b915c5f5d645fa536547e26b92290afe3&quot;&gt;in
C&lt;/a&gt;
and &lt;a href=&quot;https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=59dd1d7ab21ad9a7ebf641ec9aeea609c003ad2f&quot;&gt;in
C++&lt;/a&gt;
in July of 2024.  Thanks to the GCC team for this work.&lt;/p&gt;

&lt;p&gt;GCC appears to be more strict than Clang in analyzing whether any pointers to
locals are referenced across the tail call.  I posted &lt;a href=&quot;https://github.com/llvm/llvm-project/issues/72555#issuecomment-2644233781&quot;&gt;some analysis of this
issue&lt;/a&gt;
in the Clang bug tracker.&lt;/p&gt;

&lt;h1 id=&quot;standard-c-proposal-return-goto&quot;&gt;Standard C Proposal: “return goto”&lt;/h1&gt;

&lt;p&gt;I recently heard that there is a proposal to add guaranteed tail calls to the C
standard as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;return goto&lt;/code&gt;.  The proposal is given in
&lt;a href=&quot;https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3266.htm#user-content-5-tail-call-elimination&quot;&gt;n3266&lt;/a&gt;.
The proposed syntax is:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Proposed syntax.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It’s very exciting to think that guaranteed tail calls might eventually be added
to standard C.  It is unclear whether it will be included in C2Y, but I am hopeful.&lt;/p&gt;

&lt;p&gt;The rationale given in
&lt;a href=&quot;https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2920.pdf&quot;&gt;n2920&lt;/a&gt; does not
mention the use case of tail call interpreters, even though this was the
primary motivation for adding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; to both Clang and GCC.  The use case
envisioned by the committee is that C is being used as a code generation target
for a language like Scheme that guarantees tail call optimization.  The
document goes so far as to say they do not expect the feature to see
significant use in user-written C code.&lt;/p&gt;

&lt;p&gt;But even though they are not designing for the interpreter use case, the
proposed feature looks like a perfect fit for tail calling interpreters, so I’m
happy for that.&lt;/p&gt;

&lt;p&gt;Interestingly, the proposed C standard is more lax than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt;
about whether the caller and callee have to match in argument types.  For
various ABI reasons, some calls are impossible to tail call optimize.  Clang
enforces a set of rules that are intended to be “portable”, so that the
tail call can be performed on “any” architecture:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Valid according to the proposed standard C feature,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// but the implementation is allowed to fail if this&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// cannot be tail-call optimized for an implementation-specific&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// reason.&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Always rejected by Clang currently, because the caller&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// and callee do not perfectly match in function signature.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// This is rejected even if the implementation is capable of&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// tail call optimizing the call, because Clang is trying to&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// provide a &quot;portable&quot; guarantee:&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;musttail&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I think the proposal is making a good call here.  Clang attempts to ensure
“portable” tail calls, but in retrospect, I believe this approach has
limitations.  We will discuss this further in the next section.&lt;/p&gt;

&lt;h1 id=&quot;clang-nonportable-musttail&quot;&gt;Clang: “Nonportable” Musttail?&lt;/h1&gt;

&lt;p&gt;When I implemented &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt; in Clang, it merely exposed to C and
C++ some functionality that already existed in the LLVM backend.  The
&lt;a href=&quot;https://llvm.org/docs/LangRef.html#id332&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; attribute for LLVM&lt;/a&gt; had
been added years earlier, and it already had a set of semantics which dictated
many things.  In particular, it dictated that the caller and callee prototypes
must match.&lt;/p&gt;

&lt;p&gt;This disallowed code like the following, even though this can be successfully
tail call optimized on many platforms:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Always rejected by Clang currently, because the caller&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// and callee do not perfectly match in function signature.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// This is rejected even if the implementation is capable of&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// tail call optimizing the call, because Clang is trying to&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// provide a &quot;portable&quot; guarantee:&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;musttail&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;g&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To follow LLVM’s rules, we had to make Clang reject this code.&lt;/p&gt;

&lt;p&gt;In theory, we are trading off flexibility for predictability.  Even though
&lt;em&gt;some&lt;/em&gt; platforms are able to tail-call-optimize code like the above, some are
not, and it would be surprising if your program worked fine on some platforms
but failed to compile on others.  So we have a set of rules that, if followed,
should guarantee that the code can be tail call optimized on all platforms.&lt;/p&gt;

&lt;p&gt;However, once the feature was launched, we started receiving &lt;a href=&quot;https://github.com/llvm/llvm-project/issues/54964&quot;&gt;complaints about
these rules&lt;/a&gt;.  There were
two main complaints.  First, users were understandably frustrated that a tail
call would fail to compile on a platform where the tail call was totally
possible to optimize.  It seems a bit unhelpful for the compiler to reject your
program just because it &lt;em&gt;might&lt;/em&gt; fail on a &lt;em&gt;different&lt;/em&gt; platform.&lt;/p&gt;

&lt;p&gt;Users also raised the point that this “guarantee” isn’t really a guarantee at
all.  Some platforms fundamentally do not support tail calls, like WASM without
the &lt;a href=&quot;https://github.com/WebAssembly/tail-call/blob/main/proposals/tail-call/Overview.md&quot;&gt;Tail Call
Extension&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What this means is that the predictability that Clang/LLVM are trying to
provide is fundamentally impossible.  We’ve traded off flexibility, but we
didn’t even get predictability in return.&lt;/p&gt;

&lt;p&gt;In response, users have proposed two possible resolutions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We could relax &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt; so that it only fails if the tail call
is impossible on &lt;em&gt;this&lt;/em&gt; platform.&lt;/li&gt;
  &lt;li&gt;We could leave the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt; attribute the way it is, and introduce
a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::nonportable_musttail]]&lt;/code&gt; which has the desired semantics.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I believe option (1) is preferable, though I’m unable to implement it
currently. Hopefully, someone will address &lt;a href=&quot;https://github.com/llvm/llvm-project/issues/54964&quot;&gt;this
bug&lt;/a&gt;. Since it requires
changes to both Clang and LLVM, it will be a more substantial undertaking than
my initial &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt; implementation. However, as it primarily
involves relaxing constraints, it should ultimately simplify the
implementation.&lt;/p&gt;

&lt;h1 id=&quot;calling-convention-improvements&quot;&gt;Calling Convention Improvements&lt;/h1&gt;

&lt;p&gt;In the original article, I described some
&lt;a href=&quot;https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html#limitations&quot;&gt;limitations&lt;/a&gt;
where the tail calling technique could result in really bad code.  In these
scenarios, you could get tons of register spills that were bad for performance
and code size.  At the time, there weren’t great options to mitigate these
problems except to contort the code to never make a non-tail call.&lt;/p&gt;

&lt;p&gt;I am happy to report that this problem has been largely solved, at least in
Clang.  There are two calling conventions that can help.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt;
optimizes the main interpreter functions that tail-call each other, while
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; benefits functions called in non-tail positions.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// preserve_none on tail calling functions optimizes register usage.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__attribute__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;preserve_none&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// preserve_most on non-tail-called functions can help optimize&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// the caller.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Caution: preserve_most adds overhead to the target function,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// and may not be worthwhile if the function is hot.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__attribute__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;preserve_most&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;RegularFunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__attribute__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;preserve_none&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Parse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RegularFunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;__attribute__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;musttail&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;preserve_most&quot;&gt;preserve_most&lt;/h2&gt;

&lt;p&gt;If we use
&lt;a href=&quot;https://clang.llvm.org/docs/AttributeReference.html#preserve-most&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__attribute__((preserve_most))&lt;/code&gt;&lt;/a&gt;
on regular functions, it saves the caller from having to shuffle or spill any
of its registers prior to the call, which generally will make our tail calling
functions smaller and more efficient.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; attribute existed when I wrote the original article, but I
noted that it seemed to cause unexplainable crashes.  This turned out to be a
bug in Clang, which was &lt;a href=&quot;https://reviews.llvm.org/D141020&quot;&gt;fixed in 2023&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; can theoretically solve the register shuffling/spilling
problem, it has a major weakness.  To get the desired effect, you have to apply
it to &lt;em&gt;all&lt;/em&gt; functions that you call in non-tail position.  Whenever you are
modifying your interpreter, if you start calling a new function, you have to
add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; to it.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;But what if you are calling a standard library function like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strlen()&lt;/code&gt;?  You
can’t modify the calling convention of a function you don’t own.  You would
have to create a separate wrapper function like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my_strlen()&lt;/code&gt; and put
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; on the wrapper.  If you don’t – if you forget this discipline
for even a single function – then you get these horrible pessimized prologues
and epilogues in the fast path.&lt;/p&gt;
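
&lt;p&gt;As a rough sketch of that wrapper discipline, something like the following
could work.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PRESERVE_MOST&lt;/code&gt; macro, its fallback, and the use of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;noinline&lt;/code&gt; are illustrative assumptions, not part of any library:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;string.h&amp;gt;

// Assumption: fall back to the default convention where preserve_most
// is unavailable (it is a Clang extension).
#if defined(__has_attribute)
#if __has_attribute(preserve_most)
#define PRESERVE_MOST __attribute__((preserve_most))
#endif
#endif
#ifndef PRESERVE_MOST
#define PRESERVE_MOST
#endif

// Hypothetical wrapper around a function we do not own.  noinline keeps
// the call boundary, and with it the calling convention, in place.
PRESERVE_MOST __attribute__((noinline)) size_t my_strlen(const char *s) {
  return strlen(s);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Interpreter functions would then call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my_strlen()&lt;/code&gt; wherever they
previously called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strlen()&lt;/code&gt; in non-tail position.&lt;/p&gt;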

&lt;p&gt;While helpful, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; introduces overhead to the target function
and should be used judiciously, particularly in hot code paths.&lt;/p&gt;

&lt;h2 id=&quot;preserve_none&quot;&gt;preserve_none&lt;/h2&gt;

&lt;p&gt;I had always felt that it was more promising to have a special calling
convention for the tail calling functions.  After all, they are all under our
control and they are “special” already: they all use a consistent set of
arguments, and we have to call them with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;To pursue this opportunity, in 2021 I prototyped a &lt;a href=&quot;https://github.com/haberman/llvm-project/commit/e8d9c75bb35ce9c802f8eac522a2c6ce003f857f&quot;&gt;new calling convention I
called
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reverse_call&lt;/code&gt;&lt;/a&gt;.
The calling convention was designed to be applied to the tail calling
functions, allowing us to get rid of all those gratuitous prologues and
epilogues.&lt;/p&gt;

&lt;p&gt;The basic idea was to have no callee-saved registers, so our interpreter
functions are freed of the burden of preserving any registers.  And critically,
we also want our argument registers to be allocated in the opposite order from
a normal function (starting with callee-save registers, which are normally
never used as arguments at all).&lt;sup id=&quot;fnref:0&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:0&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;  The net effect is that when a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reverse_call&lt;/code&gt; function calls a normal function, the argument registers for
caller and callee do not overlap, and moreover the caller’s argument registers
will be preserved by the callee.  This allows the interpreter to keep its
arguments in registers across the call, without any register shuffling or
spilling required.  The results from my prototype were good, and I was pretty
excited about this idea.&lt;/p&gt;

&lt;p&gt;While my initial &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reverse_call&lt;/code&gt; proposal didn’t gain traction, &lt;a href=&quot;https://discourse.llvm.org/t/rfc-exposing-ghccc-calling-convention-as-preserve-none-to-clang/74233&quot;&gt;subsequent
discussions on the Clang mailing
list&lt;/a&gt;
led to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt; convention, addressing the same issue.
Implementations &lt;a href=&quot;https://github.com/llvm/llvm-project/pull/76868&quot;&gt;for x86&lt;/a&gt;  and
&lt;a href=&quot;https://github.com/llvm/llvm-project/issues/87423&quot;&gt;for aarch64&lt;/a&gt; landed in
Clang 19.1.0, though I haven’t found a corresponding release note for them.&lt;/p&gt;

&lt;p&gt;Shortly after, another LLVM contributor re-discovered my idea of &lt;a href=&quot;https://github.com/llvm/llvm-project/pull/76868#issuecomment-2035874303&quot;&gt;assigning
arguments to registers starting with the normally callee-save
registers&lt;/a&gt;,
and created &lt;a href=&quot;https://github.com/llvm/llvm-project/pull/88333&quot;&gt;a PR to implement
it&lt;/a&gt;.  So in the end, the
&lt;a href=&quot;https://clang.llvm.org/docs/AttributeReference.html#preserve-none&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt; convention in
Clang&lt;/a&gt;
turned out to be almost exactly what I had envisioned with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reverse_call&lt;/code&gt;.
Thanks to all of the Clang contributors who made this happen.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Uses preserve_none (landed in Clang in 19.1.0).&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#define CC __attribute__((preserve_none))
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CC&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;CC&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Parse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;__attribute__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;musttail&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Python tail call interpreter uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt;, and I think that all
tail call interpreters would benefit from doing the same.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;All of these developments have been exciting to see.  I hope that in several
more years, we’ll see more progress towards standardizing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; into
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;return goto&lt;/code&gt;, more support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt; across compilers, and
even more programming language interpreters adopting the technique.&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It also pessimizes the target function by forcing &lt;em&gt;its&lt;/em&gt; prologue and
  epilogue to be bigger and slower.  This is fine if the function is a
  fallback function, but if it is called in a reasonably hot path, this
  could slow down the code. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:0&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I have seen it claimed that the GHC calling convention in LLVM does the
  same thing, but I was not able to independently confirm this. &lt;a href=&quot;#fnref:0&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2025/02/10/tail-call-updates.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2025/02/10/tail-call-updates.html</guid>
        
        
      </item>
    
    
      <item>
        <title>No-Panic Rust: A Nice Technique for Systems Programming</title>
        <description>&lt;p&gt;Can Rust replace C? This is a question that has been on my mind for many years,
as I created and now am tech lead for
&lt;a href=&quot;https://github.com/protocolbuffers/protobuf/tree/main/upb&quot;&gt;upb&lt;/a&gt;, a C library
for Protocol Buffers.  There is an understandable push to bring memory safety
to all parts of the software stack, and this would suggest a port of upb to Rust.&lt;/p&gt;

&lt;p&gt;While I love the premise of Rust, I have long been skeptical that a port of upb
to Rust could preserve the performance and code size characteristics that I and
others have fought so hard to optimize.  In fact, this blog entry was
originally going to be an argument for why Rust cannot match C for upb’s use
case.&lt;/p&gt;

&lt;p&gt;But I recently discovered a technique that shifted my thinking a lot.  I call
it “No-Panic Rust”, and while the technique is clearly not new&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, I was not
able to find any in-depth discussion of how it works or what problems it
solves.  This article is my attempt to fill that gap.&lt;/p&gt;

&lt;p&gt;I believe that No-Panic Rust is the key to making Rust a compelling option for
low-level systems programming.  I am now optimistic about the possibility of
porting upb to Rust.&lt;/p&gt;

&lt;h1 id=&quot;what-are-panics&quot;&gt;What are Panics?&lt;/h1&gt;

&lt;p&gt;Panics are Rust’s mechanism for &lt;em&gt;unrecoverable errors&lt;/em&gt;.  Anytime our program
encounters an error, we have three basic options for how to handle it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Handle the error immediately (e.g. retry the operation or fall back to plan B).&lt;/li&gt;
  &lt;li&gt;Propagate the error to the caller, who can decide how to handle it.&lt;/li&gt;
  &lt;li&gt;Immediately abort execution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Rust, we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Result&lt;/code&gt; for (2) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;panic!()&lt;/code&gt; for (3).  When we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Result&lt;/code&gt;,
it is considered a “recoverable error”, because the caller can test for the
error and decide how to respond.&lt;/p&gt;

&lt;p&gt;With recoverable errors, the potential for error is reflected in the function
signature; a function that returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Result&lt;/code&gt; is fallible from the perspective of
the caller.  Panics, on the other hand, present the illusion of infallibility
from an API perspective, but then proceed to handle errors by simply aborting.&lt;/p&gt;
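
&lt;p&gt;As a minimal sketch of that difference (the function names are hypothetical,
chosen only for illustration), compare a function that reports an empty input
through its signature with one that looks infallible but aborts:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical helpers, for illustration only.

// Recoverable: the caller can see from the return type that this may fail.
pub fn first_byte(data: &amp;amp;[u8]) -&amp;gt; Result&amp;lt;u8, &amp;amp;&apos;static str&amp;gt; {
    data.first().copied().ok_or(&quot;empty input&quot;)
}

// Unrecoverable: the signature looks infallible, but indexing an
// empty slice panics.
pub fn first_byte_or_panic(data: &amp;amp;[u8]) -&amp;gt; u8 {
    data[0]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;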

&lt;p&gt;There is a lot of standard guidance for when to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;panic!()&lt;/code&gt; vs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Result&lt;/code&gt; (for
example,
&lt;a href=&quot;https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html&quot;&gt;here&lt;/a&gt;
and
&lt;a href=&quot;https://doc.rust-lang.org/std/macro.panic.html#when-to-use-panic-vs-result&quot;&gt;here&lt;/a&gt;),
which largely boils down to the idea that panics should only be used for bugs
in the code.  I especially like the framing given in &lt;a href=&quot;https://www.reddit.com/r/rust/comments/9x17hn/comment/e9p5c9t/&quot;&gt;this Reddit post&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[If] your &lt;strong&gt;library&lt;/strong&gt; is the source of a panic, then one of the following
should be true:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;
      &lt;p&gt;Your library has a bug.&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
      &lt;p&gt;Your library documents a precondition of a public API item that, when not
met, causes a panic. Therefore, the user of your library has misused your
library, and their code has a bug.&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;If your Rust &lt;strong&gt;application&lt;/strong&gt; panics in response to any user input, then the
following should be true: your application has a bug, whether it be in a
library or in the primary application code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this article we are focused on the library case.&lt;/p&gt;

&lt;h1 id=&quot;why-are-panics-bad-for-systems-libraries&quot;&gt;Why Are Panics Bad For Systems Libraries?&lt;/h1&gt;

&lt;p&gt;If we are trying to port a C library to Rust, we really do not want to
introduce panics in the code, even for unusual error conditions.  They cause
many practical problems:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Code Size:&lt;/strong&gt; The runtime to handle a panic pulls in about 300KB of code.&lt;sup id=&quot;fnref:0&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:0&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;
We pay this cost if even a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;panic!()&lt;/code&gt; is reachable in the code.
From a code size perspective, this is a severe overhead, given that the upb
core is only 30KB.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Unrecoverable exit:&lt;/strong&gt; If a panic is triggered, it takes down the
entire process.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;  In many applications, this is a severe failure mode
that libraries should never invoke.  Instead, we should return all errors
to the caller using status codes.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Runtime overhead:&lt;/strong&gt; A potential panic implies some kind of runtime check.
In many cases, the cost of this check will be minimal, but for very small
and frequently invoked operations, the cost of this check could be
significant.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the case of upb, I was concerned about all three of these factors.  Ideally
we could port upb to Rust without users even noticing.  To do that, we want to
maintain the same performance, code size footprint, and error reporting
behavior that the C code has now.  Panics get in the way of this ideal.&lt;/p&gt;

&lt;p&gt;At some point I realized that it might be possible to ban panics from the
library entirely, which would solve all of these problems at once.  That is
when I started getting much more optimistic about porting upb to Rust.&lt;/p&gt;

&lt;h1 id=&quot;what-is-no-panic-rust&quot;&gt;What is No-Panic Rust?&lt;/h1&gt;

&lt;p&gt;No-Panic Rust is a subset of Rust for which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;panic!()&lt;/code&gt; is unreachable.
Programs written in No-Panic Rust are guaranteed never to panic under any
circumstances.&lt;/p&gt;

&lt;p&gt;For a library, this means we should be able to build a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cdylib&lt;/code&gt; that does
not have a panic handler linked into it at all.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;We can experiment on &lt;a href=&quot;https://godbolt.org&quot;&gt;godbolt.org&lt;/a&gt;
to see if we have succeeded or not.  Using my tool &lt;a href=&quot;https://github.com/google/bloaty&quot;&gt;Bloaty&lt;/a&gt;,
we can see if the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cdylib&lt;/code&gt; binary is &amp;gt;300KB (suggesting that the panic
handler has been linked in) or &amp;lt;10KB (suggesting it has not).&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Let’s explore this subset a bit.  Is “Hello, World” no-panic?&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Hello, World!&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Can panic&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;No, per &lt;a href=&quot;https://doc.rust-lang.org/std/macro.println.html#panics&quot;&gt;the documentation for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;println!()&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Panics if writing to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io::stdout&lt;/code&gt; fails.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And indeed, if we try this on Godbolt, we see a big binary:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;200px&quot; src=&quot;https://godbolt.org/e#g:!((g:!((g:!((h:codeEditor,i:(filename:&apos;1&apos;,fontScale:14,fontUsePx:&apos;0&apos;,j:1,lang:rust,selection:(endColumn:2,endLineNumber:4,positionColumn:2,positionLineNumber:4,selectionStartColumn:2,selectionStartLineNumber:4,startColumn:2,startLineNumber:4),source:&apos;%23%5Bno_mangle%5D%0Apub+extern+%22C%22+fn+hello_world()+%7B%0A++++println!!(%22Hello,+World!!%22)++++//+Can+panic%0A%7D&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;Rust+source+%231&apos;,t:&apos;0&apos;)),k:61.11757857974389,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;),(g:!((h:tool,i:(args:&apos;--domain+vm+-n+3&apos;,argsPanelShown:&apos;1&apos;,compilerName:&apos;rustc+1.84.0&apos;,editorid:1,fontScale:14,fontUsePx:&apos;0&apos;,j:2,monacoEditorHasBeenAutoOpened:&apos;1&apos;,monacoEditorOpen:&apos;1&apos;,monacoStdin:&apos;1&apos;,stdin:&apos;&apos;,stdinPanelShown:&apos;1&apos;,toolId:bloaty11,treeid:0,wrap:&apos;1&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;bloaty+(1.1)+rustc+1.84.0+(Editor+%231,+Compiler+%232)&apos;,t:&apos;0&apos;),(h:compiler,i:(compiler:r1840,filters:(b:&apos;0&apos;,binary:&apos;1&apos;,binaryObject:&apos;0&apos;,commentOnly:&apos;0&apos;,debugCalls:&apos;1&apos;,demangle:&apos;0&apos;,directives:&apos;0&apos;,execute:&apos;1&apos;,intel:&apos;0&apos;,libraryCode:&apos;0&apos;,trim:&apos;1&apos;,verboseDemangling:&apos;0&apos;),flagsViewOpen:&apos;1&apos;,fontScale:14,fontUsePx:&apos;0&apos;,j:2,lang:rust,libs:!((name:libc,ver:&apos;02126&apos;)),options:&apos;--crate-type%3Dcdylib&apos;,overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;+rustc+1.84.0+(Editor+%231)&apos;,t:&apos;0&apos;)),k:38.88242142025611,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;)),l:
&apos;2&apos;,n:&apos;0&apos;,o:&apos;&apos;,t:&apos;0&apos;)),version:4&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;So &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;println!()&lt;/code&gt; is out.  If we want to print to stdout, we’ll need to use an API
that does not advertise that panic is possible.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://doc.rust-lang.org/beta/std/io/fn.stdout.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stdout&lt;/code&gt;&lt;/a&gt; API looks
promising, because it has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write_all()&lt;/code&gt; method that returns a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Result&lt;/code&gt;, which
should allow us to handle errors explicitly:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;io&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::{&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nn&quot;&gt;io&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;stdout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.write_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;b&quot;Hello, World!&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.is_ok&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This seems like it should be no-panic.  We are only calling two APIs,
&lt;a href=&quot;https://doc.rust-lang.org/beta/std/io/fn.stdout.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stdout()&lt;/code&gt;&lt;/a&gt; and
&lt;a href=&quot;https://doc.rust-lang.org/beta/std/io/struct.Stdout.html#method.write_all-1&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write_all()&lt;/code&gt;&lt;/a&gt;,
neither of which documents a potential panic.&lt;/p&gt;

&lt;p&gt;But if we try it, we’ll see that panic is indeed reachable in this program
somehow.&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;260px&quot; src=&quot;https://godbolt.org/e#g:!((g:!((g:!((h:codeEditor,i:(filename:&apos;1&apos;,fontScale:14,fontUsePx:&apos;0&apos;,j:1,lang:rust,selection:(endColumn:55,endLineNumber:5,positionColumn:55,positionLineNumber:5,selectionStartColumn:55,selectionStartLineNumber:5,startColumn:55,startLineNumber:5),source:&apos;use+std::io::%7Bself,+Write%7D%3B%0A%0A%23%5Bno_mangle%5D%0Apub+extern+%22C%22+fn+hello_world()+-%3E+bool+%7B%0A++++io::stdout().write_all(b%22Hello,+World!!%5Cn%22).is_ok()%0A%7D&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;Rust+source+%231&apos;,t:&apos;0&apos;)),k:61.11757857974389,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;),(g:!((h:tool,i:(args:&apos;--domain+vm+-n+3&apos;,argsPanelShown:&apos;1&apos;,compilerName:&apos;rustc+1.84.0&apos;,editorid:1,fontScale:14,fontUsePx:&apos;0&apos;,j:2,monacoEditorHasBeenAutoOpened:&apos;1&apos;,monacoEditorOpen:&apos;1&apos;,monacoStdin:&apos;1&apos;,stdin:&apos;&apos;,stdinPanelShown:&apos;1&apos;,toolId:bloaty11,treeid:0,wrap:&apos;1&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;bloaty+(1.1)+rustc+1.84.0+(Editor+%231,+Compiler+%232)&apos;,t:&apos;0&apos;),(h:compiler,i:(compiler:r1840,filters:(b:&apos;0&apos;,binary:&apos;1&apos;,binaryObject:&apos;0&apos;,commentOnly:&apos;0&apos;,debugCalls:&apos;1&apos;,demangle:&apos;0&apos;,directives:&apos;0&apos;,execute:&apos;1&apos;,intel:&apos;0&apos;,libraryCode:&apos;0&apos;,trim:&apos;1&apos;,verboseDemangling:&apos;0&apos;),flagsViewOpen:&apos;1&apos;,fontScale:14,fontUsePx:&apos;0&apos;,j:2,lang:rust,libs:!((name:libc,ver:&apos;02126&apos;)),options:&apos;--crate-type%3Dcdylib&apos;,overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;+rustc+1.84.0+(Editor+%231)&apos;,t:&apos;0&apos;)),k:38.88242142025611,l:&apos;
4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;)),l:&apos;2&apos;,n:&apos;0&apos;,o:&apos;&apos;,t:&apos;0&apos;)),version:4&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;From this we have learned that we unfortunately cannot rely on panic
annotations in API documentation to determine &lt;em&gt;a priori&lt;/em&gt; whether some Rust code
is no-panic or not.  We have to actually try it and observe the results.&lt;/p&gt;

&lt;p&gt;How can we diagnose what went wrong?  On macOS, the linker has a very handy
option called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-why_live&lt;/code&gt;, which will print the chain of symbol references
that prevented a symbol from being dead-stripped.  We can’t access it on Godbolt
unfortunately, but on macOS we can run this command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ RUSTC_LOG=rustc_codegen_ssa::back::link=info \
  RUSTFLAGS=&quot;-C link-arg=-Wl,-why_live,_rust_panic&quot; \
  cargo build --release 2&amp;gt;&amp;amp;1 | rustfilt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in the following output, with extraneous details removed:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;_core::panicking::panic from [...]
  _core::ops::function::FnOnce::call_once from [...]
    l_anon.56b0c16dbe4596c74313e318a3dfaa78.520 from [...]
      _std::sync::once_lock::OnceLock&amp;lt;T&amp;gt;::initialize from [...]
        _std::io::stdio::stdout from [...]
          _hello_world from [...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The panic reference apparently comes from
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_core::ops::function::FnOnce::call_once&lt;/code&gt;, which is reached
from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_std::io::stdio::stdout&lt;/code&gt; by way of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OnceLock&amp;lt;T&amp;gt;::initialize&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This seems to suggest that Rust’s standard library does not meet the criteria
given &lt;a href=&quot;#what-are-panics&quot;&gt;above&lt;/a&gt;, because it is capable of panicking even in APIs
like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::io::stdout()&lt;/code&gt; that do not document a panic-worthy precondition.&lt;/p&gt;

&lt;p&gt;This also implies that we need tests that check for the no-panic property.
It’s not enough to check once that the code is no-panic; we need to make sure
it &lt;em&gt;stays&lt;/em&gt; no-panic over time, even as our project and our dependencies
evolve.&lt;/p&gt;
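
&lt;p&gt;One rough way to automate such a check is to assert on the size of the built
library, mirroring the size heuristic used above.  The sketch below is only an
assumption about how such a test might look: the function name and the 100 KiB
threshold are hypothetical, and on-disk file size is a coarser signal than the
VM size that Bloaty reports.&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;use std::fs;
use std::io;

// Hypothetical threshold, sitting between the two sizes observed above.
const MAX_NO_PANIC_BYTES: u64 = 100 * 1024;

// Returns Ok(true) if the file at `path` is small enough to suggest
// that the panic handler was not linked in.
pub fn looks_no_panic(path: &amp;amp;str) -&amp;gt; io::Result&amp;lt;bool&amp;gt; {
    let size = fs::metadata(path)?.len();
    Ok(size &amp;lt; MAX_NO_PANIC_BYTES)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A test could call this on the release build of the library and fail if it
returns false, which would flag a newly reachable panic soon after it is
introduced.&lt;/p&gt;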

&lt;p&gt;To get a fully no-panic version of “Hello, World”, we have to reach for the C
library &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libc&lt;/code&gt;.  This makes sense, since the C library is generally written to
return all errors as status codes or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;errno&lt;/code&gt;.  Unfortunately this means turning
to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unsafe&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;crate&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;libc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MSG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;&apos;static&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Hello, World!&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\0&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;unsafe&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nn&quot;&gt;libc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MSG&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.as_ptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And checking on Godbolt, we see the small binary that confirms that this
library is indeed no-panic:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;250px&quot; src=&quot;https://godbolt.org/e#g:!((g:!((g:!((h:codeEditor,i:(filename:&apos;1&apos;,fontScale:14,fontUsePx:&apos;0&apos;,j:1,lang:rust,selection:(endColumn:2,endLineNumber:10,positionColumn:1,positionLineNumber:1,selectionStartColumn:2,selectionStartLineNumber:10,startColumn:1,startLineNumber:1),source:&apos;extern+crate+libc%3B%0A%0A%23%5Bno_mangle%5D%0Apub+extern+%22C%22+fn+hello_world()+-%3E+bool+%7B%0A++++const+MSG:+%26!&apos;static+str+%3D+%22Hello,+World!!%5Cn%5C0%22%3B%0A++++let+result+%3D+unsafe+%7B%0A++++++++libc::printf(MSG.as_ptr()+as+*const+_)%0A++++%7D%3B%0A++++result+%3E%3D+0%0A%7D&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;Rust+source+%231&apos;,t:&apos;0&apos;)),k:61.11757857974389,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;),(g:!((h:tool,i:(args:&apos;--domain+vm+-n+3&apos;,argsPanelShown:&apos;1&apos;,compilerName:&apos;rustc+1.84.0&apos;,editorid:1,fontScale:14,fontUsePx:&apos;0&apos;,j:2,monacoEditorHasBeenAutoOpened:&apos;1&apos;,monacoEditorOpen:&apos;1&apos;,monacoStdin:&apos;1&apos;,stdin:&apos;&apos;,stdinPanelShown:&apos;1&apos;,toolId:bloaty11,treeid:0,wrap:&apos;1&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;bloaty+(1.1)+rustc+1.84.0+(Editor+%231,+Compiler+%232)&apos;,t:&apos;0&apos;),(h:compiler,i:(compiler:r1840,filters:(b:&apos;0&apos;,binary:&apos;1&apos;,binaryObject:&apos;0&apos;,commentOnly:&apos;0&apos;,debugCalls:&apos;1&apos;,demangle:&apos;0&apos;,directives:&apos;0&apos;,execute:&apos;1&apos;,intel:&apos;0&apos;,libraryCode:&apos;0&apos;,trim:&apos;1&apos;,verboseDemangling:&apos;0&apos;),flagsViewOpen:&apos;1&apos;,fontScale:14,fontUsePx:&apos;0&apos;,j:2,lang:rust,libs:!((name:libc,ver:&apos;02126&apos;)),options:&apos;--crate-type%3Dcdylib&apos;,overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:&apos;5
&apos;,n:&apos;0&apos;,o:&apos;+rustc+1.84.0+(Editor+%231)&apos;,t:&apos;0&apos;)),k:38.88242142025611,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;)),l:&apos;2&apos;,n:&apos;0&apos;,o:&apos;&apos;,t:&apos;0&apos;)),version:4&quot;&gt;&lt;/iframe&gt;

&lt;h1 id=&quot;opt-no-panic&quot;&gt;Opt No-Panic&lt;/h1&gt;

&lt;p&gt;What about adding two numbers?  Is this no-panic?&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a trick question: this is no-panic in opt mode only.  For numeric
operations like addition, Rust introduces overflow checks (which panic on
failure) in debug mode, but leaves them out of opt builds.&lt;/p&gt;

&lt;p&gt;We can observe this on Godbolt if we add separate panes for opt and non-opt
builds:&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;400px&quot; src=&quot;https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DEArgoKkl9ZATwDKjdAGFUtEywYgATKUcAZPAZMADl3ACNMYhBpAAdUBUI7Bhc3D29SOITbAQCg0JYIqOkrTBskoQImYgIU908fErKBCqqCXJDwyOjLSuratIbetsCOgq7JAEpLVBNiZHYOAFIvAGZFgFYAIQZUAH0WQ2B6DYARRY0AQRiTMIBqTFUCSIZb5a8nN9uqF4RMWlo9gB3Ei0dAQJggW54FY%2BW5hSHQrwTW4AWkWK2wUJhrwA7JtzhdbkTbkxXl5NnCCYscScOFNaJx1rxPBwtKRUJwAEpmAi3BQzOaYMkrHikAiaOlTADWIHWKwAdABOHEANhhXjVAA4uCtFWr1vpOJJmRL2ZxeAoQBoxRKpnBYDBECAiK5yJQqsBLRwUSj0KgDoFbgA3Fiol4rXgehTKQx/IQIVCAlmitAsGJ0SLBVgLUzmZC3LjyzWSeVszD4IjEPDoPT8QQiMTsKQyQSKFTqVk6dIsARMNC4QgkAASTAUm0wjAuJiIAHkYoxy5xRT3mP2KyQ54wl7wV33UBV8MneDFY1m2JxNgCmAQAJ63CCoGK8sImOjoCYWgiH809Q8xoK0PGiZHmKqCuAAkjWYRXreXBcGKxATtWVqkICxBMDE270oyJqdmaHCXqg153g%2BT5wq%2BoIfjanZ2kgqbpvQZAUBA9EZlExBcMW1o0LQTzEJaEDwnhYSBFUN7bqQInMMQN4zmE2ilOKeGpmwggzgwtDiXhWAvsAThiLQXqilgBxGOI2l4IhZRBpgXplqopTTgsoqBE8DJ4bQeBhOhMkuFgpoEFWLASTZxBhPEmAnJgplHIEoA0XwBiegAangmCApuIF1sIojiM22Vtmopq6D4BhGCAuYWJ5YSWrAzDntVyCkKFmheFwGqSuyT5JF6PrIOhTwore87oicyDoDe1WojOvCoKFVZYLVEBTI02SeBAjj9J48H%2BCM%2BSFHomSJAIW2HfEx0MO0%2B1dPBq3lEMp23X8in3a0V2dFEt0Pa4dR6OYb17R9EgrQK8zA4aHBMqQLJshyHC3JV%2BaFsWpb3gOlbClwVFKVoEzSiAKwaEquqSDiOIrJIKqKhqGgrBDxqkMF6zWjDs0/pa1o43SpD2k6Lq0G64LEJ6nA%2Bn6AYvCGYa3BGpBRv%2BcYJkmEmsYxZ45jySNFiWZbrgttayA2eXSAVShFXhujwbua6DsQI5jhODBTrO85BDW3A7r2NuVplEnW/uX6BBJJ55NmF4wSROwMCij7PhR76ft%2BHCJ4ECuAUrIH85BIDQURsHwWhGFYRDUOs/hhHEfeUcx2RL5vtjto83R/oMZEguq10HFcXwdB8QJQlslJYkSUPMlyQpNimipjAEOpmmmjpJh6QZRm8CZhzmWy%2BBWbYNl2bwDyOU8EmuX8prVT5N5%2BQsbKBXgwUe81kThUoUUxZ55UJVQSUKKl6W%2B4/bKRsmwm1kIVDsbJLb6EOBVHk%2BgvJLSmLHHqosUT9WvJgIaN4RorDGhNaqs15rVlsvACA9V2Dlgus1MQJgFheA0G1Tqd17AbQYM4H6aQdpsPemMT6GRzprUevwrISQeEHSetYNaLQ%2BgcO2pYZ6TQGDSOGHkIGX1WhCP%2BtUMRN0QazDBljEuuFYacARprAs2tUYQHRiQTGDcaJTF%2BEwLAURlr0x3LKFmpo4YWhQlzPGpAZSU1LKsFUZMVgqk4iqWmOIIayzLj46iuNsIcC8MYtmyckmdVCgkewkggA%3D&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;This essentially creates a new class of code, which is “no-panic in opt, but
can panic in dbg”.&lt;/p&gt;

&lt;p&gt;For the case of upb, this seems like a great option, because it gives us extra
consistency checks in debug mode without suffering the problems of panic in
release builds.  It is essentially the Rust equivalent of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assert()&lt;/code&gt; in C.
Overflow by itself does not represent a safety issue, so we are not giving up
safety by leaving the panics out of opt builds.&lt;/p&gt;
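
&lt;p&gt;As a rough sketch of the same idea written out by hand (the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_sizes&lt;/code&gt; is hypothetical and not taken from upb), a consistency check can be limited to debug builds with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;debug_assert!&lt;/code&gt;, while the arithmetic itself uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wrapping_add&lt;/code&gt;, which wraps on overflow instead of panicking:&lt;/p&gt;

```rust
// Hypothetical helper for illustration; the name is not from upb.
fn add_sizes(a: u32, b: u32) -> u32 {
    // Evaluated only when debug assertions are enabled, much like assert() in C.
    debug_assert!(a.checked_add(b).is_some(), "size overflow");
    // Explicit wrapping addition: wraps on overflow instead of panicking.
    a.wrapping_add(b)
}
```

&lt;p&gt;With default profile settings, the assertion is compiled out of release builds, so there the function behaves like plain wrapping arithmetic.&lt;/p&gt;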

&lt;h1 id=&quot;rusts-standard-library&quot;&gt;Rust’s Standard Library&lt;/h1&gt;

&lt;p&gt;What about using standard containers like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt;?&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;hint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;black_box&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;black_box&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It turns out this is also “opt no-panic” code (perhaps &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt; is internally
performing some arithmetic which can overflow):&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;400px&quot; src=&quot;https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DEArgoKkl9ZATwDKjdAGFUtEywYgATKUcAZPAZMADl3ACNMYhAAdlIAB1QFQjsGFzcPb3jE5IEAoNCWCKjYq0wbFKECJmICNPdPH1LygUrqgjyQ8MiYyyqauozGvvbAzsLu6IBKS1QTYmR2DjNMAGpzdBAQBECCTbCDZABrAH0w1FUAUgBmACELjQBBe6evK4uAVhuGVGOWQ2B6B8ACLPOImMIrTCqAiRBgrC5eLxOBFeFZUOEITC0Wg/ADuJFo6Agk3h0TujxWlJW9AIKwAbmUQCsAGpla5OExXLzXbDwq5AlmMkBBXHE67kh5Ulb7JhHU7nCAM5CTcXPC7RIEcaa0TjvXieDhaUioTgAJTMtIUs3mqwRVx4pF2hq100OIHeVwAdABOaIANi5XgDAA4uFdvQH3vpOJJ9ZpeCaOLwFCANI749M4LAYIgQERXORKNVgCmOABaMvoVB/QL0lgrMtwq68YsKZSGLFCBCoXEGh1oFhxOiRYKsRamczIFZcT3BySeo2YfBEYh4DZcGSCERidhSTfyJRqeOkXQ%2BFgCWWoXCEEgACSYChumEYDxMRAA8nFGEvOA7z8w0GvFdP0YX9eH/S9KnwPteDiDtRzYTgbhxJgCAATxWCBUDiWkwhMOh0EmZMCGgzhiOg9sgloLsexgx1UFcABJDZ9lQVC0K4DcCGIZ811TUhcWIJg4jA7VdTjZ1jSQlD0Mw7DcPwwkiPTZ1MyQAch3oMgKAgDThyiYguDnNMaFoGFiBTCAwmPMJAmqNCwNIWzmGIND3zCbQyidI0BzYQR3wYWgHMkrA8OAJwxFoUsHSwP4jHEEK8B48oGVLRdVDKN9FgdHYsWPWg8DCITXJcLBj24vAWEchliDOJQgUwOKAUCUBVL4AwS2ZPBMFxEC6P4LdRHEPcBoPFR1Ek099H%2BEAJwsAqwhTWBmEQhbkFIGrNC8LggxdY0cJSUsK2QISYTLdCv2uIFkHQNCFobd8Exq1csCW4lLCxLyUgcBhnFceo9H8UYCiKPQEiSWwBAGTwN3BnIGA6EHug3JpIYYVp%2Bn%2BjIUc%2B5p0eGRGuiiFHhmhvRzDaQnxmJ6YrTmBYJDEjg9VIA0jUTFY5qnGc5wXTCgJIeFXi4ZTvL2t0rg0H1w0kaJoiuSQ/W9IMNCuaMOFjUgqveNM2YTMjLH4sXMxzCAkHzWhCwgVtOArKsazhOl60bFZm1IVtKM7bte0cvStIQ8cLW52d50XZcSD4jdRu3YbpFGxRxuPXQNwgwDw%2BIe9H2fBhXw/L8gg2bhwIvNOb2IPrHNT1AoMCRy4PyMdpLY2SIG%2BBgy3k6VFMI8ja6TXoKI7ajvboi3mJAVj2M4gShJEoumZZvWpI4ZDm4w1uBA7nCu4I0WM1IM2UGrTTIitv3ukM4y%2BDoczLOsyTnPsxzH9c9zPJsY9fMYAgAqC49QpMOFSK0VeCxX%2BAlI0%2BBkq2FSseKEmUYSOVyjqSSC1ipoVKosI0FUqpFw2pEOqmAGpNQKkYfeVAOoKC6j1CueDo5DV3HHWQCcjyTT0AYMhc19CFTetMeSh1bZlhOqhTA500KXX5DdO6hUnqRBepgN6rdG5LnhhtMQJhFheA0NtPaqNvoQEcGTDcQN8hEzBtkNGRisgQxSFTUGONrBowxrULGMMPqOIqATYGZiSZtCsRTGodjka02tAzEW6tF7Hg5lzacIc%2BYQAFsQIW9o96qWmJiJgWAojvRQZrbWusokGxTGmY2roQCKwXK8P0csrh%2BiMn6VWsQUFuyXomXgpT
1ZeAkuzA2HSapJHsJIIAA&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;But once we try to actually push elements into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt;, we’re squarely out
of no-panic Rust:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;400px&quot; src=&quot;https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DEArgoKkl9ZATwDKjdAGFUtEywYSATKUcAZPAZMADl3ACNMYhAAVlIAB1QFQjsGFzcPb3jE5IEAoNCWCKjYq0wbFKECJmICNPdPLh9S8oFK6oI8kPDImMsqmrqMxr72zoKimIBKS1QTYmR2DjNMAGpzdBAQBECCTbCDZABrAH0w1FUAUgBmACELjQBBe6evK4vom4ZUY5ZDYHp3gARZ5xExhFaYVQESIMFYXLxeJzwrwrKiwhCYWi0b4AdxItHQEEmcIA7HdHitKSt6AQViwTLSAG5lEArABqZWuThMVy812wcKugPZLJAQRxROu5IeVJWzOQADpQQoEBAuJMpc9ZfsmEdTucIPKNbdnhcSYCONNaJxorxPBwtKRUJwAEpmWkKWbzVbwq48Ui7B2W6aHGJXBUAThJADZeV5YwAOLhXCOx2LWjiSO2aXjOji8BQgDQBnPTOCwGCIEBEVzkSjVYCFjgAWmb6FQv0CcpYK2bsKuvAbCmUhkxQgQqBx9v9aBYcTokWCrEWpnMyBWXAVCckCsdmHwRGIeA2XBkghEYnYUjP8iUahzpF0PhYAl1qFwhBIAAkmAobphGAeBlUAAeTiRh904f0X2YNAP0PMDGCg3gYLfSp8GnXg4lHJc2E4G5sSYAgAE8VggVA4lpMITDodBJgLAgMM4BiMJHIJaHHSdMIDVBXAASQ2fZUCI4iuFPAhiAA48i1IHFiCYOJkKtG1syDJ18MIkiyIoqiaIJeiSyDMskFned6DICgIFMhcomILht2LGhaGhYhCwgMIHzCQJqmI5DSC85hiGIkCwm0MpA0dWc2EEECGFoXy1KwajgCcMRaCbf0sF%2BIxxESvBJPKZkmz3VQygZRZ/R2TEH1oPAwnkoKXCwB8JLwFg/OZYgziUQFMGy/5AlAIy%2BAMRs2TwTAcUQ7j%2BHPURxGvWbbxUdQ1KffQ/hAVcLFqsJC1gZg8N25BSE6zQvEaaNgydSiUibVtkHk6FmxI8DrkBZB0GI3bexA3NOqPLB9qJSxMXClIHAYZxXHqPR/ECLpCh6U8EiSWwBEGBosjRlIxm6KJT2adGGDaAYYaGUHrGJ0mOgR8ZkZGMn0ix8xRjp/GJGmT05gWTn9BU0h7UdPMVm29dN23XcyPgkg4VedVeAi67QyuDRIxTSQSRJK5JGjCN4w0K5%2BczFCYmLIXc2YywZKVstKwgJAa1oOsICHThW3bTtYUZHs%2BxWAdSCHNixwnKc/Os8zcJXd1xa3Hc9wPEhpNPJaLwW6QlsUFaH10U9ULgxPiB/P8AIYICiGmyDuBQ18C8/Yhpr8/PUHQwI/Ow/Jlw04StIgL4GGbHSVmo2iDPWNv8z6VjRw40PuKdgSQCEkSxNk%2BTFOr5SOFtQWHzzAie9IvuBEHyjh70ujFdLUgHZQDszMiF2I56OyHL4OgXLcjy1ICny/N/oKIUwo2AfFFRgBBYrxQfElEwKU0oZV4FlP4uVHT4AKrYIqD5IRlWhH5KqGZHS7QasRJqixHStXatXU6kRuqYF6v1WqRhr5UFGgocak1G5UNTvNK8GdZBZ3vGtPQBgmHbX0HVYG0wdJ3Xds2R6RFMAvWIm9IUn1vp1X%2BpEQGmBgZ9y7vuHIDBTpiBMIsLwGgLrXSJhDCAjhMZwyhnjJGBNsaGPsSjbIxMnETEJmDFoJN%2Bi1HJizPx1NAneIZqzJmsNCbhPZs4vm3NvR8wzDvC26kOCixjhuOOUsIAy2IHLP0BlbbTAxEwLAUQQYZizKQdq0RzZ7ytoWYspTSChl1ruV40Yt
ZXGjPZaMhsSTGwDukvMV8jJby8KpYWVs2mdSSPYSQQA%3D%3D&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt; does have a few APIs that will surface allocation errors instead of panicking.
Theoretically, this code should be no-panic:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nd&quot;&gt;#![feature(vec_push_within_capacity)]&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;hint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;black_box&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.try_reserve&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.is_ok&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.push_within_capacity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.is_ok&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;black_box&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This requires the nightly compiler, but I was able to make this work as no-panic
on macOS.  For some reason, it did not work with the nightly compiler on Godbolt,
which appears to always include the panic runtime no matter what I do, even for a
trivial library.  I was not able to figure out why.&lt;/p&gt;

&lt;p&gt;The Rust standard library was not really designed to be no-panic.  For example,
memory allocation failure will panic in most cases.  If we want to be no-panic,
we will probably have to avoid most of the standard library.  Realistically, we
will probably want to go fully &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#![no_std]&lt;/code&gt;.&lt;/p&gt;

&lt;h1 id=&quot;a-dance-with-the-optimizer&quot;&gt;A Dance With The Optimizer&lt;/h1&gt;

&lt;p&gt;Here is another trick question: is this no-panic Rust?&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On one hand, the slice index operation &lt;a href=&quot;https://doc.rust-lang.org/std/ops/trait.Index.html#tymethod.index&quot;&gt;clearly
documents&lt;/a&gt;
that it may panic.  On the other hand, the docs say that this panic will only
be triggered if the index is out of bounds, and we have inserted a guard to
ensure that it never is.  So is the panic reachable?&lt;/p&gt;

&lt;p&gt;If we reason about the code ourselves, we would conclude that the panic is
unreachable.  The compiler is capable of reaching the same conclusion, but only
if we run the optimizer, which can prove through a series of optimizations that
the bounds check will never fail.&lt;/p&gt;

&lt;p&gt;So this example ends up being “opt no-panic”, just like our arithmetic
operation, but for an entirely different reason!&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;400px&quot; src=&quot;https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DEArgoKkl9ZATwDKjdAGFUtEywYgATKUcAZPAZMADl3ACNMYhAAdlIAB1QFQjsGFzcPb3jE5IEAoNCWCKjYq0wbFKECJmICNPdPH1LygUrqgjyQ8MiYyyqauozGvvbAzsLu6IBKS1QTYmR2DgBSLwBmJYBWACEGVAB9FkNgek2AESWNAEE4kzCAakxVAkiGO5WvJ3e7qleETFpaPsAO4kWjoCDoJhVEBvLwANk2WxMAA4zpM7gBaJarbB3FFvaJbC6XO6ku54Kh3SFVAB09AYEHR2Kcdy4BKJVzJXLuxEwBDmrw02I5JLJS2i505ZOpTERQo2ksu4tOHGmtE4G14ng4WlIqE4ACUzAQ7gpZvNMLDVjxSARNKrpgBrEAbVY0gCc0Thqy88NWyK4q3d3o2%2Bk4ki19r1nF4ChAGlt9umcFgMEQICIrnIlGqwDjHAxGPQqEOgTuADcWJjXqteLmFMpDP8hAhUEDtTa0Cw4nRIsFWItTOZkKyacjJDTdZh8ERiHh0Hp%2BIIRGJ2FIZIJFCp1DqdJkWAImGhcIQSAAJJgKLaYRiXExEADycUY084NoPzGPM5IT8Yb94H5HqglT4B2vBxE2/ZsJwWyAlCACedwQKgcQmmEJh0OgkyxgQoExr0oGNkEtAtm2YG2qgrgAJILmEcEEPBXBcLavKYPO8akECxBMHE/5qhqka7tGHCwagCFIShaEYWC2GJruyZIF2Pb0GQFAQEpvZRMQXDjgmNC0M8xBxhAYRRmEgTVPB/6kOZzDEPBD5hNoZR2kJXZsIID4MLQVlCVg6HAE4Yi0PmNpYIcRjiH5eC8uU5aYPmU6qGU96LDagTPOqQm0HgYTcfZLhYFGBBziw1nxcQYSJJgpyYBFxyBKA8l8AYeYAGp4JgQK/uRS7CKI4jrn1W5qFGug%2BAYRggEOFg5WEcawMw0FzcgpAVZoXhcPCDp6qhKT5oWyDcc8GIMc%2B2KnMg6DwXNmIPrwqAVXOWALYylj/C5KQOAwziuPUej%2BKMBRFHoCRJLYAgDJ4zFgzkDAdMD3TMU0EMMK0/R/RkyMfc0aPDAjXRRMjwxQ3o5htAT4xE9MZpzAsEj8RwmqkNqur6hwPLGiOXBjhOGhISes5WlwsmuVokxOiAqwaB6QaSNE0SrJIcLuvCGirGGHARqQZUbAmrMPfhcYJmLqqkCm6aZrQ2YQPWnCFsWpavJW1Z3LWpD1kRzatu21kaSpUGDlzo7jpOvDTqez2LrIK6DdIw1KKNQm6MxgFfpHF5XjeDB3o%2Bz5BAu3AAYe6ezj11lp8BuGBNZEH5AOMH0YhEC7AwGKSXc6GYbJ5h4RwOGEU2JE%2B%2BRVs0SAdFiQxTGcdxvFF4zzMG8JoniS3Ajt6hnfSVhvCmwpKAlspkQ2/73TabpfB0IZxmmUJtmWdZD/2Y5zk2FG7mMAQXk%2BVG/kmIFYKoVeDhSOFFXU%2BBYq2HiolcOyVkCpWshlf4UY5r5XgoVRYuoSp4DKkXNakQqpKFqvVHKU1mpUFagoDqXVy74L6rHNc8dZAjR3LqFO%2BgjjTWNPoXKr1piSX2vbDER0oSYFOvBc6qxLrXTmg9J684ErwBbg3COqM1piBMIsLwGhNo7RRl9CAjhSbMUBvkQmoNsioxMVkcGKRKYg2xtYVG6NaiY2hu9ZxFR8ZAwscTNoNjyY1AcUjGm5p6Yi01kvKM7NObDhDnzAW35iDC1FkmaYfwmBYCiG9LK2tdb6xiUbDi%2B9JbK0nGsOECtVhwh0nCdWsQsru2XuzPe6TNZ
eEEmzfCpSCFGS%2BpIIAA%3D%3D&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;This is quite an interesting result that totally changed my thinking about
Rust’s bounds checks.&lt;/p&gt;

&lt;p&gt;My previous perspective was that Rust will insert all of these unnecessary
bounds checks, bloating the code and slowing it down for no reason.  But our
pre-existing C code is not throwing caution to the wind and hoping for the
best.  Every place that we perform an index operation in C, it’s because we
believe we have a proof that the index is in bounds.  To avoid the bounds
checks in Rust, we just need to express this proof in a way that the Rust
optimizer can understand.  This is what I call the “dance with the optimizer.”&lt;/p&gt;
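
&lt;p&gt;A complementary option, sketched here as an illustration rather than taken from the upb code, is to avoid the indexing operator altogether.  With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;, the out-of-bounds case becomes an ordinary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt; value in the source instead of a panic that the optimizer has to prove unreachable.  The function name is hypothetical, and a small fixed-size array stands in for a slice to keep the sketch short:&lt;/p&gt;

```rust
// Hypothetical helper for illustration; a small array stands in for a slice.
fn byte_or_zero(header: [u8; 4], index: usize) -> u8 {
    // get() returns None when the index is out of bounds, so the fallback
    // is spelled out in the source rather than left to a bounds-check panic.
    header.get(index).copied().unwrap_or(0)
}
```

&lt;p&gt;Whether the generated code is actually free of panic paths in a given build is still worth confirming in the compiler output.&lt;/p&gt;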

&lt;h1 id=&quot;a-slightly-more-dangerous-dance&quot;&gt;A Slightly More Dangerous Dance&lt;/h1&gt;

&lt;p&gt;In the example above, the bounds check is eliminated using only safe code,
but there are other cases where we might need to use unsafe code to help
the optimizer know about program invariants that cannot be easily derived
from the program flow.&lt;/p&gt;

&lt;p&gt;For example, consider this (admittedly contrived) program:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;usize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Invariant: ofs &amp;lt; data.len()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this program, our struct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; has an invariant that the offset &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S::ofs&lt;/code&gt; will
always be in bounds.  This invariant effectively guarantees that the bounds
check in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S::get()&lt;/code&gt; will never fail.  And we can strongly guarantee that the
invariant holds, because it is enforced by our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new()&lt;/code&gt; function, which is the
only code that sets these struct members.&lt;/p&gt;

&lt;p&gt;But the optimizer isn’t capable of reasoning at this level, so it thinks that
the panic is reachable, and keeps the bounds check in the program, even in opt
mode:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;600px&quot; src=&quot;https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DEArgoKkl9ZATwDKjdAGFUtEywZ7HAGTwNMAHLuAEaYxCAArKQADqgKhHYMLm4eerHxtgK%2B/kEsoeFRVpg2iUIETMQEye6eXJaY1pkMZRUE2YEhYZGW5ZXVqXXmre25%2BZEAlJaoJsTI7BzRJsEA1OamNstCAKQAzE5gHEy72MtbAOwAQlsaAILL98voTOUgpwBMAGwHRxEXJgAcWwiABFSNc7g9UFQFK8zHgAF6YUgPAD0KOWAEkGAA3Cp4QwEV5QhSnPaPZ5MAB09AYEHG4POwIZtzwLGitF2%2B0Ox02nO%2BPPOV1uD2WixWVAYy38AHcIE8Xu8PkC/oCQeNlgBaHkAeWiTU52x2J0F4JFIpYz2QCHJ5WpjDpp0uprNLo0pKZRuWAQESOdLpFkt2HpOQlQbAg20u8qYr2jyOJr0lGuWXEZkz9ZsZGaztwzYuWEuWwEwBAgW0%2BVio6q1noBjqFEJdlcp0eVzeJQKZwoeOZuvYZbx2yoYqAA%2BhajPRO%2BD85hVAQwoG3m8nOW3gXJQgGrQx9KSLR0BAYYqhNWeXWTd37gpKcXS/Tc2dgRxJrROBFeJ4OFpSKhOAAlMwCFWaZZkwd4dh4UhCW/F9JgAa0iHZKQATjOD4dmXDD/i4HYUIwqI3w4SRP00Xg/w4XgYQ0aCyMmOBYBgRAQCIVxyEoCpgBhDgNQ1dAwyYPxlmxFhNUlHZeE4hRlEMBohAQVBpS/KC0DZOgwgCVh5lMcxkBTSl/kkSkf0wfAiGIPB0D0fhBBEMR2CkGRBEUFR1Fg0hdDeUgWAEJg0FwQgSAACSYBQLkwRgbhMIhdUYUzOCgnzmH8sySFi5TeCSvzUDKfAMpiWTNLYTgLh3Z4AE9lggVA9WWYITDodBxiogg8s4Fq8pk/xaHkxT8tY2gMSs4IyoIcquDqAhiAiyyQBo6ViCYaIEpffR31I9yKNK1AKqqmrgPqxrmto2D6KQVTonUsgKAgC6rpAYguEMmiaFoBdiBhCBgjI0hgj8CpypW37/uIcrtWCbRihgn9VLYQRtQYWhAfcrB6uAJwxFobioKwCdgHEFG8GmkpsUwbiTNUYpovmKC/AXIif1oPBgkW0GXCwH6ptZIHSeIYI4kwYFMDxpmjDovgDC4gA1PBMGldKgZs4RRHERylZctQfs8/RDGMID9GZmFYGYYqmeCZBSF5zQ3i4T44N/PVEm43jkEWhcNTG6JMCDZB0HKs3NW1cjeYsrAjbpepGkSBwGGcVwai8WORk6cI6nSBIBH6WoYjiDOGGTvIukGBoodKXoqnjgZI9LgQWkqAuxkGcus70IZ678DpC9TyYFFAuYJFfdbSC/H8KOWHSCD0rgDKMt0IAC8yIK4Y7oftxCdg0VC8MkM4zh2SQPhQz4NB2NbiMyyIaJH8j2ssOaTq0eimIgJABvYiApM4Xj%2BItISRLE5YElSBSS6nJBSSkgZ3XoMQIq2kgJTxnsZXgplAqh2srIOyqtpDqyUJrdyug6hZRSqgkKYUIoMCijFL2/grLcEyr5Yh5kFZ0O8gwnKrU/BA2iIVLSJVRqVQgCOBgGp9p1QageY65g2qUR6J1WSPVwH9VQK4IaIARo7TGhNUgC0lorUHhwD8w8fpbX4VVIRIjaqHQkbwVeZ0UBhkutA9%2BUCuiPWenwOg71PrfXcn9ZgoMgZ%2BIBuDSGNgfqw0YAQBGSMfqoxMOjTG2NeC411gTH8%2BBia2FJuTZBlNkDUyBnTBoP0zas3KuzeYP4uYsB5mEfmSghYiz8KAU6EsmDS1lvL
ahisMEqwctg2QGs3I/gITrMWE8DbBHDpMfaTsv4alds8TAHtypex9n7M2wcwihzJvAQRvCUFNEtmIEw8w3gaBtvbIoJR7AQEcC3OoPgO6jCLjnDIiQHlvLzg3V51ymh1wrikbOfyy7DGeSnVuzdK7AvLj87uUwZj92XmfQx19fycHHvA/ShljJVQXiQJeK86KTC3EwLA4QI5ERIt5S%2BRjNq32og/NeIAD7GUHB8XeOwPhPQ%2BCfM4Z8gFooojY4lZ83gbVHrfWxkxebxHsJIIAA%3D&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;To make this no-panic, we need to help the compiler out by reminding it that
this struct invariant holds in the critical path:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;hint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;assert_unchecked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;usize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Invariant: ofs &amp;lt; data.len()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;check_invariant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;unsafe&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;assert_unchecked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.ofs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.data&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.check_invariant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.check_invariant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[no_mangle]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hello_world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;u8&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This makes use of
&lt;a href=&quot;https://doc.rust-lang.org/std/hint/fn.assert_unchecked.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::hint::assert_unchecked&lt;/code&gt;&lt;/a&gt;,
a very sharp tool for making soundness promises to the compiler.  Here we use
it to inform the compiler of our struct invariant.  This has the desired effect
of making this “opt no-panic”:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;600px&quot; src=&quot;https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DEArgoKkl9ZATwDKjdAGFUtEywYgATKUcAZPAZMADl3ACNMYhAAZmjSAAdUBUI7Bhc3Dz1E5NsBAKDQlgiorktMa1yGIQImYgJ0908fK0wbVOragnyQ8MiYuPNOhszSwbruwuKJAEpLVBNiZHYOM0wAanN0EBAEQIJtpgUlOoB9EwZkBFaAa0x0AFJogCF7jQBBV7f4kzCNglMbGshI8nGAOExHtg1vcAOwvd5rRFrdBMGogaFeABsYIhAFYniYABz3XEAEVInyRa1QVAU6LMeAAXphSEiAPRstYASQYADdanhDPtqbTodEnMjUUwAHT0BgQaafWGkpXvPAseK0EE4yFA7Xg3Ww%2BFvKlUBhrS43E6BfnEQWCCD3LFWKjTaFwylUpHnBRMKjrI1rQ7HAhnC5XZC3dAQF3SmkKMUS2MomqyxgKt3Kz2IrPvbNrb6/M1rIIAdwgKaY6KdmJJBOJZLdAFpdQB5eKVEHA6JQo35qksVGXSWpuUK93Gr1TtYaMUqntrYICFn96fmx7z3sehHT3f0AgbOd6uGV9GV1nx9HmptrLhZ56r3eIhTSy2R618gVChWPSdPqdCKgbAxoqO7/rmJrThBVIQfmhZrMWwCYAQjrOuUrprC2C5EhOj4bOhr4RtcH62vaKGKg%2BYFesmUp1rG8YkiqVGwTCTEfHmXjRHWDCoCcg5GPQjGfPBmCqAQkTrl4XhOE6XgIeaVy0LQvGliQtDRnSGKYkIza6jhfZUS%2BSHkUqrEcLMtCcLivCeBwWikKgnAAEpmAeCjzIsAacTwpD7HZ5mzNcIC4tE0oAJwwpi0RSVFhJcNEYVRbi%2BicJINmaLwjkcLwdIaL5GWzHAsAwIgIBEK45CULUwB0hwTZNugQFMIEay8iwmHmtEvDVQoyiGOUQgIKgpa2T5aAanQkTBKwyymOYyC3tKhKSNK9l3IQJB4FspT8IIIhiOwUgyIIigqOo/mkLoPgsAITBoLgG3EAAEocTyYIwbwmEQ7aMHcnA%2BTdzD3fgRDED9o28IDd2oNU%2BAQwk/XTWwnBPMpqIAJ5rBAqAdmsYQmHQ6DTDlBBw5wJNw31QS0INw3w%2BVtBclsYRowQ6NcKU/zvVtIB5aWxBMPE/3mSlHDWaQtn2VlqOoBjWM4we%2BOE8T%2BX%2BYVSDjfEk1kBQECa9rIDEFwy15TQtDicQdIQGEGWkGEgS1Ojwt2w7xDo62YTaK0fn2eNbCCK2DC0E7F1YPjwBOGItC1T5WD8cA4ih3gxDe3gvKYLVa2qK0X3LD5ezlLbtB4GEAtuy4WC2/86rO%2BnxBhEkmCkpg8fF0YBV8AYNUAGp4Jgpbg87u3CKI4hHcPp1qLbV36IYxiufoJd0rAzDI8XYTIKQdeaF4XBYgFDkdqktX1cgAviU2bPxJgG7IOg6Pr5hraZXXdpYMvCplBUqQOAwziuI0PQ/hAg9CKH0Uo2QUgCGGJ4CBSQoEMAmL0EoX9U4CA6HUGBegWhtHQTUcYIDJjgMsPg%2BoACRgkM6EgsBJRZjuQWEsGYotxaS0ypwNYc0CALS4EtFas4IAPVBhiaIXAVY%2BwPkFaIGhwoJUkDCGE0RJCYjCliDQcRLIcDSqQFgwU8qsIcuTSwvNVZaEKiVCASAGaVQgD1Tg9VGqDham1DqawuqkB6lTAaQ0RrO31vQYgSNZquW4bw1avB1qgx5jtWQ%2B0x7SAnkoKeF1dClChsDR6L0FBvQ%2Bl9VA4M/rcEhrddJoNB6FO0cUmGpNAjO3iIjGaKNWaYwgDxB
gTYFZ4wJupFWmwanZRIZTfqNNvH01QK4JmIAWayzZhzUg/NBbCwslZdKF1pZNKxq09puMlbdN4OI9WKAgJa38dYvxfQjYmz4HQC2VsbYXXtswN2zsHmOw9l7Gwts/aMAIIHYOtsw4mAjlHGOvA45z0TvZfAKc2jp0zuE7OyBc7OwLho%2By68y7owrssey1cdHlLrg3JQzdW6BFAGrTuTAe59wHtfeGw9YmHXibISe517IpNnu3Thi8wgf1mArY%2Bdimxn1RJgS%2B6Nr633vuvF%2BkQ34Z3gC0hpETUhbzECYZYXgNC7wPjgyov9/4ZFgb4P%2B1CphwJyKkLB5qEGmuIbq9opCrWoNwVUUhtqUFjDIYa7BbrCHIKYfQzyTCNEsNtllDhwTFrLVWljQRJBhGiL2QVWYVwmBYCiJ/DRWidG4j0WGwxuUTESJAEo1anFMTyOiJiY2mI1EwlFm4/RWUk1qyWRwLwKypaGP2bMOuyR7CSCAA%3D%3D&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;This definitely requires care; we have to be very sure that the predicate we
pass to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assert_unchecked&lt;/code&gt; is true.  Luckily we can fuzz against this assertion
to increase our confidence (in debug mode, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assert_unchecked&lt;/code&gt; will panic if the
condition is not true).  Used judiciously, it can be a powerful tool for
explicitly expressing to Rust the invariants we were relying on to make index
operations safe in C.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;No-Panic Rust is not for the faint of heart.  It requires a lot of careful,
detailed work, and forces you to give up some niceties of Rust, like the
standard library.  But if we are diligent, it can give us the performance, code
size, and error reporting behavior of a C library with the extra safety that
comes from Rust.&lt;/p&gt;

&lt;p&gt;This extra safety comes from the fact that Rust will automatically insert
bounds checks anywhere it cannot prove that an access is safe.  This puts
the burden on us to justify to the compiler in every case why the bounds
check is safe to elide.  In some cases this will mean detecting a bounds
violation explicitly and reporting the error to the caller (especially in
parsers, where we do not know whether the input is valid or not).  In other
cases, we may know through a program invariant that the index will always
be in-bounds, and we will need to communicate this invariant to Rust.&lt;/p&gt;
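
&lt;p&gt;As a minimal sketch of the first case (the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_byte&lt;/code&gt; is
hypothetical and not taken from upb), a parser handling untrusted input can use
the checked accessor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slice::get&lt;/code&gt;, which returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt; for an
out-of-bounds index instead of panicking:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical helper, for illustration only.
// Reads one byte from untrusted input at offset `ofs`.
pub fn read_byte(data: &amp;amp;[u8], ofs: usize) -&amp;gt; Option&amp;lt;u8&amp;gt; {
    // `get` performs the bounds check itself and yields `None` on failure,
    // so the caller receives an ordinary value rather than a panic.
    data.get(ofs).copied()
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A caller would then translate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt; into whatever error code its own API
reports for malformed input.&lt;/p&gt;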

&lt;p&gt;I should be clear that I have not yet attempted this technique at scale, so
I cannot report on how well it works in practice.  For now it is an exciting
future direction for upb, and one that I hope will pay off.&lt;/p&gt;

&lt;p&gt;To make this technique practical, we need a tool that can diagnose where a
panic handler was reachable from.  The main technique we used in this article
(looking at binary size) does not give us any information about where a panic
came from.  On macOS, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-why_live&lt;/code&gt; linker option is perfect for this.  I
hope other linkers like LLD will add support for this option also.  If not, a
standalone tool could be written that analyzes a binary after it’s linked to
find the chain of references that lead to a panic handler.&lt;/p&gt;

&lt;p&gt;It would be nice if Rust made it easier to stay within the no-panic subset.
It’s clear that writing no-panic code is not a core use case that the language
focuses on, but there are many situations (embedded, Linux kernel, etc.) where
we want to avoid panics.  It would be nice if functions or even crates could
advertise themselves as no-panic and have the compiler enforce this
transitively.  Changing a function from no-panic to panicking would then be an
API-breaking change.&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There is some interesting discussion in &lt;a href=&quot;https://internals.rust-lang.org/t/enforcing-no-std-and-no-panic-during-build/14505&quot;&gt;Enforcing no-std and no-panic
  during
  build&lt;/a&gt;,
  which links to some relevant Linux kernel mailing list threads.  Another
  interesting thread is &lt;a href=&quot;https://users.rust-lang.org/t/negative-views-on-rust-panicking/69796&quot;&gt;Negative views on Rust: panicking&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:0&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If we are willing to go &lt;a href=&quot;https://docs.rust-embedded.org/book/intro/no-std.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#![no_std]&lt;/code&gt;&lt;/a&gt;,
  we can mitigate this code size overhead by writing our own
  &lt;a href=&quot;https://doc.rust-lang.org/nomicon/panic-handler.html&quot;&gt;panic handler&lt;/a&gt;,
  which we could engineer to be much smaller than the std one.
  This does address the code size concern, but it does not compose
  well, as there can only be one panic handler for an entire binary,
  so it doesn’t make sense for a library to provide one. &lt;a href=&quot;#fnref:0&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Some Rust panics can technically be caught with &lt;a href=&quot;https://doc.rust-lang.org/std/panic/fn.catch_unwind.html&quot;&gt;catch_unwind&lt;/a&gt;,
  but this is full of caveats and is not designed as an error recovery
  mechanism. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We choose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cdylib&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;staticlib&lt;/code&gt; for this exercise, because
  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cdylib&lt;/code&gt; will invoke the linker to produce a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt;. This will
  perform a garbage-collection pass that discards unreachable code,
  so that we only count code that would actually get pulled in if
  you statically linked the library into a binary. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There are some crates specifically designed to test directly for
  the no-panic condition, including
  &lt;a href=&quot;https://docs.rs/no-panic/latest/no_panic/&quot;&gt;no_panic&lt;/a&gt;,
  &lt;a href=&quot;https://docs.rs/panic-never/latest/panic_never/&quot;&gt;panic_never&lt;/a&gt;, and
  &lt;a href=&quot;https://docs.rs/no-panics-whatsoever/latest/no_panics_whatsoever/&quot;&gt;no_panics_whatsoever&lt;/a&gt;.
  None of them are ideal for our purposes (the second two are only for
  binaries, not libraries, and the first has to be applied to every
  relevant function individually), and anyway none of them are
  available in Godbolt.  These generally work by trying to create
  linker errors if the panic handler is linked in. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Unfortunately this does not appear to work with the nightly Rust
  compiler.  On nightly, &lt;a href=&quot;https://godbolt.org/z/r8WvG6GeK&quot;&gt;the resulting binary is &amp;gt;300KB&lt;/a&gt;,
  indicating that the panic runtime was linked in.  This seems like a
  regression, and I have not been able to diagnose why this is. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 03 Feb 2025 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2025/02/03/no-panic-rust.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2025/02/03/no-panic-rust.html</guid>
        
        
      </item>
    
    
      <item>
        <title>An Ode to Header Files</title>
        <description>&lt;p&gt;C and C++ have a somewhat distinctive feature that almost no language since has
decided to replicate, which is to put public API declarations into separate
files called &lt;em&gt;header files&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// square.h: defines public API&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stddef.h&amp;gt;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// for size_t&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;SquareArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// square.c: defines implementation&lt;/span&gt;

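&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&quot;square.h&quot;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Internal-only helper.&lt;/span&gt;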
&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;SquareNumber&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Implementation of public API.&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;SquareArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SquareNumber&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;More “modern” languages almost universally choose to collapse header and source
into a single file, where public functions are marked in some special way:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Rust: functions are exported with &quot;pub&quot;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;square_number&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;square_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.iter_mut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;square_number&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Java: functions are exported with &quot;public&quot;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Square&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareNumber&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;squareNumber&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]);&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I think this move away from header files is unfortunate. Separating public API
declarations into their own files offers many benefits that cut to the heart of
good software engineering practice.  In this blog post I will articulate what
these benefits are.  My hope is that modern languages might consider adopting
something like header files (a few do to some extent, which I will explain later).&lt;/p&gt;

&lt;h1 id=&quot;an-obsolete-mechanism-repurposed&quot;&gt;An Obsolete Mechanism, Repurposed&lt;/h1&gt;

&lt;p&gt;You may find it strange that I would advocate for header files, given that they
are effectively obsolete, at least compared to their original purpose.  Header
files were initially designed to solve a &lt;em&gt;technical&lt;/em&gt; problem for the compiler,
which is how to share macros and function declarations between translation
units in a way that supports separate compilation.&lt;/p&gt;

&lt;p&gt;But newer languages have convincingly demonstrated that separate compilation
can be achieved without header files, and especially without the primitive and
problematic paradigm of textual inclusion (via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#include&lt;/code&gt;), which is not
hygienic, meaning it can lead to namespace collisions, unbalanced scopes, and
numerous other practical problems.&lt;/p&gt;

&lt;p&gt;So why do I advocate for an obsolete language feature?  Because I believe that
putting a module’s public API into a separate file is a genuinely good practice
on the merits.  Even if the &lt;em&gt;machine&lt;/em&gt; no longer needs us to do it, it is good
for &lt;em&gt;humans&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I’m arguing for an updated, “modern” version of header files.  This implies
two clear breaks with C and C++:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We should obviously not use textual inclusion, but instead make headers
hygienic and modular, a la
&lt;a href=&quot;https://en.cppreference.com/w/cpp/language/modules&quot;&gt;C++20 modules&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;We should allow headers to contain precisely the parts of a module’s API
that are public, no more and no less.  They should contain no implementation
details whatsoever.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;C and C++ both traditionally violate (1), but C++ has tackled this problem
head-on with &lt;a href=&quot;https://en.cppreference.com/w/cpp/language/modules&quot;&gt;C++20
modules&lt;/a&gt;, which should
theoretically solve the problem.&lt;sup id=&quot;fnref:0&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:0&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;C and C++ also violate (2) in many ways.  For example, C++ requires private
member variables and functions to be declared in the class body (which will go
into the header file for any public class), even though they are not part of
the public API.  Template functions also have to be defined in header files if
users will instantiate them, even though the function definitions are
implementation details.  C and C++ both require a function to be defined in a
header if you want it to be inlinable.  For these reasons and others, C and C++
force a lot of implementation details into headers, which goes against my
vision of what header files should be.&lt;/p&gt;

&lt;h1 id=&quot;the-case-for-header-files&quot;&gt;The Case for Header Files&lt;/h1&gt;

&lt;p&gt;The case for header files boils down to this: the split between interface and
implementation is our primary tool in the struggle against software complexity.
It is how we break software down into pieces that can be implemented, tested,
and evolved separately from each other.  A header file is a place to specify
the interface contract between one component and the outside world.&lt;/p&gt;

&lt;center&gt;&lt;img width=&quot;550&quot; src=&quot;/img/Headers/InterfaceVsImplementation.jpg&quot; /&gt;&lt;/center&gt;

&lt;p&gt;The header file describes the user’s view of a software module, which
consists of both the formal function declarations and the comments describing
the semantics of those functions.  Together these form the contract of the
module’s API.  If there is disagreement about whether a particular behavior is
a bug or a feature, the header file is the contract we use to litigate what is
promised vs what is not.&lt;/p&gt;

&lt;p&gt;Without header files, where is this interface specified?  It is sprinkled
across a mound of implementation details.  There is no way to focus specifically
on the parts of the API that are public.  It’s like going to a restaurant and
being presented with a menu that has full recipes for each item, and trying to
glean from each recipe what the dish will be like.&lt;/p&gt;

&lt;p&gt;The public interface can be extracted from this jumble by a tool like a
documentation generator, which filters out the non-public functions and methods
and generates some HTML.  Indeed, a tool like
&lt;a href=&quot;https://en.wikipedia.org/wiki/Javadoc&quot;&gt;Javadoc&lt;/a&gt; or
&lt;a href=&quot;https://doc.rust-lang.org/rustdoc/what-is-rustdoc.html&quot;&gt;rustdoc&lt;/a&gt; is the
closest we can get to header files in many modern languages, and this does
allow us to view the public API as a unified thing.&lt;/p&gt;

&lt;p&gt;But generated docs are not integrated into the software authoring, versioning,
and review process.  They aren’t source files, so you can’t easily diff, blame,
or comment on changes to them during code review.  You probably won’t see them
in your IDE or in the GitHub source browser.  Doc generators are useful, but
they cannot offer a versioned record of the public API the way header files
can.&lt;/p&gt;

&lt;h1 id=&quot;auto-generated-headers&quot;&gt;Auto-Generated Headers?&lt;/h1&gt;

&lt;p&gt;One of the main objections to header files is that they violate the principle
of &lt;a href=&quot;https://en.wikipedia.org/wiki/Don%27t_repeat_yourself&quot;&gt;don’t repeat
yourself&lt;/a&gt;.  A header
file must be updated whenever the corresponding source file changes (if a
public API was affected). This is busy work, which is tedious and a drag on
developer productivity.&lt;/p&gt;

&lt;p&gt;What if we take inspiration from doc generators and make a tool do the work
for us?  What if we tweaked a doc generator to spit out code instead of HTML?&lt;/p&gt;

&lt;p&gt;Let’s take the following example in Rust:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// example.rs: a normal Rust source file.&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cd&quot;&gt;/// The Bar type.&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;#[repr(C)]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;cd&quot;&gt;/// Constructs a new Bar.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You could imagine the compiler generating the following header file which
specifies the public API:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// example.api.rs: A header file describing the public API.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// The compiler has removed all non-public APIs and all implementation.  It has&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// kept the rustdoc comments, because those comments are part of the API&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// contract.&lt;/span&gt;

&lt;span class=&quot;cd&quot;&gt;/// The Bar type.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;cd&quot;&gt;/// Constructs a new Bar.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For projects that care about ABI stability, you could even imagine the compiler
generating a separate “ABI Header” file:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// example.abi.rs: A header file describing the public ABI.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// The compiler has emitted ABI information about all public APIs. We only&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// need to include ABI information that is not specified in the API itself;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// this includes:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//   - struct member offsets.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//   - enum constant numbers.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//   - definitions of any function that is allowed to be inlined.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Rust only supports a stable ABI for `#[repr(C)]`, so ABI information is&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// only emitted for types that use the C ABI.&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[abi(size=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;align=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bar&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nd&quot;&gt;#[abi(offset=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;)]&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This (hypothetical) compiler has done quite an interesting thing for us.  It
has extracted all aspects of our source file that are API- or ABI-impacting
into a readable and parseable form.  As we change the source file, we can
monitor these generated header files to see if our changes affected the API or
ABI.&lt;/p&gt;

&lt;p&gt;Suppose we change &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar::a&lt;/code&gt; (a private member) from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i32&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bool&lt;/code&gt;.  Does it
change the public API?  No, because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar::a&lt;/code&gt; is private, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.api.rs&lt;/code&gt;
will be unchanged.&lt;/p&gt;

&lt;p&gt;Does changing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar::a&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i32&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bool&lt;/code&gt; change the public ABI?  Again no,
because padding will cause &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar::b&lt;/code&gt;’s offset to remain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4&lt;/code&gt;.  So this change
will not result in any diff to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.abi.rs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But what if we changed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar::a&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i64&lt;/code&gt;?  Then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar::b&lt;/code&gt;’s offset will change to
8, which &lt;em&gt;is&lt;/em&gt; an ABI break, and this will be reflected by a corresponding
change in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.abi.rs&lt;/code&gt;.&lt;/p&gt;
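
&lt;p&gt;The offsets in this example can be checked against a real compiler.  Here is a
minimal sketch, assuming Rust 1.77 or later (for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::mem::offset_of!&lt;/code&gt;) and
assuming &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar&lt;/code&gt; is declared &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#[repr(C)]&lt;/code&gt;.  The three &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BarWith*&lt;/code&gt; structs are
hypothetical stand-ins for the three versions of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bar&lt;/code&gt; discussed above:&lt;/p&gt;

```rust
use std::mem::offset_of;

// Hypothetical copies of `Bar`, one per candidate type for the private field `a`.
// `#[repr(C)]` is assumed, since that is the only layout Rust keeps stable.

#[allow(dead_code)]
#[repr(C)]
pub struct BarWithI32 {
    a: i32,
    pub b: i32,
}

#[allow(dead_code)]
#[repr(C)]
pub struct BarWithBool {
    a: bool,
    pub b: i32,
}

#[allow(dead_code)]
#[repr(C)]
pub struct BarWithI64 {
    a: i64,
    pub b: i32,
}

/// Returns the byte offset of `b` for the i32, bool, and i64 variants, in that order.
pub fn offsets_of_b() -> [usize; 3] {
    [
        offset_of!(BarWithI32, b),
        offset_of!(BarWithBool, b),
        offset_of!(BarWithI64, b),
    ]
}

fn main() {
    let [with_i32, with_bool, with_i64] = offsets_of_b();
    println!("a: i32  gives offset of b = {}", with_i32); // 4
    println!("a: bool gives offset of b = {}", with_bool); // 4, thanks to padding
    println!("a: i64  gives offset of b = {}", with_i64); // 8, an ABI break
}
```

&lt;p&gt;The first two offsets agree and the third does not, which is exactly the
difference that would show up as a diff in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.abi.rs&lt;/code&gt;.&lt;/p&gt;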

&lt;p&gt;This is useful information for humans, but it could also be a signal to the
build system.  How do we know if dependent modules need to be recompiled?
Only if our API or ABI changed, which we can detect by monitoring changes
to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.api.rs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.abi.rs&lt;/code&gt;.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
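
&lt;p&gt;As a rough sketch of how a build system might consume this signal, the code
below compares two snapshots of the generated headers.  Everything here is
hypothetical: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HeaderSnapshot&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;InterfaceChange&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;classify&lt;/code&gt;, and the header
contents are made up for illustration, and a real build system would more likely
compare content hashes than whole strings:&lt;/p&gt;

```rust
/// Hypothetical snapshot of the two generated header files for one module.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct HeaderSnapshot {
    pub api: String, // contents of example.api.rs
    pub abi: String, // contents of example.abi.rs
}

/// What changed between two snapshots, as seen by dependent modules.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InterfaceChange {
    None,
    AbiOnly,
    Api,
}

/// Classifies a change by comparing the generated headers only.
/// The implementation source is never consulted.
pub fn classify(old: HeaderSnapshot, new: HeaderSnapshot) -> InterfaceChange {
    if old.api != new.api {
        InterfaceChange::Api
    } else if old.abi != new.abi {
        InterfaceChange::AbiOnly
    } else {
        InterfaceChange::None
    }
}

/// Dependents need recompiling only if the API or ABI header changed.
pub fn dependents_need_rebuild(change: InterfaceChange) -> bool {
    change != InterfaceChange::None
}

fn main() {
    // Illustrative header contents, not real compiler output.
    let before = HeaderSnapshot {
        api: "pub struct Bar { pub b: i32 }".to_string(),
        abi: "Bar: size=8 align=4; b: offset=4".to_string(),
    };
    // Changing the private field `a` from i32 to bool leaves both headers untouched.
    let after_bool = before.clone();
    // Changing `a` to i64 moves `b`, so only the ABI header changes.
    let after_i64 = HeaderSnapshot {
        api: before.api.clone(),
        abi: "Bar: size=16 align=8; b: offset=8".to_string(),
    };

    for (label, after) in [("a: bool", after_bool), ("a: i64", after_i64)] {
        let change = classify(before.clone(), after);
        println!(
            "{}: {:?}, rebuild dependents = {}",
            label,
            change,
            dependents_need_rebuild(change)
        );
    }
}
```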

&lt;p&gt;We will want to generate these files as eagerly as possible, so we get early
feedback.  API headers should definitely be checked into source control, so
that they are versioned and visible to code review.  Checking in ABI headers
will probably be overkill for projects that do not care about ABI stability.&lt;/p&gt;

&lt;h1 id=&quot;benefits&quot;&gt;Benefits&lt;/h1&gt;

&lt;p&gt;Header files help with software engineering in several practical ways.  I will
argue this case with some specific examples.&lt;/p&gt;

&lt;h2 id=&quot;deep-modules&quot;&gt;Deep Modules&lt;/h2&gt;

&lt;p&gt;In John Ousterhout’s book &lt;em&gt;A Philosophy of Software Design&lt;/em&gt;, he argues that one
of the most important design principles in software is to make modules &lt;em&gt;deep&lt;/em&gt;,
such that a relatively small interface hides a large amount of implementation.&lt;/p&gt;

&lt;p&gt;He illustrates this point in the following diagram, which is taken from the
book:&lt;/p&gt;

&lt;center&gt;&lt;img border=&quot;1&quot; width=&quot;550&quot; src=&quot;/img/Headers/DeepModules.jpg&quot; /&gt;&lt;/center&gt;

&lt;p&gt;Over the years I have come to solidly agree with this principle.  The best
designs are ones where the ratio of interface size to implementation size is
kept as low as possible.  The motivation is to &lt;em&gt;hide complexity&lt;/em&gt;, so that a
user of your module does not have to see all of the implementation details.&lt;/p&gt;

&lt;p&gt;How can we keep tabs on how deep our modules are?  If we have some code open
in an editor, or in a code review, how can we tell how well we are doing?&lt;/p&gt;

&lt;p&gt;If we are using header files, the answer to this question is pretty simple:
just compare the size of the header file to the size of the source file.  The
deepest modules will have very small header files – maybe only a single
function! – and much larger source files.  We should get a good feeling when
our header files are small, and a sense of dread when they are large and
sprawling.  Big header files are a code smell.&lt;/p&gt;

&lt;p&gt;For languages that do not use headers, it is much more difficult to eyeball how
deep a module is.  A deep module would be one where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;public&lt;/code&gt; keyword
appears only a few times in a file that is hundreds or thousands of lines long.
But this is not easy to measure visually, and it is not easy to spot in code
review.&lt;/p&gt;

&lt;p&gt;Searching for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;public&lt;/code&gt; keyword doesn’t even really answer the question,
because a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;public&lt;/code&gt; method could be a public API of a private type, in which
case it is not truly exported from the module.  For example, consider the
following code in Java:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Square&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Squarer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// These are public methods, but on a private type, so not truly&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// part of the module&apos;s public API!&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Squarer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;arr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;square&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// This is the only truly public API.&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Squarer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;square&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;public&lt;/code&gt; keyword shows up three times in this source file, but only one
of these is actually visible outside of this file.  So it is actually quite
hard to gauge how deep a module is in most programming languages.  With header
files, it is trivial.&lt;/p&gt;
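
&lt;p&gt;Rust&apos;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pub&lt;/code&gt; keyword has the same ambiguity.  As a rough sketch (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;square&lt;/code&gt;
module and everything in it are hypothetical names that mirror the Java example),
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pub&lt;/code&gt; appears three times below, but only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;square_array&lt;/code&gt; is reachable from
outside the module:&lt;/p&gt;

```rust
mod square {
    // A private type: nothing outside `mod square` can name it.
    struct Squarer {
        values: [i32; 4],
    }

    impl Squarer {
        // These methods are marked `pub`, but they live on a private type,
        // so they are not part of the module's exported API.
        pub fn new(values: [i32; 4]) -> Squarer {
            Squarer { values }
        }

        pub fn square(self) -> [i32; 4] {
            self.values.map(|v| v * v)
        }
    }

    // This is the only item that is actually exported.
    pub fn square_array(values: [i32; 4]) -> [i32; 4] {
        Squarer::new(values).square()
    }
}

fn main() {
    println!("{:?}", square::square_array([1, 2, 3, 4]));

    // This line would not compile, because `Squarer` is private to `square`:
    // let s = square::Squarer::new([1, 2, 3, 4]);
}
```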

&lt;h2 id=&quot;software-versioning&quot;&gt;Software Versioning&lt;/h2&gt;

&lt;p&gt;For people who maintain software libraries, one of the key questions we have to
ask is whether a given change will break users or not.  Breaking changes are
intrusive, because they force users to change their code.&lt;/p&gt;

&lt;p&gt;This principle is built into versioning schemes like
&lt;a href=&quot;https://semver.org/&quot;&gt;SemVer&lt;/a&gt;, which requires that any breaking API change be
accompanied by a major version bump.&lt;/p&gt;
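
&lt;p&gt;The SemVer rule can be written down in a few lines.  This is only a sketch: the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Version&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ChangeKind&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next_version&lt;/code&gt; names are hypothetical, and it assumes
a library that is already at 1.0.0 or later, since SemVer treats 0.y.z versions
as unstable:&lt;/p&gt;

```rust
/// A MAJOR.MINOR.PATCH version number.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

/// How a change relates to the public API.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ChangeKind {
    /// Existing users may have to change their code.
    Breaking,
    /// New public API, existing users unaffected.
    BackwardCompatibleAddition,
    /// No change to the public API at all.
    BugFix,
}

/// Computes the version that SemVer requires for the next release.
pub fn next_version(current: Version, change: ChangeKind) -> Version {
    match change {
        ChangeKind::Breaking => Version {
            major: current.major + 1,
            minor: 0,
            patch: 0,
        },
        ChangeKind::BackwardCompatibleAddition => Version {
            minor: current.minor + 1,
            patch: 0,
            ..current
        },
        ChangeKind::BugFix => Version {
            patch: current.patch + 1,
            ..current
        },
    }
}

fn main() {
    let current = Version { major: 1, minor: 4, patch: 2 };
    for change in [
        ChangeKind::Breaking,
        ChangeKind::BackwardCompatibleAddition,
        ChangeKind::BugFix,
    ] {
        let next = next_version(current, change);
        println!("{:?}: {}.{}.{}", change, next.major, next.minor, next.patch);
    }
}
```

&lt;p&gt;The hard part is not computing the bump but knowing which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ChangeKind&lt;/code&gt; a given
change falls under.&lt;/p&gt;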

&lt;p&gt;How can we quickly evaluate whether a given change might be breaking or not?
If we are using header files, we can look at whether a given change modifies a
header file or not.  If no header files are changing, we are guaranteed not to
be changing the API contract (the change could introduce a bug that &lt;em&gt;violates&lt;/em&gt;
the API contract, but that is a separate issue).  Conversely, if header files
are changing, there is a high likelihood that public APIs are being either
added or changed.&lt;sup id=&quot;fnref:5:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;As an illustration, let’s play a little game. Here are the Git diffstats for
four different changes in the Protobuf repository.  Two of them are breaking
changes and two are not.  Can you guess which is which?&lt;/p&gt;

&lt;pre&gt;
$ git diff --stat 63623a688c0f4329cfe4161bbaa2666f34c2be33^ 63623a688c0f4329cfe4161bbaa2666f34c2be33
 .../main/java/com/google/protobuf/Descriptors.java | 10 &lt;span style=&quot;color: green;&quot;&gt;+&lt;/span&gt;&lt;span style=&quot;color: red;&quot;&gt;--------&lt;/span&gt;
 .../com/google/protobuf/util/ProtoFileUtil.java    | 21 &lt;span style=&quot;color: green;&quot;&gt;+++++++++++++++++++&lt;/span&gt;
 .../google/protobuf/util/ProtoFileUtilTest.java    | 24 &lt;span style=&quot;color: green;&quot;&gt;++++++++++++++++++++++&lt;/span&gt;
 3 files changed, 46 insertions(+), 9 deletions(-)

$ git diff --stat e3cc31a12eaddcfaaa5a27c272e240b6cbd985c8^ e3cc31a12eaddcfaaa5a27c272e240b6cbd985c8
 .../com/google/protobuf/AbstractMessageLite.java   | 10 &lt;span style=&quot;color: green;&quot;&gt;+++++++&lt;/span&gt;&lt;span style=&quot;color: red;&quot;&gt;--&lt;/span&gt;
 .../com/google/protobuf/ProtobufArrayList.java     | 26 &lt;span style=&quot;color: green;&quot;&gt;++++++++++++++++++&lt;/span&gt;&lt;span style=&quot;color: red;&quot;&gt;----&lt;/span&gt;
 2 files changed, 30 insertions(+), 6 deletions(-)

$ git diff --stat e2eb0a19aa95497c8979d71031edbbab721f5f0a^ e2eb0a19aa95497c8979d71031edbbab721f5f0a
 src/google/protobuf/util/json_util.h | 3 &lt;span style=&quot;color: red;&quot;&gt;---&lt;/span&gt;
 1 file changed, 3 deletions(-)

$ git diff --stat f549fc3ccc4ea096fcbc66f74c763880cd26e451^ f549fc3ccc4ea096fcbc66f74c763880cd26e451
 src/google/protobuf/descriptor.cc | 14 &lt;span style=&quot;color: green;&quot;&gt;+++++&lt;/span&gt;&lt;span style=&quot;color: red;&quot;&gt;---------&lt;/span&gt;
 1 file changed, 5 insertions(+), 9 deletions(-)
&lt;/pre&gt;

&lt;p&gt;C++ has an unfair advantage in this game.  When a C++ change only touches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cc&lt;/code&gt;
files, we are guaranteed that the API contract is not changing, so we know that
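change #4 does not break API or ABI.  Change #3 touches only a header file,
which flags it as a likely API change; it is the second breaking change.  When a change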
touches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.java&lt;/code&gt; files, we get no hint about whether the change affects the API,
implementation, or both.  It turns out that change #1 is API-breaking and change
#2 merely changes internal implementation details, but the Git diffstats offer
no clue about this.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Rust has tools for checking for breaking changes: I found
&lt;a href=&quot;https://github.com/obi1kenobi/cargo-semver-checks&quot;&gt;cargo-semver-checks&lt;/a&gt;,
&lt;a href=&quot;https://github.com/cargo-public-api/cargo-public-api&quot;&gt;cargo-public-api&lt;/a&gt;, and
&lt;a href=&quot;https://github.com/rust-lang/rust-semverver&quot;&gt;rust-semverver&lt;/a&gt;.  These
tools can be set up to run in CI workflows, and will raise errors if a public
API was added or changed without a corresponding version bump.  A tool like
this can arguably solve the problem, without any need for header files.&lt;/p&gt;

&lt;p&gt;But there is a price to making the API checker a separate tool.  It means that
people need to opt in by manually setting up the tool in their workflow, which
few projects actually do.  A 2023 study found that &lt;a href=&quot;https://predr.ag/blog/semver-violations-are-common-better-tooling-is-the-answer/&quot;&gt;17.2% of Rust crates have
at least one semver
violation&lt;/a&gt;.
And this only counts the violations that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cargo-semver-checks&lt;/code&gt; is able to
detect; the tool is known to be incomplete and will miss some issues.&lt;/p&gt;

&lt;p&gt;The authors of the study argue that “This is a failure of tooling, not humans”,
and I agree.  It is not up to humans to be perfectly diligent about API breaks,
it is up to the tooling to surface these issues as conveniently as possible.
What better way to do this than to make interface definition a part of the
language itself, as code that can be viewed in your editor and in version
control?&lt;/p&gt;

&lt;p&gt;If you are changing a public API, you want to get this feedback as early as
possible.  The ideal time to get that feedback is in your source code editor,
the moment when you type the change.  The second-best time is when you run the
compiler.  The third-best time is in CI.  The fourth-best time is in code
review.  The fifth-best time is when you go to actually perform a release.  The
worst time is from a user, once you have broken their build.  Early feedback is
key, and header files provide early feedback by surfacing API/ABI changes as
file diffs at the source code editing stage.&lt;/p&gt;

&lt;p&gt;Header files also let you view API diffs and versions using a normal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git
diff&lt;/code&gt;, rather than having to invoke a special tool.  So it’s very cool that
tools like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cargo-semver-checks&lt;/code&gt; exist, but I think header files could take
it to the next level by versioning the API declarations themselves.&lt;/p&gt;

&lt;h2 id=&quot;precompiled-libraries&quot;&gt;Precompiled Libraries&lt;/h2&gt;

&lt;p&gt;Suppose you are shipping precompiled libraries (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.a&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.dll&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.dylib&lt;/code&gt;
etc).  What can you ship along with the compiled library that will describe
the APIs that are contained in that library?&lt;/p&gt;

&lt;p&gt;Header files have always been a natural fit for this, because they contain the
API definition without any associated implementation.  For example, let’s look
at the Ubuntu package for zlib (filtering out examples and manpages):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ dpkg -L zlib1g-dev | grep -v share
/.
/usr
/usr/include
/usr/include/zconf.h
/usr/include/zlib.h
/usr/lib
/usr/lib/aarch64-linux-gnu
/usr/lib/aarch64-linux-gnu/libz.a
/usr/lib/aarch64-linux-gnu/pkgconfig
/usr/lib/aarch64-linux-gnu/pkgconfig/zlib.pc
/usr/lib/aarch64-linux-gnu/libz.so
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The header and the precompiled library are a natural pair: the header describes
the API, while the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.a&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt; implements it.&lt;/p&gt;

&lt;p&gt;How does one distribute precompiled libraries in a language without headers?
Looking at the few languages we opened the article with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Java&lt;/strong&gt;: Compiled &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.jar&lt;/code&gt; files do not contain a human-readable description
of the interface.  You have to go looking for Javadoc somewhere, and hope it
matches the version of the library you have installed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rust&lt;/strong&gt;: Rust does not have a stable ABI, and therefore does not support
distributing precompiled libraries.  For the C ABI, you would probably just
distribute a C header file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is another use case for header files, which can represent an API/ABI
without including any implementation.&lt;/p&gt;

&lt;h1 id=&quot;honorable-mentions&quot;&gt;Honorable Mentions&lt;/h1&gt;

&lt;p&gt;While no modern language has fully embraced header files, there are a few
that deserve partial credit.&lt;/p&gt;

&lt;h2 id=&quot;python--ruby-interface-files&quot;&gt;Python &amp;amp; Ruby: Interface Files&lt;/h2&gt;

&lt;p&gt;Python and Ruby have both evolved some static typing infrastructure that
augments the native dynamic typing of the language.  As part of this
trend, both languages allow you to declare APIs in a separate file
from the implementation.&lt;/p&gt;

&lt;p&gt;In Python, these are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.pyi&lt;/code&gt; files:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Python uses .pyi files for static API declarations.
# These are called &quot;stub files&quot; and are standardized in PEP 484:
#    https://peps.python.org/pep-0484/#stub-files
&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;square_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In Ruby, these are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.rbs&lt;/code&gt; files:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Ruby uses .rbs files for static API declarations.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# These are known as type signatures, and are standardized in:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#    https://github.com/ruby/rbs&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;square_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;arr: &lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This design ticks a lot of the boxes I’m arguing for. But what it lacks is
completeness: there is no guarantee that all public APIs will be listed in
these files. An API can be called even if it is in the source file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.py&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.rb&lt;/code&gt;) only.&lt;/p&gt;

&lt;p&gt;This means we do not get the guarantee we want, which is that changes affecting
source files only are perfectly API-preserving.&lt;/p&gt;

&lt;h2 id=&quot;kotlin-expectactual&quot;&gt;Kotlin: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expect&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;actual&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;Kotlin is aimed at cross-platform development, and it has a language feature
designed to let you specify an API that will have different implementations
on different platforms.&lt;/p&gt;

&lt;div class=&quot;language-kotlin highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Kotlin Common API can contain &quot;expect&quot; declarations, which specify an API&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// without providing any implementation.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;expect&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-kotlin highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareNumber&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Implementation for a given platform.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;actual&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareNumber&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This bears some similarity to my preferred solution.  But this mechanism does
not attempt to separate interface from implementation for all APIs, only for
APIs that have different implementations on different platforms.  APIs that
define a single implementation for all platforms do not use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expect&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;actual&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-kotlin highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Kotlin Common code can define function implementations without using&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// `expect` or `actual`.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareNumber&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;squareNumber&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:0&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is of course assuming that people actually adopt C++20 modules.
   I do not have much experience with modules myself, so I can’t guess
   how this will go. &lt;a href=&quot;#fnref:0&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If we are tracking ABI, the same is also true of ABI stability: any commit
  that changes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foo.abi.rs&lt;/code&gt; file would imply an ABI break. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:5:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In reality, many non-breaking changes in C++ touch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.h&lt;/code&gt; files, due to the
  shortcomings of C++ headers I described earlier.  But the ideal I am arguing
  for does not suffer this problem, since I argue that header files should &lt;em&gt;only&lt;/em&gt;
  contain public APIs. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 27 Jan 2025 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2025/01/27/an-ode-to-header-files.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2025/01/27/an-ode-to-header-files.html</guid>
        
        
      </item>
    
    
      
    
      <item>
        <title>Arenas and Rust</title>
        <description>&lt;p&gt;For a while I’ve been wondering what it would be like to use
&lt;a href=&quot;https://en.wikipedia.org/wiki/Region-based_memory_management&quot;&gt;arenas&lt;/a&gt; in Rust.
In C and C++ I have been turning to arenas more and more as a fast alternative
to heap allocation.  If you have a bunch of objects that share a common
lifetime, arenas offer cheaper allocation and &lt;em&gt;much&lt;/em&gt; cheaper deallocation than
the heap.  The more I use this pattern, the more it feels downright wasteful to
use heap allocation when an arena would do.&lt;/p&gt;

&lt;p&gt;I’ve been wanting to know how arenas would play with Rust’s lifetime semantics.
An arena must always outlive all the objects allocated from that arena.  Rust’s
lifetime system seems ideal for expressing a condition like this.  I was
curious to see how this plays out in practice.&lt;/p&gt;

&lt;h1 id=&quot;arena-apis&quot;&gt;Arena APIs&lt;/h1&gt;

&lt;h2 id=&quot;c-and-c&quot;&gt;C and C++&lt;/h2&gt;

&lt;p&gt;First I will present the arena APIs I am familiar with in C and C++.  Here is a
simplified version of the C++ &lt;a href=&quot;https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.arena#Arena&quot;&gt;Arena API for
protobuf&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// The C++ Arena is thread-safe (Functions taking Arena* may be called&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// concurrently, except the destructor).&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Arena&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;nl&quot;&gt;public:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Frees all objects in the arena.&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Creates an object on the arena. The object is freed when the arena is&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// destroyed.  The destructor will be run unless it is trivial.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Create&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Create&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Arena&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Create&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Use i1, i2, i3...&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// When the arena is destroyed, the individual objects are freed.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is a similar but somewhat different example in C, from the &lt;a href=&quot;https://github.com/protocolbuffers/upb/blob/main/upb/upb.h#L155-L239&quot;&gt;upb protobuf library&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// The C arena is thread-compatible, but not thread-safe (functions that take&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// upb_arena* may not be called concurrently).&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;upb_arena&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upb_arena_new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Frees all memory in the arena.&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;upb_arena_free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upb_arena&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Allocates some memory from the arena.&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upb_arena_malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upb_arena&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;upb_arena&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_arena_new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_arena_malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_arena_malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_arena_malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Use i1, i2, i3...&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;upb_arena_free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arena&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// All the individual objects are freed.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
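
&lt;p&gt;To make the cost model concrete, here is a minimal sketch of how an arena with this shape could be implemented.  It is not upb’s actual implementation: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_arena&lt;/code&gt; names, the single fixed-size backing block, and the 16-byte alignment are all assumptions made for illustration.&lt;/p&gt;

```c
#include "assert.h"
#include "stddef.h"
#include "stdlib.h"

/* Hypothetical toy arena for illustration; not upb's implementation. */
enum { TOY_ALIGN = 16 }; /* assumed alignment for every allocation */

typedef struct {
  char *base;      /* start of the single backing block */
  size_t used;     /* bytes handed out so far */
  size_t capacity; /* total bytes in the backing block */
} toy_arena;

toy_arena *toy_arena_new(size_t capacity) {
  toy_arena *a = malloc(sizeof(*a));
  if (a == NULL) return NULL;
  a->base = malloc(capacity);
  if (a->base == NULL) {
    free(a);
    return NULL;
  }
  a->used = 0;
  a->capacity = capacity;
  return a;
}

/* Allocation is a bounds check plus a pointer bump; there is no per-object
 * bookkeeping.  Returns NULL when the block is exhausted. */
void *toy_arena_malloc(toy_arena *a, size_t size) {
  size_t rounded;
  void *ret;
  assert(a->capacity >= a->used);
  if (size > a->capacity) return NULL; /* cannot fit in the block at all */
  rounded = (size + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN;
  if (rounded > a->capacity - a->used) return NULL;
  ret = a->base + a->used;
  a->used += rounded;
  return ret;
}

/* Deallocation is one call for the whole arena, however many objects it
 * holds.  Every pointer returned by toy_arena_malloc() dangles afterwards. */
void toy_arena_free(toy_arena *a) {
  free(a->base);
  free(a);
}
```

&lt;p&gt;In this sketch, freeing the arena costs the same whether it holds three objects or three thousand, which is where the “much cheaper deallocation” comes from.  A production arena would typically chain additional blocks instead of failing when the first one fills up.&lt;/p&gt;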

&lt;p&gt;In both the C and C++ versions, it is the user’s responsibility to make sure
that arena-allocated objects are not used past the lifetime of the arena.  C
and C++ are not memory safe languages, and they offer no lifetime checking that
would help us avoid dangling pointers here.&lt;/p&gt;

&lt;p&gt;Even techniques like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique_ptr&lt;/code&gt; can’t help us here since the objects cannot
be freed independently.  We could get dynamic checking by using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique_ptr&lt;/code&gt;
with a custom deleter that decrements a refcount on the arena, and then panic if
an arena is destroyed with a non-zero refcount.  This would be better than
nothing, but it’s still a runtime check (not compile time).  In any case
neither arena uses this technique at the moment, so respecting the arena’s
lifetime is entirely the responsibility of the user.&lt;/p&gt;
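
&lt;p&gt;For illustration, the dynamic check described above could look roughly like the following.  It is sketched in C rather than with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique_ptr&lt;/code&gt; so that it sits next to the upb-style API; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checked_arena&lt;/code&gt; type and its functions are hypothetical, and neither protobuf arena provides them.&lt;/p&gt;

```c
#include "assert.h"
#include "stdio.h"
#include "stdlib.h"

/* Hypothetical wrapper that counts outstanding handles to arena objects. */
typedef struct {
  size_t live_refs; /* handles created but not yet destroyed */
  /* ... the underlying arena's allocation state would live here ... */
} checked_arena;

checked_arena *checked_arena_new(void) {
  checked_arena *a = malloc(sizeof(*a));
  if (a != NULL) a->live_refs = 0;
  return a;
}

/* Called whenever a handle to an arena-allocated object is created. */
void checked_arena_ref(checked_arena *a) { a->live_refs++; }

/* Plays the role of the custom deleter: the object's memory stays in the
 * arena, and only the count changes. */
void checked_arena_unref(checked_arena *a) {
  assert(a->live_refs > 0);
  a->live_refs--;
}

/* Returns nonzero if freeing the arena now would leave dangling handles. */
int checked_arena_has_live_refs(const checked_arena *a) {
  return a->live_refs != 0;
}

/* Aborts instead of silently leaving dangling pointers behind. */
void checked_arena_free(checked_arena *a) {
  if (checked_arena_has_live_refs(a)) {
    fprintf(stderr, "arena freed with %zu live references\n", a->live_refs);
    abort();
  }
  free(a);
}
```

&lt;p&gt;The check only fires when the offending code path actually runs, which is exactly the limitation of a runtime check compared to a compile-time one.&lt;/p&gt;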

&lt;p&gt;The C and C++ versions above have different thread-safety properties: the C++
arena is thread-safe (concurrent allocations are allowed), while the C arena is
thread-compatible (concurrent allocations are not allowed, because they access
a mutable pointer).  This means the C++ API is paying an efficiency/complexity
overhead to allow concurrent mutable access.&lt;/p&gt;

&lt;h2 id=&quot;rust&quot;&gt;Rust&lt;/h2&gt;

&lt;p&gt;The most established Rust Arena library I could find was called
&lt;a href=&quot;https://github.com/fitzgen/bumpalo&quot;&gt;Bumpalo&lt;/a&gt;.  Its API differs slightly
from both of these:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;cd&quot;&gt;/// Allocate an object in this `Bump` and return an exclusive reference to&lt;/span&gt;
    &lt;span class=&quot;cd&quot;&gt;/// it.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alloc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Bump cannot be shared between threads!&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;Sync&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Drop&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cd&quot;&gt;/// Frees all the objects in the arena, without calling their Drop.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;drop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.alloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.alloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.alloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// When bump is dropped, the integers are freed.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump::alloc()&lt;/code&gt; function returns a mutable reference &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut T&lt;/code&gt;.  Rust’s
lifetime system will statically ensure that the reference doesn’t outlive the
arena.  We will explore some consequences of this below.&lt;/p&gt;

&lt;p&gt;From a thread-safety perspective, Rust’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; is distinct from both the
C and C++ arenas above.  Like the C arena, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; does not perform any internal
synchronization, so it avoids the overheads of the C++ thread-safe arena.
But unlike the C arena, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; allows allocation through an immutable reference;
in other words it uses “interior mutability.”  To ensure that it is not used
concurrently, it expressly forbids sharing between threads by being &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!Sync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If we use the analysis in &lt;a href=&quot;https://blog.reverberate.org/2021/12/18/thread-safety-cpp-rust.html&quot;&gt;my previous article&lt;/a&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; lives in the upper right quadrant
with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt; (indeed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; internally uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt;, which is what prevents it
from being &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;).&lt;/p&gt;

&lt;h1 id=&quot;why-is-rusts-bump-not-sync&quot;&gt;Why is Rust’s “Bump” not “Sync”?&lt;/h1&gt;

&lt;p&gt;An interesting question is why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; in Rust chooses not to implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;.
If we want to avoid synchronization overhead, our two main choices are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Avoid interior mutability (use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut self&lt;/code&gt;), and implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Use interior mutability (use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;self&lt;/code&gt;), and do not implement
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt; (likely using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt; internally).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(1) is safe because it only mutates through a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut&lt;/code&gt; reference, which is
guaranteed to be unique and therefore cannot race with anything.  (2) is safe
because while mutation can happen from any reference, all references are bound
to a single thread, so this cannot create a data race.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; chooses (2), but we could imagine an alternate world where it had
chosen (1) instead:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SyncBump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Alternate version of Bump that is Sync, but still avoids synchronization&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// overhead:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SyncBump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SyncBump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;cd&quot;&gt;/// NOTE: takes a &amp;amp;mut self instead of &amp;amp;self.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alloc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* ... */&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately this will not actually work.  Due to a &lt;a href=&quot;https://doc.rust-lang.org/nomicon/lifetime-mismatch.html&quot;&gt;limitation in Rust’s
syntax&lt;/a&gt;, any call to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alloc()&lt;/code&gt; will mutably borrow &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SyncBump&lt;/code&gt; for the lifetime of the returned
object.  The net effect is that &lt;a href=&quot;https://godbolt.org/z/YYvv8cTzf&quot;&gt;you can only allocate a single object from the
arena at a time&lt;/a&gt;, a limitation so severe that
it makes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SyncBump&lt;/code&gt; useless.&lt;/p&gt;

&lt;p&gt;This means that, for the time being, interior mutability (a la
&lt;a href=&quot;https://github.com/fitzgen/bumpalo&quot;&gt;Bumpalo&lt;/a&gt;) is the only feasible way to
implement a non-thread-safe arena in Rust.&lt;/p&gt;

&lt;h1 id=&quot;ergonomics-of-arenas-in-rust&quot;&gt;Ergonomics of Arenas in Rust&lt;/h1&gt;

&lt;p&gt;What’s it like to use Arenas in Rust?  Let’s take a simple struct that doesn’t
use arenas:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
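    &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;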
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ABC&quot;&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.to_string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we want to put the string/vec data on an arena instead, it looks like this:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;Vec&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArenaFoo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArenaFoo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_str_in&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ABC&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new_in&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We have to use different, arena-aware variations of containers like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;String&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt;. This is intended to be a temporary situation which will eventually
be remedied by &lt;a href=&quot;https://github.com/rust-lang/rust/issues/42774&quot;&gt;adding allocator support to the standard
containers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;More interesting and more fundamental is that our struct now has a lifetime
parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;a&lt;/code&gt; that constrains the struct in a way that the original struct was
not constrained.  This makes sense, as it expresses the fact that our struct
&lt;em&gt;must&lt;/em&gt; be outlived by the arena &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bump&lt;/code&gt;.  Rust will require that we propagate
this lifetime parameter to any other struct that contains &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArenaFoo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In my experience trying this out on a hobby project PDF parser, this all worked
out reasonably well.  I had to put lifetime parameters on most of my types, but
it didn’t really cause a problem in my testing.&lt;/p&gt;

&lt;p&gt;Having the compiler automatically check the lifetimes was very satisfying; it
was a level of static safety checking that I have never gotten to experience in
C or C++.&lt;/p&gt;

&lt;h2 id=&quot;hiding-the-lifetime-parameter&quot;&gt;Hiding the lifetime parameter&lt;/h2&gt;

&lt;p&gt;Things changed once I tried to expose my parser to Python through the
&lt;a href=&quot;https://pyo3.rs/&quot;&gt;PyO3&lt;/a&gt; Python bindings for Rust.  A Rust struct exposed as a
Python class through
&lt;a href=&quot;https://pyo3.rs/master/class.html#defining-a-new-class&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#[pyclass]&lt;/code&gt;&lt;/a&gt; &lt;a href=&quot;https://github.com/PyO3/pyo3/issues/1088&quot;&gt;cannot
have generic parameters&lt;/a&gt;, and that includes lifetime parameters.  Here is my
attempt at a basic “Hello, World” of exposing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArenaFoo&lt;/code&gt; to Python:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyo3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;prelude&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;Vec&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[pyclass]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArenaFoo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[pymodule]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;arena_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_py&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Python&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PyModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PyResult&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;Ok&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(())&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The compiler rejects this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;error: #[pyclass] cannot have generic parameters: arena_test
 --&amp;gt; src/lib.rs:9:11
  |
9 | struct ArenaFoo&amp;lt;&apos;a&amp;gt; {
  |           ^
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This naturally led me to brainstorm how I could put my &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArenaFoo&amp;lt;&apos;a&amp;gt;&lt;/code&gt; inside of
a struct with no lifetime constraints.  This question is not specific to my
problem of trying to bind to Python; it is a more general question: is it
possible for a type to use arenas internally, in a way that is invisible to the
user?  Can we hide the lifetime parameter, so that our usage of arenas can be a
purely internal concern that does not impose any extra constraints on users of
the type?&lt;/p&gt;

&lt;p&gt;My first thought was to bundle the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; inside of the same struct, thereby
guaranteeing that it will have the same lifetime:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nd&quot;&gt;#[pyclass]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArenaFoo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;By itself this does nothing to convince the compiler that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;a&lt;/code&gt; is
unnecessary on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ArenaFoo&amp;lt;&apos;a&amp;gt;&lt;/code&gt;: the fields still name a lifetime that the
struct no longer declares, so this version does not compile.&lt;/p&gt;

&lt;p&gt;I asked about this on Rust’s Discord, and I was informed that this pattern is
known as the “self-referential struct”, and it is notoriously difficult to make
sound.  The problem is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;str&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vec&lt;/code&gt; would have references to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bump&lt;/code&gt; in
the steady state.  Rust’s normal rules would allow anyone with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut ArenaFoo&lt;/code&gt;
to get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut Bump&lt;/code&gt;, but this would be unsound if other members of the struct
have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;Bump&lt;/code&gt; references.&lt;/p&gt;

&lt;p&gt;I was referred to the &lt;a href=&quot;https://docs.rs/ouroboros/latest/ouroboros/&quot;&gt;ouroboros&lt;/a&gt;
and &lt;a href=&quot;https://kimundi.github.io/owning-ref-rs/owning_ref/index.html&quot;&gt;owning_ref&lt;/a&gt;
crates, which can help users construct self-referencing structs in a reasonably
sound way (I was told that both have soundness holes, but that these are
apparently fixable).&lt;/p&gt;

&lt;h2 id=&quot;using-ouroboros-for-self-referencing-structs&quot;&gt;Using ouroboros for self-referencing structs&lt;/h2&gt;

&lt;p&gt;With ouroboros, things were looking promising:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyo3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;prelude&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;Vec&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bumpalo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ouroboros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self_referencing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[pyclass]&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;#[self_referencing]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArenaFoo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Bump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;#[covariant]&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;#[borrows(bump)]&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;#[covariant]&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;#[borrows(bump)]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BumpVec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&apos;this&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nd&quot;&gt;#[pymodule]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;arena_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_py&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Python&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PyModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PyResult&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;Ok&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(())&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, the Rust compiler reported errors:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;error[E0277]: `NonNull&amp;lt;i32&amp;gt;` cannot be sent between threads safely
   --&amp;gt; src/lib.rs:8:1
    |
8   | #[pyclass]
    | ^^^^^^^^^^ `NonNull&amp;lt;i32&amp;gt;` cannot be sent between threads safely
    |
   ::: /Users/haberman/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.14.5/src/class/impl_.rs:328:33
    |
328 | pub struct ThreadCheckerStub&amp;lt;T: Send&amp;gt;(PhantomData&amp;lt;T&amp;gt;);
    |                                 ---- required by this bound in `ThreadCheckerStub`
    |
    = help: within `SelfReferencingFoo`, the trait `Send` is not implemented for `NonNull&amp;lt;i32&amp;gt;`
    = note: required because it appears within the type `bumpalo::collections::raw_vec::RawVec&amp;lt;&apos;static, i32&amp;gt;`
    = note: required because it appears within the type `bumpalo::collections::Vec&amp;lt;&apos;static, i32&amp;gt;`
note: required because it appears within the type `SelfReferencingFoo`
   --&amp;gt; src/lib.rs:10:8
    |
10  | struct SelfReferencingFoo {
    |        ^^^^^^^^^^^^^^^^^^
    = note: this error originates in the attribute macro `pyclass` (in Nightly builds, run with -Z macro-backtrace for more info)


error[E0277]: `Cell&amp;lt;NonNull&amp;lt;bumpalo::ChunkFooter&amp;gt;&amp;gt;` cannot be shared between threads safely
   --&amp;gt; src/lib.rs:8:1
    |
8   | #[pyclass]
    | ^^^^^^^^^^ `Cell&amp;lt;NonNull&amp;lt;bumpalo::ChunkFooter&amp;gt;&amp;gt;` cannot be shared between threads safely
    |
   ::: /Users/haberman/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.14.5/src/class/impl_.rs:328:33
    |
328 | pub struct ThreadCheckerStub&amp;lt;T: Send&amp;gt;(PhantomData&amp;lt;T&amp;gt;);
    |                                 ---- required by this bound in `ThreadCheckerStub`
    |
    = help: within `bumpalo::Bump`, the trait `Sync` is not implemented for `Cell&amp;lt;NonNull&amp;lt;bumpalo::ChunkFooter&amp;gt;&amp;gt;`
    = note: required because it appears within the type `bumpalo::Bump`
    = note: required because of the requirements on the impl of `Send` for `&amp;amp;&apos;static bumpalo::Bump`
    = note: required because it appears within the type `bumpalo::collections::raw_vec::RawVec&amp;lt;&apos;static, i32&amp;gt;`
    = note: required because it appears within the type `bumpalo::collections::Vec&amp;lt;&apos;static, i32&amp;gt;`
note: required because it appears within the type `SelfReferencingFoo`
   --&amp;gt; src/lib.rs:10:8
    |
10  | struct SelfReferencingFoo {
    |        ^^^^^^^^^^^^^^^^^^
    = note: this error originates in the attribute macro `pyclass` (in Nightly builds, run with -Z macro-backtrace for more info)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

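&lt;p&gt;The main problem here is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; is not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;.  A shared reference
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;T&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt; only when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;, so the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;Bump&lt;/code&gt; stored inside
each arena-aware container unfortunately
prevents &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SelfReferencingFoo&lt;/code&gt; from being &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt;, which is a significant limitation.&lt;/p&gt;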

&lt;p&gt;For the PyO3 case, while PyO3 does allow you to specify
&lt;a href=&quot;https://pyo3.rs/master/class.html#customizing-the-class&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#[pyclass(unsendable)]&lt;/code&gt;&lt;/a&gt;
to indicate a class that doesn’t support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt;, this will cause a runtime
panic if a Python user accesses the object from a different Python thread
than the one that created it.  This might be ok for a library that is just
for experimentation, but it would not be an acceptable limitation for a
production-quality Python library.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Rust’s lifetime system offers the very attractive possibility
of having the type system automatically check lifetimes of arena-allocated
objects.  And indeed, this lifetime checking worked great in the scenarios
where I was able to use it.&lt;/p&gt;

&lt;p&gt;I was able to use ouroboros to make my usage of arenas an internal concern of
my type.  A self-referencing struct can allow the arena to be packaged together
with the references, such that we do not need to impose a lifetime parameter on
users of the struct.&lt;/p&gt;

&lt;p&gt;I was unfortunately not able to find a satisfactory solution to using arenas in
Rust while supporting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt;, which hampered my ability to wrap my library in
Python.  The combination of (1) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!Sync&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; and (2) arena-aware
containers that store &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;Bump&lt;/code&gt; resulted in types that are not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt;, which for
my case was too big a limitation to move forward with.&lt;/p&gt;

&lt;p&gt;This led me to conclude that, for the time being, we will need to avoid storing
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;Bump&lt;/code&gt; in any structure if we need it to support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The only other idea that came to mind was to make &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bump&lt;/code&gt; truly thread-safe.
Then it could support allocation through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;self&lt;/code&gt; but also support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;.
Protobuf’s C++ thread-safe allocator has been optimized extensively to keep the
synchronization overhead low: a thread-local is used as a cache to quickly
resolve to a structure specific to that arena and thread, from which
allocations can happen without synchronization.  There is definitely some
significant complexity and a bit of overhead to this, but it is one option that
could potentially solve the issue.&lt;/p&gt;

</description>
        <pubDate>Sun, 19 Dec 2021 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2021/12/19/arenas-and-rust.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2021/12/19/arenas-and-rust.html</guid>
        
        
      </item>
    
    
      <item>
        <title>Thread Safety in C++ and Rust</title>
        <description>&lt;p&gt;Lately I’ve been experimenting with Rust, and I want to report some of what
I’ve learned about thread-safety.  I am an enthusiastic dabbler in Rust: I
spend most of my time in C and C++, but I’m always looking for an excuse to
learn more about Rust’s approach to the techniques I use every day in C and
C++.&lt;/p&gt;

&lt;p&gt;When studying Rust’s threading model, I came to see some correspondence between
C++ and Rust terminology that I had not seen published previously.  Here are my
findings, which hopefully can help people with C++ background understand Rust
(or vice-versa).&lt;/p&gt;

&lt;h1 id=&quot;c&quot;&gt;C++&lt;/h1&gt;

&lt;p&gt;The C++ standard does not define the term “thread-safe”, but it is &lt;a href=&quot;https://abseil.io/blog/20180531-regular-types#data-races-and-thread-safety-properties&quot;&gt;common
practice now within the C++
community&lt;/a&gt;
to define it in the following way:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;thread-safe&lt;/strong&gt;: A type is &lt;em&gt;thread-safe&lt;/em&gt; if it is safe to invoke &lt;strong&gt;any&lt;/strong&gt; of
its methods concurrently.  To provide this guarantee, a type must generally
take some special measures to avoid data races, e.g. using a mutex or atomic
operations internally.  This generally comes with performance and/or
complexity costs, so most types will not be thread-safe.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;thread-compatible&lt;/strong&gt;: A type is &lt;em&gt;thread-compatible&lt;/em&gt; if it is safe to invoke
&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const&lt;/code&gt;&lt;/strong&gt; methods concurrently.  Any concurrent call to a non-&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const&lt;/code&gt;
method must be synchronized by the caller.  Most types in C++ are
thread-compatible, as this guarantee mostly comes for free: it happens
naturally for any type that is const-correct (i.e. avoids &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mutable&lt;/code&gt; members or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const_cast&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thread-compatible types compose nicely and avoid synchronization overheads.
Suppose you have 10 thread-compatible objects that you want to access
concurrently together.  You can wrap a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Mutex&lt;/code&gt; around all 10 and pay only a
single synchronization cost.  If you have 10 thread-safe objects, you pay 10
separate synchronization costs, as each of them performs its own internal
synchronization.  If you are using an object in only one thread, you may not
need synchronization at all, but the thread-safe type won’t know this and will
pay the cost regardless.  For all of these reasons, thread-compatible types are
generally preferred.&lt;/p&gt;

&lt;p&gt;Here is an example of a thread-safe type in C++.  It achieves thread-safety
by using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// C++ Usage of thread-safe type.&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;atomic&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;thread&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;vector&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;iostream&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// A very simple thread-safe type.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
 &lt;span class=&quot;nl&quot;&gt;public:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Increment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fetch_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GetCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

 &lt;span class=&quot;nl&quot;&gt;private:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;atomic_int32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Spawn `n` threads that all share a single counter.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;push_back&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// Unsynchronized call of a non-const method.&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// Only safe because the type is thread-safe.&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Increment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}));&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// This will ultimately print `n`.&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;rust&quot;&gt;Rust&lt;/h1&gt;

&lt;p&gt;Rust’s model for thread-safety has some notable differences.  Rust’s
thread-safety story centers around two traits:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt; trait indicates that a type can be safely shared between threads.&lt;/li&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Send&lt;/code&gt; trait indicates that a type can be safely moved between threads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt; trait ends up mapping closely to the C++ concept of
&lt;em&gt;thread-compatible&lt;/em&gt;.  It indicates that concurrent access of a type is safe, as
long as neither of the concurrent operations operates on a mutable reference.
Just as most types in C++ are thread-compatible, most types in Rust are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;.&lt;/p&gt;
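
&lt;p&gt;As a minimal sketch of what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt; permits (an illustration added here, which
assumes Rust 1.63 or later for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::thread::scope&lt;/code&gt;; the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; is
arbitrary), two threads can read the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt; through shared references,
while mutation has to wait until a unique reference is available again:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;vec!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Concurrent reads through shared references are fine: Vec is Sync.&lt;/span&gt;
    &lt;span class=&quot;nn&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;scope&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.spawn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(||&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()));&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.spawn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(||&lt;/span&gt; &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]));&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Mutation needs a unique reference, so it waits until the scope has ended.&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.push&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;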

&lt;p&gt;However, Rust and C++ differ far more when we talk about thread-safe types.
Rust forbids shared mutable access at the language level.  That means that the
C++ way of modeling thread-safety won’t work at all in Rust.  Even if we tried
to make a Rust type that offered the C++ thread-safety guarantee, safe Rust
code would never be able to take advantage of this guarantee, because the code
would fail to compile.&lt;/p&gt;

&lt;p&gt;For example, let’s try to port the C++ code above to Rust:&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;sync&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;atomic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AtomicI32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Ordering&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AtomicI32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;increment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.count&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.fetch_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Ordering&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SeqCst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;AtomicI32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Vec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.push&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;spawn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// Rust won&apos;t allow this.  We are attempting to mutably borrow&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// the same value multiple times.&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.increment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}));&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.count&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;Ordering&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SeqCst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This fails to compile with:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;error[E0499]: cannot borrow `counter` as mutable more than once at a time
  --&amp;gt; &amp;lt;source&amp;gt;:18:37
   |
18 |           threads.push(thread::spawn( || {
   |                        -              ^^ `counter` was mutably borrowed here in the previous iteration of the loop
   |  ______________________|
   | |
19 | |             // Rust won&apos;t allow this.  We are attempting to mutably borrow
20 | |             // the same value multiple times.
21 | |             counter.increment();
   | |             ------- borrows occur due to use of `counter` in closure
22 | |         }));
   | |__________- argument requires that `counter` is borrowed for `&apos;static`
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because Rust fundamentally allows only a single mutable reference to any given
object, we have to express the C++ concept of thread-safety in a different way.&lt;/p&gt;

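&lt;p&gt;The Rust answer to thread-safety is to allow mutation on an immutable
reference.  Rust calls this “interior mutability.”  Changing
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;increment()&lt;/code&gt; to take &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;self&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut self&lt;/code&gt; resolves
the borrow error above:&lt;/p&gt;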

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;sync&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;atomic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AtomicI32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Ordering&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AtomicI32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;impl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// increment() uses &quot;interior mutability&quot;: it accepts an immutable&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// reference, but ultimately mutates the value.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;increment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.count&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.fetch_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Ordering&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SeqCst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ThreadSafeCounter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;AtomicI32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Scoped threads (Rust 1.63+) may borrow `counter`; thread::spawn()&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// requires a &apos;static closure.  The scope joins every thread on exit.&lt;/span&gt;
    &lt;span class=&quot;nn&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;scope&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.spawn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.increment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;.count&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;Ordering&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SeqCst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a C++ programmer, interior mutability strikes me as a bit of a fib: an
operation that is in fact mutable, both logically and physically, is allowed on
an immutable reference.  This is very similar to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mutable&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const_cast&lt;/code&gt; in
C++, which are both frowned on.&lt;/p&gt;

&lt;p&gt;I found a nice explanation of the Rust perspective &lt;a href=&quot;https://stackoverflow.com/a/63490856/77070&quot;&gt;in this Stack Overflow
answer&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In a way, Rust’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mut&lt;/code&gt; keyword actually has two meanings. In a pattern it
means “mutable” and in a reference type it means “exclusive”. The difference
between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;self&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut self&lt;/code&gt; is not really whether self can be mutated or
not, but whether it can be &lt;em&gt;aliased&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This helps explain the rationale behind interior mutability.  When applied to a
reference, “immutable” in Rust doesn’t really mean “immutable”, it means
“non-exclusive.”  This point is covered in more depth in another article,
&lt;a href=&quot;https://docs.rs/dtolnay/0.0.9/dtolnay/macro._02__reference_types.html&quot;&gt;Accurate mental model for Rust’s reference
types&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There was even &lt;a href=&quot;http://smallcultfollowing.com/babysteps/blog/2014/05/13/focusing-on-ownership/&quot;&gt;a proposal several years
back&lt;/a&gt;
to rename &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;mut&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;my&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;only&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;uniq&lt;/code&gt;, to emphasize that the key
property of such references is not that they are mutable, but that they are
&lt;em&gt;unique&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A type we would consider thread-safe in C++ will need to use interior
mutability in Rust and implement the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt; trait. It will need to allow
mutating operations to be performed through an immutable reference.&lt;/p&gt;

&lt;p&gt;This difference is reflected in the API of the atomics we were using
above:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;In C++,
&lt;a href=&quot;https://en.cppreference.com/w/cpp/atomic/atomic/fetch_add&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic_int32_t::fetch_add()&lt;/code&gt;&lt;/a&gt;
is a non-&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const&lt;/code&gt; operation.  This makes sense, as the operation does in fact
mutate the atomic. Callers have to read the documentation to know that
concurrent calls to this non-&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;const&lt;/code&gt; method are safe.&lt;/li&gt;
  &lt;li&gt;In Rust,
&lt;a href=&quot;https://doc.rust-lang.org/std/sync/atomic/struct.AtomicI32.html#method.fetch_add&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::sync::atomic::AtomicI32::fetch_add()&lt;/code&gt;&lt;/a&gt;
is an immutable operation (takes a non-&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mut&lt;/code&gt; reference).  This is the
interior mutability “fib” (as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_add()&lt;/code&gt; will in fact mutate the atomic),
but it has the benefit of expressing the type’s thread-safety guarantee
within the type system, which allows the compiler to automatically check it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust has the obvious advantage of having thread-safety modeled within the type
system, and checked by the compiler.  Rust even allows a single type to have
both thread-safe and non-thread-safe methods.  An example is
&lt;a href=&quot;https://doc.rust-lang.org/std/sync/struct.Mutex.html&quot;&gt;std::sync::Mutex&lt;/a&gt;, which
provides both
&lt;a href=&quot;https://doc.rust-lang.org/std/sync/struct.Mutex.html#method.lock&quot;&gt;Mutex::lock(&amp;amp;self)&lt;/a&gt;,
a thread-safe method that locks the mutex, and &lt;a href=&quot;https://doc.rust-lang.org/std/sync/struct.Mutex.html#method.get_mut&quot;&gt;Mutex::get_mut(&amp;amp;mut
self)&lt;/a&gt;,
which allows access to the mutex’s data without any synchronization costs.  If
the caller holds a unique reference, synchronization is unnecessary, and Rust
lets us avoid this overhead.  This is all modeled within the type system and
checked automatically.&lt;/p&gt;
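
&lt;p&gt;A small sketch of that difference, assuming Rust 1.63 or later for
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::thread::scope&lt;/code&gt; (the variable name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;total&lt;/code&gt; and the thread count are
purely illustrative):&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;sync&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Mutex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Mutex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Shared access: each thread sees only a shared reference, so it must lock.&lt;/span&gt;
    &lt;span class=&quot;nn&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;scope&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.spawn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Exclusive access: a unique reference proves that no other thread can&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// touch the data, so get_mut() skips the lock entirely.&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.get_mut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.into_inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The four increments inside the scope each take the lock; the fifth goes
through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_mut()&lt;/code&gt; and does not, so the program prints 5.&lt;/p&gt;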

&lt;h1 id=&quot;mapping-between-c-and-rust-terminology&quot;&gt;Mapping between C++ and Rust terminology&lt;/h1&gt;

&lt;p&gt;The analysis above leaves us with the following mapping between terms:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;C++&lt;/th&gt;
      &lt;th&gt;Rust&lt;/th&gt;
      &lt;th&gt;Example&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;thread-compatible&lt;/td&gt;
      &lt;td&gt;implements &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;most types (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vector&lt;/code&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;thread-safe&lt;/td&gt;
      &lt;td&gt;implements &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt; with interior mutability&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AtomicI32&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;thread-unsafe&lt;/td&gt;
      &lt;td&gt;doesn’t implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sync&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RefCell&lt;/code&gt; in Rust&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We can also put things in quadrants like so:&lt;/p&gt;

&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;Sync&lt;/code&gt;&lt;/th&gt;
    &lt;th&gt;&lt;code&gt;!Sync&lt;/code&gt;&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;th&gt;interior mutability&lt;/th&gt;
    &lt;td&gt;thread-safe (&lt;code&gt;AtomicI32&lt;/code&gt;)&lt;/td&gt;
    &lt;td&gt;thread-unsafe (&lt;code&gt;Cell&lt;/code&gt;)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;th&gt;no interior mutability&lt;/th&gt;
    &lt;td&gt;thread-compatible (&lt;code&gt;Vec&lt;/code&gt;)&lt;/td&gt;
    &lt;td&gt;thread-unsafe (&lt;code&gt;proc_macro&lt;/code&gt;)&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;The two quadrants under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!Sync&lt;/code&gt; are both labeled “thread-unsafe.” The C++
terminology does not distinguish between these two cases, and neither does the
C++ type system.  In Rust there are interesting differences between them.  With
interior mutability come types like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RefCell&lt;/code&gt; that can provide safe
interior mutability by constraining the circumstances under which mutation can
occur.  The “no interior mutability” case may not seem useful at first, but
&lt;a href=&quot;https://lobste.rs/s/bmejfu/thread_safety_c_rust#c_weauz9&quot;&gt;as pointed out to me on
lobste.rs&lt;/a&gt;, it can
actually be quite useful when combined with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!Send&lt;/code&gt;, as it allows one to create
a handle to thread-local data that can only be safely used within a single
thread.&lt;/p&gt;
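
&lt;p&gt;To make the interior-mutability half of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!Sync&lt;/code&gt; column concrete, here
is a minimal single-threaded sketch using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt; (the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hits&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt;, and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;pub&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Two shared references to one Cell; either can be used to mutate it.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hits&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This prints 2.  Nothing in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cell&lt;/code&gt; synchronizes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set()&lt;/code&gt; against other threads,
which is why the type is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!Sync&lt;/code&gt;: handing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;amp;hits&lt;/code&gt; to a second thread is
rejected at compile time.&lt;/p&gt;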

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to Matt Brubeck for reading a draft version of this article.
Matt has &lt;a href=&quot;https://limpet.net/mbrubeck/2019/02/07/rust-a-unique-perspective.html&quot;&gt;an article that delves more deeply into unique vs shared
references&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 18 Dec 2021 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2021/12/18/thread-safety-cpp-rust.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2021/12/18/thread-safety-cpp-rust.html</guid>
        
        
      </item>
    
    
      <item>
        <title>Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C</title>
        <description>&lt;p&gt;&lt;em&gt;[Note: there have been several developments in this space since this
article was published.  See &lt;a href=&quot;https://blog.reverberate.org/2025/02/10/tail-call-updates.html&quot;&gt;A Tail Calling Interpreter For Python (And Other Updates)&lt;/a&gt; for the latest information about
this technique.]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I just landed &lt;a href=&quot;https://reviews.llvm.org/D99517&quot;&gt;an exciting feature in the main branch of the Clang
compiler&lt;/a&gt;. Using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[[clang::musttail]]&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__attribute__((musttail))&lt;/code&gt; statement attributes, you can now get guaranteed
tail calls in C, C++, and Objective-C.&lt;/p&gt;

&lt;iframe width=&quot;800px&quot; height=&quot;200px&quot; src=&quot;https://godbolt.org/e#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:&apos;0&apos;,j:1,lang:___c,selection:(endColumn:4,endLineNumber:3,positionColumn:4,positionLineNumber:3,selectionStartColumn:4,selectionStartLineNumber:3,startColumn:4,startLineNumber:3),source:&apos;%0Aint+g(int)%3B%0Aint+f(int+x)+%7B%0A++++__attribute__((musttail))+return+g(x)%3B%0A%7D&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;C+source+%231&apos;,t:&apos;0&apos;)),k:50,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;),(g:!((h:compiler,i:(compiler:cclang_trunk,filters:(b:&apos;0&apos;,binary:&apos;1&apos;,commentOnly:&apos;0&apos;,demangle:&apos;0&apos;,directives:&apos;0&apos;,execute:&apos;1&apos;,intel:&apos;0&apos;,libraryCode:&apos;0&apos;,trim:&apos;1&apos;),fontScale:14,fontUsePx:&apos;0&apos;,j:2,lang:___c,libs:!(),options:&apos;-O2&apos;,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;x86-64+clang+(trunk)+(Editor+%231,+Compiler+%232)+C&apos;,t:&apos;0&apos;)),k:50,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;)),l:&apos;2&apos;,n:&apos;0&apos;,o:&apos;&apos;,t:&apos;0&apos;)),version:4&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;While tail calls are usually associated with a functional programming style, I
am interested in them purely for performance reasons.  It turns out that in
some cases we can use tail calls to get better code out of the compiler than
would otherwise be possible—at least given current compiler
technology—without dropping to assembly.&lt;/p&gt;

&lt;p&gt;Applying this technique to protobuf parsing has yielded amazing results: &lt;a href=&quot;https://github.com/protocolbuffers/upb/pull/310&quot;&gt;we
have managed to demonstrate protobuf parsing at over
2GB/s&lt;/a&gt;, more than double the
previous state of the art.  There are multiple techniques that contributed to
this speedup, so “tail calls == 2x speedup” is the wrong message to take away.
But tail calls are a key part of what made that speedup possible.&lt;/p&gt;

&lt;p&gt;In this blog entry I will describe why tail calls are such a powerful
technique, how we applied them to protobuf parsing, and how this technique
generalizes to interpreters. I think it’s likely that all of the major language
interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant
performance benefits by adopting this technique.  The main downside is
portability: currently &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; is a nonstandard compiler extension, and
while I hope it catches on it will be a while before it spreads widely enough
that your system’s C compiler is likely to support it.  That said, at build
time you can compromise some efficiency for portability if you detect that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; is not available.&lt;/p&gt;

&lt;h1 id=&quot;tail-call-basics&quot;&gt;Tail Call Basics&lt;/h1&gt;

&lt;p&gt;A tail call is any function call that is in tail position, meaning it is the final action to
be performed before a function returns.  When &lt;em&gt;tail call optimization&lt;/em&gt; occurs,
the compiler emits a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmp&lt;/code&gt; instruction for the tail call instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call&lt;/code&gt;.
This skips over the bookkeeping that would normally allow the callee &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;g()&lt;/code&gt; to
return back to the caller &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f()&lt;/code&gt;, like creating a new stack frame or pushing the
return address.  Instead &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f()&lt;/code&gt; jumps directly to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;g()&lt;/code&gt; as if it were part of
the same function, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;g()&lt;/code&gt; returns directly to whatever function called
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f()&lt;/code&gt;. This optimization is safe because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f()&lt;/code&gt;’s stack frame is no longer
needed once the tail call has begun, since it is no longer possible to access
any of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f()&lt;/code&gt;’s local variables.&lt;/p&gt;

&lt;p&gt;While this may seem like a run-of-the-mill optimization, it has two very
important properties that unlock new possibilities in the kinds of algorithms we can
write.  First, it reduces the stack memory from ++O(n)++ to ++O(1)++ when
making ++n++ consecutive tail calls, which is important because stack memory is
limited and stack overflow will crash your program.  This means that certain
algorithms are not actually safe to write unless this optimization is
performed.  Second, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmp&lt;/code&gt; eliminates the performance overhead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call&lt;/code&gt;, such
that a function call can be just as efficient as any other branch. These two
properties enable us to use tail calls as an efficient alternative to
normal iterative control structures like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;while&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is by no means a new idea; indeed, it goes back to at least 1977, when Guy
Steele wrote &lt;a href=&quot;http://dspace.mit.edu/handle/1721.1/5753&quot;&gt;an entire paper&lt;/a&gt;
arguing that procedure calls make for cleaner designs than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GOTO&lt;/code&gt;, and that
tail call optimization can make them just as fast. This was one of the &lt;a href=&quot;https://en.wikipedia.org/wiki/History_of_the_Scheme_programming_language#The_Lambda_Papers&quot;&gt;“Lambda
Papers”&lt;/a&gt;
written between 1975 and 1980 that developed many of the ideas underlying Lisp
and Scheme.&lt;/p&gt;

&lt;p&gt;Tail call optimization is not even new to Clang: like GCC and many other
compilers, Clang was already capable of optimizing tail calls. In fact, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; attribute in our first example above did not change the output of
the compiler at all: Clang would already have optimized the tail call under
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What is new is the &lt;em&gt;guarantee&lt;/em&gt;. While compilers will often optimize tail calls
successfully, this is best-effort, not something you can rely on. In
particular, the optimization will most likely not happen in non-optimized
builds:&lt;/p&gt;

&lt;iframe width=&quot;800px&quot; height=&quot;300px&quot; src=&quot;https://godbolt.org/e#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,fontUsePx:&apos;0&apos;,j:1,lang:___c,source:&apos;%0Avoid+g(int)%3B%0Avoid+f(int+x)+%7B%0A++++return+g(x)%3B%0A%7D&apos;),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;C+source+%231&apos;,t:&apos;0&apos;)),k:50,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;),(g:!((h:compiler,i:(compiler:cclang_trunk,filters:(b:&apos;0&apos;,binary:&apos;1&apos;,commentOnly:&apos;0&apos;,demangle:&apos;0&apos;,directives:&apos;0&apos;,execute:&apos;1&apos;,intel:&apos;0&apos;,libraryCode:&apos;0&apos;,trim:&apos;1&apos;),fontScale:14,fontUsePx:&apos;0&apos;,j:2,lang:___c,libs:!(),options:&apos;&apos;,source:1),l:&apos;5&apos;,n:&apos;0&apos;,o:&apos;x86-64+clang+(trunk)+(Editor+%231,+Compiler+%232)+C&apos;,t:&apos;0&apos;)),k:50,l:&apos;4&apos;,n:&apos;0&apos;,o:&apos;&apos;,s:0,t:&apos;0&apos;)),l:&apos;2&apos;,n:&apos;0&apos;,o:&apos;&apos;,t:&apos;0&apos;)),version:4&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Here the tail call was compiled to an actual &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call&lt;/code&gt;, so we are back to ++O(n)++
stack space. This is why we need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt;: unless we can get a guarantee from
the compiler that our tail calls will &lt;em&gt;always&lt;/em&gt; be optimized, in all build
modes, it isn’t safe to write algorithms that use tail calls for iteration.  It
would be a pretty severe limitation to have code that only works when
optimizations are enabled.&lt;/p&gt;
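
&lt;p&gt;As a minimal sketch of iteration written with tail calls, consider the function below.  The name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum_to&lt;/code&gt; is made up for illustration, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__has_attribute&lt;/code&gt; check is just one assumed way of detecting support.  Where the attribute is missing, the macro expands to nothing and the call is only optimized on a best-effort basis, so stack use can grow with ++n++ again.&lt;/p&gt;

```c
// Illustrative sketch only: sum_to is a hypothetical example function.
// Use musttail where the compiler provides it; otherwise fall back to a
// plain return, which loses the guarantee but still compiles.
#if defined(__has_attribute)
#if __has_attribute(musttail)
#define MUSTTAIL __attribute__((musttail))
#endif
#endif
#ifndef MUSTTAIL
#define MUSTTAIL
#endif

// Adds 1 + 2 + ... + n to acc.  The recursive call is the last action of
// the function and has the same signature as its caller, so it qualifies
// as a guaranteed tail call.
unsigned long sum_to(unsigned long n, unsigned long acc) {
  if (n == 0) return acc;
  MUSTTAIL return sum_to(n - 1, acc + n);
}
```

&lt;p&gt;A call such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum_to(10, 0)&lt;/code&gt; returns 55.  With the attribute in effect, each step reuses the caller’s frame, so the function behaves like a loop in every build mode.&lt;/p&gt;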

&lt;h1 id=&quot;the-trouble-with-interpreter-loops&quot;&gt;The Trouble With Interpreter Loops&lt;/h1&gt;

&lt;p&gt;Compilers are incredible pieces of technology, but they are not perfect.  Mike
Pall, author of LuaJIT, decided to write LuaJIT 2.x’s interpreter in assembly
rather than C, and he cites this decision as a major factor that explains &lt;a href=&quot;https://www.reddit.com/r/programming/comments/badl2/luajit_2_beta_3_is_out_support_both_x32_x64/c0lrus0/&quot;&gt;why
LuaJIT’s interpreter is so
fast&lt;/a&gt;.
He later went into more detail about &lt;a href=&quot;http://lua-users.org/lists/lua-l/2011-02/msg00742.html&quot;&gt;why C compilers struggle with interpreter
main loops&lt;/a&gt;.  His two
most central points are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The larger a function is, and the more complex and connected its control
flow, the harder it is for the compiler’s register allocator to keep the most
important data in registers.&lt;/li&gt;
  &lt;li&gt;When fast paths and slow paths are intermixed in the same function, the
presence of the slow paths compromises the code quality of the fast paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These observations closely mirror our experiences optimizing protobuf parsing.
The good news is that tail calls can help solve both of these problems.&lt;/p&gt;

&lt;p&gt;It may seem odd to compare interpreter loops to protobuf parsers, but the
nature of the protobuf wire format makes them more similar than you might
expect.  The protobuf wire format is a series of tag/value pairs, where the tag
contains a field number and wire type.  This tag acts similarly to an
interpreter opcode: it tells us what operation we need to perform to parse this
field’s data.  Like interpreter opcodes, protobuf field numbers can come in any
order, so we have to be prepared to dispatch to any part of the code at any
time.&lt;/p&gt;

&lt;p&gt;The natural way to write such a parser is to have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;while&lt;/code&gt; loop surrounding a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;switch&lt;/code&gt; statement, and indeed this has been the state of the art in protobuf
parsing for basically as long as protobufs have existed. For example, &lt;a href=&quot;https://github.com/protocolbuffers/protobuf/blob/f763a2a86084371fd0da95f3eeb879c2ff26b06d/src/google/protobuf/descriptor.pb.cc#L2175-L2227&quot;&gt;here is
some parsing code from the current C++ version of protobuf&lt;/a&gt;.  If we represent the
control flow graphically, we get something like this:&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;https://docs.google.com/drawings/d/e/2PACX-1vRKuMNm6Tw_Sn2xJxBDQMHT0u0osp9DUA0Ldr-MhMVp63GPinzYB9JT0qRY9HTymmsesomq3aZe7QEs/pub?w=740&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;But this is incomplete, because at almost every stage there are things that
can go wrong.  The wire type could be wrong, or we could see some corrupt
data, or we could just hit the end of the current buffer.  So the full
control flow graph looks more like this:&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;https://docs.google.com/drawings/d/e/2PACX-1vTGgQuAThUGv9ejI_pjujfKRM8rgo7c5b8lP7uveSkJTkJMbZDrtDbJzmRA4HOoDozwjj9WWPlDu8JX/pub?w=740&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;We want to stay on the fast paths (in blue) as much as possible, but when we
hit a hard case we have to execute some fallback code to handle it.  These
fallback paths are usually bigger and more complicated than the fast paths,
touch more data, and often even make out-of-line calls to other functions to
handle the more complex cases.&lt;/p&gt;

&lt;p&gt;Theoretically, this control flow graph paired with a profile should give the
compiler all of the information it needs to generate optimal code.  In
practice, when a function is this big and connected, we often find ourselves
fighting the compiler.  It spills an important variable when we want it kept
in a register.  It hoists stack frame manipulation that we want to shrink-wrap
around a fallback function invocation.  It merges identical code paths
that we wanted to keep separate for branch prediction reasons.  The experience
can end up feeling like trying to play the piano while wearing mittens.&lt;/p&gt;
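
&lt;p&gt;For reference, the loop-and-switch shape discussed in this section can be reduced to a toy example.  The bytecode and its three opcodes are hypothetical and have nothing to do with protobuf; the point is only that every operation lives inside one function, so the compiler has to allocate registers for all of them at once.&lt;/p&gt;

```c
// Toy bytecode with hypothetical opcodes, only to illustrate the shape.
enum { OP_INC, OP_DOUBLE, OP_HALT };

// Runs a program that is assumed to end with OP_HALT.
// Returns the accumulator, or -1 if an unknown opcode is encountered.
long run_switch(const unsigned char *pc) {
  long acc = 0;
  while (1) {
    switch (*pc++) {
      case OP_INC:
        acc += 1;
        break;
      case OP_DOUBLE:
        acc *= 2;
        break;
      case OP_HALT:
        return acc;
      default:
        // In a real parser this is where the large fallback code would
        // sit, sharing a function with the fast paths above.
        return -1;
    }
  }
}
```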

&lt;h1 id=&quot;improving-interpreter-loops-with-tail-calls&quot;&gt;Improving Interpreter Loops With Tail Calls&lt;/h1&gt;

&lt;p&gt;The analysis above is mainly just a rehash of Mike’s &lt;a href=&quot;http://lua-users.org/lists/lua-l/2011-02/msg00742.html&quot;&gt;observations about
interpreter main
loops&lt;/a&gt;.  But instead of
dropping to assembly, as Mike did with LuaJIT 2.x, we found that a tail call
oriented design could give us the control we needed to get nearly optimal code
from C.  I worked on this together with my colleague Gerben Stavenga, who came
up with much of the design.  Our approach is similar to the design of the
&lt;a href=&quot;https://github.com/wasm3/wasm3&quot;&gt;wasm3 WebAssembly interpreter&lt;/a&gt;, which describes
this pattern as a &lt;a href=&quot;https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md#m3-massey-meta-machine&quot;&gt;“meta
machine”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The code for our 2+GB/s protobuf parser was submitted to
&lt;a href=&quot;https://github.com/protocolbuffers/upb&quot;&gt;upb&lt;/a&gt;, a small protobuf library written
in C, in &lt;a href=&quot;https://github.com/protocolbuffers/upb/pull/310&quot;&gt;pull/310&lt;/a&gt;.  While it
is fully working and passing all protobuf conformance tests, it is not rolled
out anywhere yet, and the design has not been implemented in the C++ version of
protobuf.  But now that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; is available in Clang (and &lt;a href=&quot;https://github.com/protocolbuffers/upb/pull/390&quot;&gt;upb has been
updated to use it&lt;/a&gt;), one of
the biggest barriers to fully productionizing the fast parser has been removed.&lt;/p&gt;

&lt;p&gt;Our design does away with a single big parse function and instead gives each
operation its own small function.  Each function tail calls the next operation
in sequence.  For example, here is a function to parse a single fixed-width
field.  (This code is simplified from the actual code in upb; there are many
details of our design that I am leaving out of this article, but will hopefully
cover in future articles.)&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stddef.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upb_msg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_decstate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_decstate&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upb_decstate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// The standard set of arguments passed to each parsing function.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Thanks to x86-64 calling conventions, these will be passed in registers.&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define UPB_PARSE_PARAMS                                          \
  upb_decstate *d, const char *ptr, upb_msg *msg, intptr_t table, \
      uint64_t hasbits, uint64_t data
#define UPB_PARSE_ARGS d, ptr, msg, table, hasbits, data
&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))
&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UPB_PARSE_PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dispatch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UPB_PARSE_PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Code to parse a 4-byte fixed field that uses a 1-byte tag (field 1-15).&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upb_pf32_1bt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UPB_PARSE_PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Decode &quot;data&quot;, which contains information about this field.&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hasbit_index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ofs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;48&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UNLIKELY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Wire type mismatch (the dispatch function xor&apos;s the expected wire type&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// with the actual wire type, so data &amp;amp; 0xff == 0 indicates a match).&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;MUSTTAIL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UPB_PARSE_ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Advance past tag.&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Store data to message.&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;hasbits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1ull&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hasbit_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;msg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ofs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;ptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Advance past data.&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Call dispatch function, which will read the next tag and branch to the&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// correct field parser function.&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;MUSTTAIL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dispatch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UPB_PARSE_ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For a function this small and simple, Clang gives us code that is
basically impossible to beat.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;upb_pf32_1bt:                           # @upb_pf32_1bt
        mov     rax, r9
        shr     rax, 24
        bts     r8, rax
        test    r9b, r9b
        jne     .LBB0_1
        mov     r10, r9
        shr     r10, 48
        mov     eax, dword ptr [rsi + 1]
        mov     dword ptr [rdx + r10], eax
        add     rsi, 5
        jmp     dispatch                        # TAILCALL
.LBB0_1:
        jmp     fallback                        # TAILCALL
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that there is no prologue or epilogue, no register spills, indeed there is
no usage of the stack whatsoever.  The only exits are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmp&lt;/code&gt;s from the two tail
calls, and no code is required to forward the parameters, because the arguments
are already sitting in the correct registers.  Pretty much the only improvement
we could hope for is to get a conditional jump for the tail call, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jne
fallback&lt;/code&gt;, instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jne&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you were looking at a disassembly of this code without symbol information,
you would have no reason to know that this was an entire function. It could
just as easily be a basic block from a larger function.  And that, in essence,
is exactly what we are doing.  We are taking an interpreter loop that is
conceptually a big complicated function and programming it block by block,
transferring control flow from one to the next via tail calls.  We have full
control of the register allocation at every block boundary (well, for six
registers at least), and by using the same set of parameters for each function,
we eliminate the need to shuffle any values around from one call to the next.
As long as the function is simple enough to not spill those six registers,
we’ve achieved our goal of keeping our most important state in registers
throughout all of the fast paths.&lt;/p&gt;
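
&lt;p&gt;The same block-by-block structure can be sketched for a toy bytecode interpreter.  Everything here (the opcodes, the handler names, the table layout) is hypothetical and far simpler than upb.  The sketch assumes that every opcode in the program is valid and that the program ends with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OP_HALT&lt;/code&gt;, and it falls back to ordinary calls when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; is not detected.&lt;/p&gt;

```c
// Toy example with hypothetical names; not taken from upb.
// Use musttail where the compiler has it, and degrade to a plain call
// (without the guarantee) otherwise.
#if defined(__has_attribute)
#if __has_attribute(musttail)
#define MUSTTAIL __attribute__((musttail))
#endif
#endif
#ifndef MUSTTAIL
#define MUSTTAIL
#endif

// Every handler takes the same parameters in the same order, so the
// state can stay in the same registers across each tail call.
#define PARAMS const unsigned char *pc, long acc
#define ARGS pc, acc

enum { OP_INC, OP_DOUBLE, OP_HALT };

typedef long op_func(PARAMS);
static op_func op_inc, op_double, op_halt;
static op_func *const op_table[] = {op_inc, op_double, op_halt};

// Reads the opcode at pc and transfers control to its handler.
// Assumes the opcode is a valid index into op_table.
static long dispatch(PARAMS) {
  MUSTTAIL return op_table[*pc](ARGS);
}

static long op_inc(PARAMS) {
  acc += 1;
  pc += 1;  // Advance past this opcode.
  MUSTTAIL return dispatch(ARGS);
}

static long op_double(PARAMS) {
  acc *= 2;
  pc += 1;  // Advance past this opcode.
  MUSTTAIL return dispatch(ARGS);
}

// The only handler that returns normally, which ends the whole chain.
static long op_halt(PARAMS) {
  (void)pc;
  return acc;
}

// Entry point: starts the chain with an accumulator of zero.
long run_tail(const unsigned char *program) {
  return dispatch(program, 0);
}
```

&lt;p&gt;Each handler here plays the role of one basic block of the conceptual loop, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dispatch&lt;/code&gt; stands in for the jump back to the top of it.&lt;/p&gt;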

&lt;p&gt;We can optimize every instruction sequence independently, and crucially, the
compiler will treat each sequence as independent too because they are in
separate functions (we can prevent inlining with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;noinline&lt;/code&gt; if necessary).
This solves the problem we described earlier where the code from fallback paths
would degrade the code quality for fast paths. If we put the slow paths in
entirely separate functions from the fast paths, we can be guaranteed that
the fast paths will not suffer.  The nice assembly sequence we see above is
effectively frozen, unaffected by any changes we make to other parts of the
parser.&lt;/p&gt;

&lt;p&gt;If we apply this pattern to &lt;a href=&quot;https://www.reddit.com/r/programming/comments/badl2/luajit_2_beta_3_is_out_support_both_x32_x64/c0lrus0/&quot;&gt;Mike’s example from
LuaJIT&lt;/a&gt;,
we can more or less &lt;a href=&quot;https://godbolt.org/z/fd9dv7GGd&quot;&gt;match his hand-written assembly with only minor code
quality defects&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#define PARAMS unsigned RA, void *table, unsigned inst, \
               int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))
&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;ADDVN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;op_func&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RC&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RB&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;regs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UNLIKELY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;regs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;op&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;RA&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;MUSTTAIL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;op_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting assembly is:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;ADDVN:                                  # @ADDVN
        movzx   eax, dh
        cmp     dword ptr [r9 + 8*rax + 4], -12
        jae     .LBB0_1
        movzx   eax, dl
        movsd   xmm0, qword ptr [r8 + 8*rax]    # xmm0 = mem[0],zero
        mov     eax, edi
        addsd   xmm0, qword ptr [r9 + 8*rax]
        movsd   qword ptr [r9 + 8*rax], xmm0
        mov     edx, dword ptr [rcx]
        add     rcx, 4
        movzx   eax, dl
        movzx   edi, dh
        shr     edx, 16
        mov     rax, qword ptr [rsi + 8*rax]
        jmp     rax                             # TAILCALL
.LBB0_1:
        jmp     fallback
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only opportunity for improvement I see here, aside from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jne fallback&lt;/code&gt;
issue mentioned before, is that for some reason the compiler doesn’t want to
generate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmp qword ptr [rsi + 8*rax]&lt;/code&gt;. Instead it prefers to load into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rax&lt;/code&gt;
and then follow with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmp rax&lt;/code&gt;.  These are minor code generation issues that
could hopefully be fixed in Clang without too much work.&lt;/p&gt;

&lt;h1 id=&quot;limitations&quot;&gt;Limitations&lt;/h1&gt;

&lt;p&gt;One of the biggest caveats with this approach is that these beautiful assembly
sequences get catastrophically pessimized if any non-tail calls are present.
Any non-tail call forces a stack frame to be created, and a lot of data spills
to the stack.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#define PARAMS unsigned RA, void *table, unsigned inst, \
               int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))
&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;ADDVN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PARAMS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;op_func&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RC&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RB&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;regs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UNLIKELY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// When we leave off &quot;return&quot;, things get real bad.&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;regs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;op&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;RA&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;MUSTTAIL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;op_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ARGS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This leads to the very unfortunate:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;ADDVN:                                  # @ADDVN
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        push    rax
        mov     r15, r9
        mov     r14, r8
        mov     rbx, rcx
        mov     r12, rsi
        mov     ebp, edi
        movzx   eax, dh
        cmp     dword ptr [r9 + 8*rax + 4], -12
        jae     .LBB0_1
.LBB0_2:
        movzx   eax, dl
        movsd   xmm0, qword ptr [r14 + 8*rax]   # xmm0 = mem[0],zero
        mov     eax, ebp
        addsd   xmm0, qword ptr [r15 + 8*rax]
        movsd   qword ptr [r15 + 8*rax], xmm0
        mov     edx, dword ptr [rbx]
        add     rbx, 4
        movzx   eax, dl
        movzx   edi, dh
        shr     edx, 16
        mov     rax, qword ptr [r12 + 8*rax]
        mov     rsi, r12
        mov     rcx, rbx
        mov     r8, r14
        mov     r9, r15
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        jmp     rax                             # TAILCALL
.LBB0_1:
        mov     edi, ebp
        mov     rsi, r12
        mov     r13d, edx
        mov     rcx, rbx
        mov     r8, r14
        mov     r9, r15
        call    fallback
        mov     edx, r13d
        jmp     .LBB0_2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To avoid this, we tried to follow a discipline of only calling other functions
via inlining or tail calls.  This can get annoying if an operation has multiple
points at which an unusual case can occur that is not an error.  For example,
when we are parsing protobufs, the fast and common case is that varints are
only one byte long, but longer varints are not an error.  Handling the unusual
case inline can compromise the quality of the fast path if the fallback code is
too complicated.  But tail calling to a fallback function gives no way of
easily resuming the operation once the unusual case is handled, so the fallback
function must be capable of pushing forward and completing the operation.  This
leads to code duplication and complexity.&lt;/p&gt;
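
&lt;p&gt;As a minimal sketch of that trade-off (not upb’s actual code: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Msg&lt;/code&gt; type and both function names are hypothetical, and input bounds are assumed to be checked elsewhere), the fast path below handles a one-byte varint inline and otherwise hands off to a fallback in tail position.  Because control never comes back to the fast path, the fallback has to repeat the final store itself:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

/* Hypothetical message with a single varint field. */
typedef struct { uint64_t value; } Msg;

/* Fallback for varints longer than one byte.  It cannot resume the
   caller, so it decodes the value AND repeats the caller's final step. */
static const uint8_t *parse_field_slow(const uint8_t *p, Msg *msg) {
    uint64_t val = 0;
    for (int shift = 0; shift &amp;lt; 70; shift += 7) {  /* at most 10 bytes */
        uint64_t byte = *p++;
        val |= (byte &amp;amp; 0x7f) &amp;lt;&amp;lt; shift;
        if ((byte &amp;amp; 0x80) == 0) {
            msg-&amp;gt;value = val;  /* duplicated from the fast path */
            return p;
        }
    }
    return NULL;  /* no terminating byte within 10 bytes: malformed */
}

/* Fast path: the common one-byte varint is handled inline. */
const uint8_t *parse_field(const uint8_t *p, Msg *msg) {
    if ((*p &amp;amp; 0x80) == 0) {
        msg-&amp;gt;value = *p;
        return p + 1;
    }
    /* Call in tail position; a real interpreter would mark it musttail. */
    return parse_field_slow(p, msg);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here the duplicated work is a single store, but in a real parser everything that follows the varint would have to be reachable from both paths.&lt;/p&gt;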

&lt;p&gt;&lt;strong&gt;[Update 2025-01-27]&lt;/strong&gt;: Since this article was written, more solutions became
available for solving this problem through the use of different calling
conventions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://clang.llvm.org/docs/AttributeReference.html#preserve-most&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__attribute__((preserve_most))&lt;/code&gt;&lt;/a&gt;
is a calling convention we can use on our fallback functions that we perform
non-tail calls to.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_most&lt;/code&gt; attribute makes the callee
responsible for preserving nearly all registers, which moves the cost of the
register spills to the fallback functions where we want it.  There was
previously a Clang bug that led to crashes when using this attribute, but it
was &lt;a href=&quot;https://reviews.llvm.org/D141020&quot;&gt;fixed in
2023&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://clang.llvm.org/docs/AttributeReference.html#preserve-none&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__attribute__((preserve_none))&lt;/code&gt;&lt;/a&gt;
is a calling convention we can use on our tail calling functions, which frees
them of the burden of preserving any registers.  It also uses more registers
for arguments, and uses the traditional callee-save registers first, which
leads to fewer register shuffles when calling fallback functions that use
the regular calling convention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the two, I think &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preserve_none&lt;/code&gt; is the better option.  It’s
less intrusive, as it only needs to be added to the interpreter functions
themselves, rather than every function they might call (good luck changing the
calling convention on standard library functions like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strlen()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The other major limitation is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; is not portable.  I very much
hope that the attribute will catch on, spreading to GCC, Visual C++, and other
popular compilers, and even get standardized someday.  But that day is far off,
so what to do in the meantime?&lt;/p&gt;

&lt;p&gt;When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt; is not available, we need to perform at least one true
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;return&lt;/code&gt;, without a tail call, for every conceptual loop iteration.  We have
not yet implemented this fallback in upb, but I expect it will involve a macro
that either tail calls to dispatch or just returns, based on the availability
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musttail&lt;/code&gt;.&lt;/p&gt;
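
&lt;p&gt;One possible shape for such a macro is sketched below.  This is an illustration rather than upb code: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VM&lt;/code&gt; struct, the three opcodes, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISPATCH&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HAVE_MUSTTAIL&lt;/code&gt; names are all made up, and feature detection is assumed to go through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__has_attribute&lt;/code&gt;.  Each handler returns the next instruction pointer, so the same driver loop works in both modes:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

/* Assumption: musttail is detected with __has_attribute; compilers
   without it (or without __has_attribute) take the portable path. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define HAVE_MUSTTAIL 1
#  endif
#endif

typedef struct { long acc; } VM;  /* hypothetical interpreter state */
typedef const uint8_t *op_func(VM *vm, const uint8_t *pc);

enum { OP_INC, OP_DOUBLE, OP_HALT };  /* hypothetical one-byte opcodes */
static op_func op_inc, op_double, op_halt;
static op_func *const op_table[] = { op_inc, op_double, op_halt };

#ifdef HAVE_MUSTTAIL
/* Tail call straight into the next handler. */
#  define DISPATCH(vm, pc) \
    __attribute__((musttail)) return op_table[*(pc)]((vm), (pc))
#else
/* True return: hand the next pc back to the loop in run(). */
#  define DISPATCH(vm, pc) return (pc)
#endif

static const uint8_t *op_inc(VM *vm, const uint8_t *pc) {
    vm-&amp;gt;acc += 1;
    DISPATCH(vm, pc + 1);
}

static const uint8_t *op_double(VM *vm, const uint8_t *pc) {
    vm-&amp;gt;acc *= 2;
    DISPATCH(vm, pc + 1);
}

static const uint8_t *op_halt(VM *vm, const uint8_t *pc) {
    (void)vm;
    (void)pc;
    return NULL;  /* NULL tells the driver loop to stop */
}

long run(const uint8_t *code) {
    VM vm = {0};
    const uint8_t *pc = code;
    /* With musttail the first call runs all the way to OP_HALT; without
       it, this loop performs one dispatch per instruction. */
    while (pc != NULL) {
        pc = op_table[*pc](&amp;amp;vm, pc);
    }
    return vm.acc;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The portable path gives up the register-resident state between instructions, since every handler returns to the loop, but the handler bodies stay identical across both builds.&lt;/p&gt;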
</description>
        <pubDate>Wed, 21 Apr 2021 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html</guid>
        
        
      </item>
    
    
      <item>
        <title>Hoare’s Rebuttal and Bubble Sort’s Comeback</title>
        <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Editor’s note&lt;/strong&gt;: For this blog entry I welcome my friend and colleague Gerben Stavenga
as a guest author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Recently Andrei Alexandrescu published an interesting
&lt;a href=&quot;https://dlang.org/blog/2020/05/14/lomutos-comeback/&quot;&gt;post&lt;/a&gt; about optimizing
QuickSort using the Lomuto partition scheme. The essence of that post is that
for many situations the performance of QuickSort is completely dominated by
branch mispredicts and that a big speed up can be achieved by writing
branchless code. This has been observed by many, and various branchless
sorting routines have been proposed. Andrei observed that, of the two
well-known QuickSort partitioning schemes, Lomuto is easily implemented without
branches, and that this indeed performs much better for sorting small
primitives. I recently experimented with similar ideas and took them in a
different direction. I discovered that a hybrid of the Hoare and Lomuto schemes can
deliver a large improvement even compared with branchless Lomuto. And the 
final surprise is that Bubble Sort takes the crown for small arrays. The key
to all these wins is exploiting instruction-level parallelism and reducing 
dependency chains.&lt;/p&gt;

&lt;h1 id=&quot;basic-quicksort-fundamentals&quot;&gt;Basic QuickSort fundamentals&lt;/h1&gt;

&lt;p&gt;QuickSort refers to a class of algorithms for sorting an array that all share the same outline:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;QuickSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kCutOff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ChoosePivotElement&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Important but not focus here&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// The main work loop&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;QuickSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;QuickSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Tail call, ideally the largest sub-interval&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SortSmallArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Countless variations exist, differing in the choice of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kCutOff&lt;/code&gt;, the
sorting algorithm used for small arrays, and the pivot element. These choices
matter for performance, but the main work QuickSort performs is done in
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Partition&lt;/code&gt; function. There are two canonical schemes for implementing
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Partition&lt;/code&gt;: the original Hoare scheme and the Lomuto scheme. The Hoare
partition scheme works by swapping elements that violate the partition property
from the front of the array with elements from the back of the array,
processing the array from the outside inwards converging on the partition point
somewhere in the middle.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;HoarePartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ScanForward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ScanBackward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;swap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
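
&lt;p&gt;The outline above leaves &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ScanForward&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ScanBackward&lt;/code&gt; undefined.  One way to fill them in is sketched here in plain C, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt; and the two scans written inline.  It assumes a half-open range and the same convention as the Lomuto code, where elements smaller than the pivot end up in front of the returned pointer:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/* Partitions [left, right) so that every element before the returned
   pointer is &amp;lt; pivot and every element from it onwards is &amp;gt;= pivot. */
int *hoare_partition(int pivot, int *left, int *right) {
    for (;;) {
        /* Scan forward to the first element that belongs in the back. */
        while (left &amp;lt; right &amp;amp;&amp;amp; *left &amp;lt; pivot) left++;
        /* Scan backward to the last element that belongs in the front. */
        while (left &amp;lt; right &amp;amp;&amp;amp; right[-1] &amp;gt;= pivot) right--;
        if (left == right) return left;
        /* *left and right[-1] are both misplaced: swap, then step inwards. */
        right--;
        int tmp = *left;
        *left = *right;
        *right = tmp;
        left++;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both inner loops exit on a data-dependent comparison, which is exactly the kind of branch that is hard to predict on random input.&lt;/p&gt;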

&lt;p&gt;In contrast, the Lomuto scheme processes the array from front to back,
maintaining a properly partitioned array for the elements processed so far at
each step.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;LomutoPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;swap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Could be a self-swap&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a remarkably simple loop, with the additional property that it is stable with
respect to the ordering of the elements smaller than the pivot. It’s easy to
see that the above loop can be implemented without conditional branches: the
conditional swap can be implemented with conditional moves, and the
conditional pointer increment can be implemented as an unconditional
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p += (*it &amp;lt; pivot)&lt;/code&gt;. Andrei’s post shows the performance gain of this simple branchless
loop over production-quality implementations in various standard libraries.&lt;/p&gt;
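
&lt;p&gt;Spelled out in plain C, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T&lt;/code&gt;, a branchless version of the loop might look like the sketch below.  The function name is made up for illustration, and whether the two selects really become conditional moves is up to the compiler, so it is worth checking the generated assembly:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/* Branchless Lomuto partition of the half-open range [left, right).
   Returns the first position holding an element &amp;gt;= pivot. */
int *lomuto_partition_branchless(int pivot, int *left, int *right) {
    int *p = left;
    for (int *it = left; it &amp;lt; right; it++) {
        int val = *it;
        int prev_val = *p;
        int is_smaller = val &amp;lt; pivot;  /* 0 or 1 */
        /* Conditional swap written as two selects, stored unconditionally.
           When it == p this is a harmless self-swap. */
        *it = is_smaller ? prev_val : val;
        *p = is_smaller ? val : prev_val;
        /* Unconditional pointer update replaces the if. */
        p += is_smaller;
    }
    return p;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The only branch left is the loop condition, which is easy to predict because it depends on the array length rather than on the values being sorted.&lt;/p&gt;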

&lt;h1 id=&quot;performance-analysis-of-branchless-lomuto-partitioning&quot;&gt;Performance analysis of branchless Lomuto partitioning&lt;/h1&gt;

&lt;p&gt;Here I want to take the optimizations further and dive deeper into the
performance analysis of sorting algorithms. When conditional branching is
removed, the performance of code tends to become much more stable and easier to
understand; data dependencies become key. We are going to analyse the code in
terms of a self-explanatory pseudo-assembly code, where named variables should
be thought of as CPU registers and loads/stores to memory are made explicit. In
this notation the basic loop above becomes:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;loop:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;              &lt;span class=&quot;c1&quot;&gt;// 1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;prev_val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;          &lt;span class=&quot;c1&quot;&gt;// 2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// 3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;valnew&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prev_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// 4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;prev_val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prev_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 5&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;valnew&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// 6&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prev_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;          &lt;span class=&quot;c1&quot;&gt;// 7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;// 8&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;                        &lt;span class=&quot;c1&quot;&gt;// 9&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;How would it run on a parallel out-of-order CPU? As a first approximation we
can pretend that the CPU is arbitrarily parallel, i.e. it can execute an unlimited
number of instructions in parallel as long as they are independent. What is
important to understand is the loop-carried dependency chains. They determine
the minimum latency at which an iteration of the loop can possibly run. In the
above code you see that the dependencies between iterations of the loop are
carried by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;it&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt;. Only
lines 8 and 9 participate in the loop-carried chain and both are single-cycle
instructions. So we determine that the loop could potentially run at a
throughput of one iteration per cycle.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;https://docs.google.com/drawings/d/e/2PACX-1vRAynXWaiSSZ6s3cBKOPoFo6eDfIa3FUmM7ODtwX_8BblTjOVMvGw_JnnlDJqZBjQ5fA8AHcnnzKfPo/pub?w=740&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;However, this dependency analysis is not quite correct. There are also loads and
stores to memory. If you store something to a certain memory address and later
load from that address, you must read the value of the previous store. That
means loads depend on earlier stores, at least when the addresses overlap.
Here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;it&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; are dynamic values and they can certainly overlap, as depicted by
the dashed lines in the diagram above. So let’s add the fact that the loads at
lines 1 and 2 depend on the stores at lines 6 and 7.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;https://docs.google.com/drawings/d/e/2PACX-1vSZZnMmf6LSF4uaN-VFVghW4TztbhvBGlByk6Aqqkl60dNxR_MpCxii8DD2fKb6eIpirQdCKe_UhtK0/pub?w=740&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;This completely changes the game. Now there is a long loop-carried data
dependency: lines 6 and 7 depend on lines 4 and 5, which both depend on line 3,
which depends on the loads at lines 1 and 2, which potentially depend on the
stores at lines 6 and 7 of the previous iteration. If we count the cycles, we
get 5 cycles for the loads (the loads themselves can be done in parallel), 1 cycle for
the comparison, 1 cycle for the conditional move and 1 cycle for the store,
hence this loop will run at ~8 cycles per iteration. A far cry from the 1 cycle
per iteration our cursory discussion indicated.&lt;/p&gt;

&lt;p&gt;Although reordering stores and loads is not valid in general, doing so is
essential for the performance of a CPU. Let’s take a simple memcpy loop:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;loop:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the load cannot be reordered with the previous store, this loop has a latency
of 5 + 1 = 6 cycles per iteration. However, in memcpy it’s guaranteed that loads
and stores never overlap. If the CPU instead reorders them, the above loop
executes with a throughput of one iteration per cycle. Ignoring the instructions
needed for control flow, its execution would look like this:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;val_0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Cycle 1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Cycle 2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Cycle 3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val_3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it_3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Cycle 4&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// The value of the load at cycle 1 becomes available.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// From this point all instructions of the loop are executed each cycle.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val_4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;val_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Cycle 5&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val_5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it_5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it_5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;val_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Cycle 6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In practice most of the stores preceding a load in the instruction stream are
in fact to different addresses, so it is possible to reorder loads in front of
them. CPUs therefore do reorder loads in front of stores, which is called
speculative loading, but the effects of the load are only committed once
it’s verified that no store has invalidated the speculative load. If a preceding
store does invalidate the load, execution is rolled back to the load
and the CPU starts over. One can imagine that this is very costly, much like
a branch mispredict. While a lot of stores and loads are to different
addresses, there are also plenty of stores and loads to the same address; think
about register spills, for example. Therefore the CPU uses a prediction model
based on instruction address to determine whether a load depends on
previous stores. In general the CPU is pretty conservative, because the cost of
a wrong reordering is very high. In the code above the CPU will encounter loads
from the same address as recent stores and will be hesitant to do the necessary
reordering.&lt;/p&gt;

&lt;h2 id=&quot;revisiting-the-lomuto-partition-scheme&quot;&gt;Revisiting the Lomuto partition scheme&lt;/h2&gt;

&lt;p&gt;Looking closer, it’s the load from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; that is problematic. The load from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;it&lt;/code&gt;
is in fact always from a different address than the previous stores.
Furthermore, the load from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; is also responsible for a lot of extra work in the
loop. It is necessary because otherwise the store to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; would corrupt values in the
array. The values that would be overwritten were encountered earlier in the
scan and are the elements that are larger than the pivot. If we instead
save these values in a temporary buffer, there is no need to swap.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Distributes the elements of [left, right): smaller ones are packed at the&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// front of the array, the others are written backwards from the end of scratch.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DistributeForward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch_end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;ptrdiff_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_larger&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_larger&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch_end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_larger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ModifiedLomutoPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DistributeForward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// To complete the partition we need to copy the elements in scratch&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// to the end of the array.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a much simpler loop, with only one load and one store per iteration. More
importantly, the load will never clash with a previous store. This loop runs
much faster than the original loop: not at 1 cycle per iteration, but at 2.5
cycles on my machine. This indicates that it’s saturating the ILP of the
CPU. Unfortunately the above code is no longer in-place; it requires O(n)
additional memory for the scratch buffer.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;https://docs.google.com/drawings/d/e/2PACX-1vTOJvXw1e0spAQWGEdogAHPoHsRd0YAjDia9EeZs9s9b2RPiO_plokZH-Q3jiRjr7cAZsVG2aa6fR_v/pub?w=740&quot; /&gt;
&lt;/center&gt;

&lt;h2 id=&quot;the-elegant-hybrid&quot;&gt;The elegant hybrid&lt;/h2&gt;

&lt;p&gt;If instead we use a smallish fixed-size temporary buffer, we can still use the
above code, except that we need to abort when the fixed buffer is full. What do we do
then? The function returns the partition point &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; where it left the loop. At
this point &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[left, p)&lt;/code&gt; holds the elements smaller than the pivot, correctly placed at the front
of the array. The scratch buffer is full of elements larger than or equal to the pivot,
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[p, p + kScratchSize)&lt;/code&gt; contains information we don’t need anymore. The
idea is that we can run the same algorithm backwards, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[p, p +
kScratchSize)&lt;/code&gt; as a temporary buffer. Notice how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DistributeForward()&lt;/code&gt;
fills the scratch buffer from back to front; the backward version would fill
the scratch from front to back. So performing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DistributeBackward()&lt;/code&gt; using the
interval &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[p, p + kScratchSize)&lt;/code&gt; as scratch will neatly pack all smaller
elements encountered into the correct place. This continues until the scratch
space is full, but by then a new scratch space has opened up at the end of the array.
Wait, this looks like Hoare’s algorithm, but hybridized with the Lomuto-inspired
distribute function.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;ModifiedHoarePartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pleft&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DistributeForward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pleft&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pleft&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pleft&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pleft&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pleft&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DistributeBackward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DistributeForward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pivot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scratch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kScratchSize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What we end up with is an in-place algorithm that’s almost branch-free.
The precise number of iterations in the modified Lomuto partitioning scheme
depends on the outcome of the comparisons and will be difficult to predict
exactly right. However, the loop is guaranteed to iterate at least &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kScratchSize&lt;/code&gt;
times. This basically amortises the cost of branch mispredictions over many
elements, making them irrelevant for performance. I consider this to be a truly
elegant design.&lt;/p&gt;

&lt;h2 id=&quot;fallback-for-sorting-short-arrays&quot;&gt;Fallback for sorting short arrays&lt;/h2&gt;

&lt;p&gt;The next step is the fallback for short arrays, where the overhead of recursion
starts dominating. In the literature insertion sort is most often recommended,
together with countless micro-optimizations. I found that at this point
partitioning was so fast that QuickSort beat insertion sort all the way down to
just a few elements. The problem is that insertion sort has unpredictable
branches, basically 1 miss per insert. The solution is bubble sort. Bubble sort
has a very predictable access pattern and the swap can be implemented
branchless. With a little more optimizing you discover that you don’t need a
swap at all. One can keep the maximum in a register and store the minimum.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;BubbleSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the inner loop above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max&lt;/code&gt; is the variable on the longest
loop-carried dependency chain, with a compare and a conditional move on it. This makes
the loop execute in 2 cycles per iteration. However, we can do better. If,
instead of bubbling up only the max, we bubble up the two largest elements, we only
need to run the bubbling stage ++n/2++ times instead of ++n++ times. It
turns out that with a clever implementation one can bubble two elements in the
same 2 cycles.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;BubbleSort2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;swap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
      &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;arr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The benchmarks confirm that this indeed makes a difference.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BubbleSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;400700000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BubbleSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;155200000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BubbleSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;         &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;52600000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BubbleSort2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;514500000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BubbleSort2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;256700000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BubbleSort2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;78400000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InsertionSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;                      &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;          &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;183600000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InsertionSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;                     &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;         &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;63500000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_SmallSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InsertionSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;                    &lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;         &lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;42000000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In fact, it’s possible to generalize this: one can bubble up the top ++N++ elements in
a constant number of cycles per iteration, in a model where the CPU has unlimited ILP.&lt;/p&gt;

&lt;h2 id=&quot;bringing-it-all-together&quot;&gt;Bringing it all together&lt;/h2&gt;

&lt;p&gt;What are the results? The results are nothing short of spectacular. The
following is a simple benchmark sorting 100000 ints. The time is normalized
by the number of elements, so it’s the amount of time spent per element. I’m
using clang + libc++ here, as gcc is dramatically worse at emitting branch-free
code.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;CPU:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Intel&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Skylake&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Xeon&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HyperThreading&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;36&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cores&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dL1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KB&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dL2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KB&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dL3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MB&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Benchmark&lt;/span&gt;                         &lt;span class=&quot;n&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;CPU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;     &lt;span class=&quot;n&quot;&gt;Iterations&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;------------------------------------------------------------------------&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_Sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std_sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;                       &lt;span class=&quot;mf&quot;&gt;51.6&lt;/span&gt;           &lt;span class=&quot;mf&quot;&gt;51.6&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;10000000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_Sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std_stable_sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;                &lt;span class=&quot;mf&quot;&gt;65.6&lt;/span&gt;           &lt;span class=&quot;mf&quot;&gt;65.6&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;10000000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_Sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_qsort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;                      &lt;span class=&quot;mf&quot;&gt;90.4&lt;/span&gt;           &lt;span class=&quot;mf&quot;&gt;90.5&lt;/span&gt;      &lt;span class=&quot;mi&quot;&gt;7800000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_Sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;andrei_sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;                    &lt;span class=&quot;mf&quot;&gt;32.6&lt;/span&gt;           &lt;span class=&quot;mf&quot;&gt;32.6&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;21500000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BM_Sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_gerbens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QuickSort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;         &lt;span class=&quot;mf&quot;&gt;16.4&lt;/span&gt;           &lt;span class=&quot;mf&quot;&gt;16.4&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;43200000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re talking about a 2x win over Andrei’s implementation as copied from his
GitHub. The code is available on my &lt;a href=&quot;https://github.com/gerben-s/quicksort-blog-post&quot;&gt;GitHub&lt;/a&gt;,
although it doesn’t include benchmarks for the code published by Andrei, as
his code didn’t come with a license.&lt;/p&gt;

&lt;p&gt;We’ve seen how crucial it is to understand data dependencies in order to
optimize code. Hidden memory dependencies between loads and stores, in particular,
can greatly influence the performance of work loops. Understanding the data
dependency graph of code is often where the real performance gains lie, yet
very little attention is given to it in the blogosphere. I’ve read many
articles about the impact of branch mispredictions and the importance of data locality
and caches, but far fewer about data dependencies. I bet that a question like
“why are linked lists slow?” is answered by many in terms of locality, caches
or unpredictable random memory access. At least I’ve heard those reasons often;
even &lt;a href=&quot;https://youtu.be/YQs6IC-vgmo?t=215&quot;&gt;Stroustrup&lt;/a&gt; says as much. Those
reasons can play a part, but they are not the main one. Fundamentally, iterating
a linked list has a load-to-use latency on the critical path, making it 5 times slower
than iterating a flat array. Furthermore, accessing flat arrays allows loop
unrolling, which can further improve ILP.&lt;/p&gt;

&lt;h1 id=&quot;why-is-quicksort-fast-compared-to-merge-and-heap-sort&quot;&gt;Why is QuickSort fast compared to merge and heap sort?&lt;/h1&gt;

&lt;p&gt;This brings us to the question of why QuickSort is fast compared to other sorts with
equally good or even better theoretical complexity. It’s all about data dependencies.
The QuickSort partition loop above has a distinctive feature: the
element it will process next does not depend on the outcome of the comparisons
in previous iterations. Compare this to merge sort. In merge sort the two head
elements are compared, but the next elements that need to be compared
depend on the outcome of this comparison. It’s trivial to implement merge sort
branch free. It will look like this:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// 1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// 2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tmp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;right_idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// 3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It takes about 8 cycles per iteration to update &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;right_idx&lt;/code&gt;: a load with
non-trivial indexing at line 1 (6 cycles), plus lines 2 and 3 at 1 cycle each.
A similar analysis holds for heap sort: restoring the heap property requires
comparing the two children and recursing into the subtree of the bigger child.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tmp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Again this is 8 cycles on the loop-carried dependency chain. This goes to the heart
of why QuickSort is fast even though theoretically it has
inferior behavior compared to heap and merge sort. By construction, heap and
merge sort divide the data very evenly in the implied tree structure: the heap
is a binary tree of minimum depth, as is the recursive subdivision
that merge sort performs. This means that the number of comparisons they do is about
++n \lg(n)++, which tightly hugs the information-theoretical lower bound of
++\lg(n!)++.&lt;/p&gt;

&lt;p&gt;In contrast, QuickSort bets on obtaining a reasonably balanced tree with high
probability. The amount of information extracted from a comparison with the
pivot depends on the pivot’s rank in the array. Only pivots close to the median
yield 1 bit of information per comparison. Solving the QuickSort recurrence
for a uniformly random pivot gives ++2n \ln(n)++ comparisons on average. Hence
QuickSort does a factor of ++2 \ln(2) \approx 1.39++ more comparisons than heap
sort or merge sort on average, or equivalently more iterations of the inner work
loop. However, the enormous speed difference in the basic work loop more than
compensates for this information-theoretic factor.&lt;/p&gt;
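
&lt;p&gt;As a short worked check of these figures, the derivation below uses only the
two comparison counts quoted above plus Stirling’s approximation, which is a
standard result and not something taken from the benchmarks in this post.&lt;/p&gt;

```latex
% Lower bound on comparisons, via Stirling's approximation:
\[
\lg(n!) = n \lg n - n \lg e + O(\lg n) \approx n \lg n - 1.44\,n
\]
% Ratio of average QuickSort comparisons to the n lg n of heap/merge sort:
\[
\frac{2 n \ln n}{n \lg n} = \frac{2 \ln n}{\ln n / \ln 2} = 2 \ln 2 \approx 1.386
\]
```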

&lt;p&gt;Also, when partitioning big arrays, spending a little work on improving the
choice of pivot with a median of three brings this factor down to ~1.2.
Further improvements can get this factor rather close to 1. The main point here
is that this factor is dwarfed by the difference in throughput of the work
loop.&lt;/p&gt;

&lt;p&gt;We can significantly speed up merge sort with a simple trick. Because of its
long dependency chain, the merge sort loop runs at a very low IPC. This
basically means we can add more instructions for free. In particular, merge sort
has a backward equivalent: we can merge forward and backward in a single loop
while keeping the latency of the loop body the same. This also eliminates an
awkward exit condition, as you can now unconditionally iterate n/2 times, and it
roughly halves the number of iterations.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;BM_Sort&amp;lt;exp_gerbens::QuickSort&amp;gt;                       10.7           10.7     65011712
BM_MergeSort&amp;lt;MergeSort&amp;gt;                               23.4           23.4     29884416
BM_MergeSort&amp;lt;MergeSortDouble&amp;gt;                         13.2           13.2     52756480
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
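
&lt;p&gt;The bidirectional merge described above can be sketched roughly as follows.
This is a minimal illustration, not the benchmarked
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MergeSortDouble&lt;/code&gt;: the name
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge_bidirectional&lt;/code&gt; is hypothetical,
and the sketch assumes both input runs have the same length, which is what
makes the unconditional iteration count safe. Whether a compiler turns the
selects into conditional moves is not guaranteed.&lt;/p&gt;

```c
/* Hypothetical sketch: merges two sorted runs a[0..n) and b[0..n) of EQUAL
 * length n into out[0..2n). Each iteration emits one element at the front and
 * one at the back, so the loop runs exactly n times with no exhaustion checks. */
void merge_bidirectional(const int *a, const int *b, int n, int *out) {
  int lo_a = 0, lo_b = 0;         /* forward cursors  */
  int hi_a = n - 1, hi_b = n - 1; /* backward cursors */
  for (int k = 0; k != n; k++) {
    /* Forward step: emit the smaller head; ties go to a. */
    int take_b = a[lo_a] > b[lo_b];
    out[k] = take_b ? b[lo_b] : a[lo_a];
    lo_b += take_b;
    lo_a += !take_b;
    /* Backward step: emit the larger tail; ties go to b. */
    int take_a = a[hi_a] > b[hi_b];
    out[2 * n - 1 - k] = take_a ? a[hi_a] : b[hi_b];
    hi_a -= take_a;
    hi_b -= !take_a;
  }
}
```

&lt;p&gt;The two steps in the loop body do not depend on each other, so they can
overlap in the pipeline. Because the forward cursors have advanced by only
k positions in total at step k, neither can run off the end of its run before
the loop finishes; the same holds for the backward cursors.&lt;/p&gt;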

&lt;p&gt;Another trick is preloading values for the next iteration:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;next_val1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;next_val2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tmp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;next_val1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmov&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;next_val2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;right_idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_smaller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Within a single iteration, the increment of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;right_idx&lt;/code&gt; does not depend
on a load, as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;val1&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;val2&lt;/code&gt; are already available at the start of the
iteration. Over two iterations, however, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;right_idx&lt;/code&gt; depends on itself.
This chain is ~8 cycles long but spread over 2 iterations, which gives a
throughput of ~4 cycles per iteration. Combining these two tricks could bring
merge sort on par with the simple partition loop of QuickSort. However, it’s
just a mitigation. If, instead of sorting simple primitive values, we sorted
pointers to structs, the comparison operator would need an extra load, which
immediately adds to the critical path. QuickSort is immune to this: even rather
costly comparisons do not influence its ability to make progress. Of course, if
the comparison itself suffers from frequent branch misses, that will limit
the ability to overlap different stages of the iteration in the CPU pipeline.&lt;/p&gt;

&lt;p&gt;Many thanks to Josh for the discussions, helpful suggestions, and insights, and
for providing space on his blog for this post.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;&lt;a href=&quot;https://news.ycombinator.com/item?id=23363165&quot;&gt;Discuss this article on Hacker News.&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 29 May 2020 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html</guid>
        
        
      </item>
    
    
      <item>
        <title>Optimizing UTC → Unix Time Conversion For Size And Speed</title>
        <description>&lt;p&gt;How do you convert a UTC timestamp to &lt;a href=&quot;https://en.wikipedia.org/wiki/Unix_time&quot;&gt;Unix
Time&lt;/a&gt; (seconds since the
epoch)?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &quot;2020-04-29 04:48:15&quot; → 1588135695
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course the right answer is “you use a standard library function.”
But what if you don’t have one available? Or what if you’re the person
implementing that library?&lt;/p&gt;

&lt;p&gt;Converting the time portion is trivial. Unix Time pretends that leap
seconds do not exist and makes every day exactly 86,400 seconds long.
This is a fib on systems that implement UTC leap second insertion&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;,
but it makes the algorithm very simple:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;time_t&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;hms_to_time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3600&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But the calendar part is more challenging.  Months have unequal
lengths, and leap years complicate everything.  Leap years insert an
extra day at the end of February whenever:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the year is divisible by four&lt;/li&gt;
  &lt;li&gt;excluding years divisible by 100&lt;/li&gt;
  &lt;li&gt;but &lt;em&gt;including&lt;/em&gt; years divisible by 400&lt;/li&gt;
&lt;/ol&gt;
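
&lt;p&gt;The three rules above can be written as a small predicate. This is a
minimal sketch for illustration; the name
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_leap_year&lt;/code&gt; is hypothetical
and the function is not part of upb.&lt;/p&gt;

```c
/* Hypothetical helper: returns 1 if `year` is a leap year in the Gregorian
 * calendar, and 0 otherwise. */
int is_leap_year(int year) {
  if (year % 4 != 0) return 0;   /* Rule 1: must be divisible by four...   */
  if (year % 100 != 0) return 1; /* Rule 2: ...and not a century year...   */
  return year % 400 == 0;        /* Rule 3: ...unless divisible by 400.    */
}
```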

&lt;p&gt;Surprisingly, no version of C includes a UTC → Unix Time conversion
function in the standard library. There is a non-standard function
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt;, but its use is discouraged. The &lt;a href=&quot;http://man7.org/linux/man-pages/man3/timegm.3.html#CONFORMING_TO&quot;&gt;Linux
manpage&lt;/a&gt;
says:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  These functions are nonstandard GNU extensions that are also present
  on the BSDs.  Avoid their use.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And &lt;a href=&quot;https://www.freebsd.org/cgi/man.cgi?query=timegm&amp;amp;apropos=0&amp;amp;sektion=3&amp;amp;manpath=FreeBSD+6.2-RELEASE&amp;amp;format=html#end&quot;&gt;on
BSD&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  The timegm() function is not specified by any standard; its function
  cannot be completely emulated using the standard functions described above.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I needed an algorithm to perform this UTC→UnixTime conversion for the
JSON parser in &lt;a href=&quot;https://github.com/protocolbuffers/upb&quot;&gt;upb&lt;/a&gt;. The
&lt;a href=&quot;https://developers.google.com/protocol-buffers/docs/proto3#json&quot;&gt;JSON mapping for Protocol
Buffers&lt;/a&gt;
says that timestamps are formatted using strings like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1972-01-01T10:00:20.021Z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once we have parsed the individual numbers out of such a timestamp
string, we need a way of translating to seconds since the Unix Epoch,
which is the &lt;a href=&quot;https://github.com/protocolbuffers/protobuf/blob/bb3460d71b2f2cd75f10efe94d739e15561c2ccf/src/google/protobuf/timestamp.proto#L43-L138&quot;&gt;internal representation of the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;google.protobuf.Timestamp&lt;/code&gt;
type&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since upb is written in C and intended to be portable, I needed to
roll my own. Since upb aims to be as small and fast as possible, I
became very interested in the problem of how far this algorithm could
be pushed in size and speed.&lt;/p&gt;

&lt;h1 id=&quot;a-fortran-solution-from-1968&quot;&gt;A Fortran Solution from 1968&lt;/h1&gt;

&lt;p&gt;Several leads pointed me to a &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/364096.364097?download=true&quot;&gt;Letter to the
Editor&lt;/a&gt;
published in &lt;em&gt;Communications of the ACM&lt;/em&gt; in October 1968 (Volume 11,
Number 10). In this letter Henry F. Fliegel and Thomas C. Van Flandern
offered the following Fortran function for performing this conversion.
This function uses integer math (all divisions truncate toward zero):&lt;/p&gt;

&lt;div class=&quot;language-fortran highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;JD&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;I&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;K&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;K&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32075&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1461&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;I&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4800&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;367&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
             &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;I&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4900&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;J&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JD&lt;/code&gt; here refers to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Julian_day&quot;&gt;Julian
Date&lt;/a&gt;, the number of days
that have passed since the Julian Date Epoch. This epoch is very long
ago, in 4713 BC, but it’s a simple matter to convert it to the Unix
epoch of 1970-01-01 by subtracting 2440588, the Julian Day number of
the Unix Epoch. The variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;J&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K&lt;/code&gt; are the year, month,
and day, respectively.&lt;/p&gt;

&lt;p&gt;Once we have mapped a Y/M/D to a total number of days, it is easy to
multiply this by 86,400 seconds and add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hms_to_time()&lt;/code&gt; to get a full
Unix Time result.&lt;/p&gt;
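
&lt;p&gt;That final step might look like the sketch below. The helper name
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unix_time_from_days&lt;/code&gt; is
hypothetical, and it assumes the day count relative to 1970-01-01 has already
been computed by some other function.&lt;/p&gt;

```c
/* Hypothetical helper: combines a day count relative to 1970-01-01 with a
 * time of day. Every Unix day is exactly 86,400 seconds long, so this is
 * plain arithmetic. long long avoids overflow where int is 32 bits. */
long long unix_time_from_days(long long days, int h, int m, int s) {
  return days * 86400LL + h * 3600 + m * 60 + s;
}
```

&lt;p&gt;For the example at the top of this article, 2020-04-29 is day 18381 after
the epoch, so
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unix_time_from_days(18381, 4, 48, 15)&lt;/code&gt;
yields 1588135695.&lt;/p&gt;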

&lt;p&gt;The function above is remarkable. It does not use any lookup table,
and yet it somehow encodes the irregular pattern of month lengths and
leap year rules. It is trivial to convert this Fortran function to C
using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; data type, whose division also truncates toward zero
(apart from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;J - 14&lt;/code&gt; terms, which rely on that truncation, the algorithm
ensures that all numbers being divided are non-negative, as long as
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;year &amp;gt; -4800&lt;/code&gt;).&lt;/p&gt;
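
&lt;p&gt;A direct C transcription might look like the following sketch. The function
names are hypothetical, and the code assumes C99 or later, where integer
division truncates toward zero; the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(month - 14) / 12&lt;/code&gt; term
depends on that behavior because its dividend is negative.&lt;/p&gt;

```c
/* Hypothetical transcription of the Fliegel and Van Flandern formula.
 * Returns the Julian Day number for a Gregorian year/month/day. */
int julian_day(int year, int month, int day) {
  /* -1 for January and February, 0 for every other month. This relies on
   * C99 division truncating toward zero for the negative (month - 14). */
  int a = (month - 14) / 12;
  return day - 32075
       + 1461 * (year + 4800 + a) / 4
       + 367 * (month - 2 - a * 12) / 12
       - 3 * ((year + 4900 + a) / 100) / 4;
}

/* Days since 1970-01-01, using the Julian Day offset quoted in the text. */
int epoch_days_fortran(int year, int month, int day) {
  return julian_day(year, month, day) - 2440588;
}
```

&lt;p&gt;As a sanity check, 1970-01-01 maps to Julian Day 2440588 and therefore to
day 0 of the Unix epoch.&lt;/p&gt;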

&lt;h1 id=&quot;improving-readability-with-a-lookup-table&quot;&gt;Improving Readability With A Lookup Table&lt;/h1&gt;

&lt;p&gt;In the interest of having an algorithm that is easier to understand
than the Fortran algorithm above, I wrote the following algorithm &lt;a href=&quot;https://github.com/protocolbuffers/upb/blob/22182e6e54e892f93f26d0522487997d30f604af/upb/json/parser.rl#L1697-L1706&quot;&gt;for
upb’s old JSON
parser&lt;/a&gt;.
All of the general techniques used here are published in various
places; I just condensed them into the smallest and simplest algorithm
I could.  This uses the Unix epoch instead of a Julian Day.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* https://github.com/protocolbuffers/upb/blob/22182e6e/upb/json/parser.rl#L1697
 * epoch_days_table(1970, 1, 1) == 0, i.e. 1970-01-01 is day 0. */&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;epoch_days_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;month&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint16_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;month_yday&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;59&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;120&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;151&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                          &lt;span class=&quot;mi&quot;&gt;181&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;212&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;243&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;273&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;304&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;334&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4800&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;cm&quot;&gt;/* Ensure positive year, multiple of 400. */&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;febs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;cm&quot;&gt;/* Februaries since base. */&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leap_days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;febs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;febs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;febs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;400&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;365&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leap_days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;month_yday&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;day&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2472692&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;cm&quot;&gt;/* Adjust to Unix epoch. */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This algorithm uses a lookup table to find the number of days that
precede each month within a year. It requires that its divisions
round down. But C rounds towards zero&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, so, as in the Fortran
algorithm above, we start by adding 4800 so that the year is always
positive.&lt;/p&gt;

&lt;p&gt;This algorithm is nice and readable. It is small and fast.  There is
only one noticeable downside if we are being performance-obsessed: it
relies on a lookup table, which adds a bit of latency in the best
case, or a lot of latency if the lookup table is not in cache.&lt;/p&gt;

&lt;h1 id=&quot;eliminating-the-lookup-table&quot;&gt;Eliminating the Lookup Table&lt;/h1&gt;

&lt;p&gt;More recently I came across &lt;a href=&quot;http://howardhinnant.github.io/date_algorithms.html#days_from_civil&quot;&gt;the following
algorithm&lt;/a&gt;
published by Howard Hinnant. Like the Fortran algorithm, it avoids the
lookup table. But its logic is a bit easier to follow, and Howard
includes &lt;a href=&quot;http://howardhinnant.github.io/date_algorithms.html#days_from_civil&quot;&gt;an extensive explanation for how it works&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Returns number of days since civil 1970-01-01.  Negative values indicate&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//    days prior to 1970-01-01.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Preconditions:  y-m-d represents a date in the civil (Gregorian) calendar&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                 m is in [1, 12]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                 d is in [1, last_day_of_month(y, m)]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                 y is &quot;approximately&quot; in&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                   [numeric_limits&amp;lt;Int&amp;gt;::min()/366, numeric_limits&amp;lt;Int&amp;gt;::max()/366]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                 Exact range of validity is:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                 [civil_from_days(numeric_limits&amp;lt;Int&amp;gt;::min()),&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//                  civil_from_days(numeric_limits&amp;lt;Int&amp;gt;::max()-719468)]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Int&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;days_from_civil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;noexcept&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;static_assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numeric_limits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;digits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             &lt;span class=&quot;s&quot;&gt;&quot;This algorithm has not been ported to a 16 bit unsigned integer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;static_assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numeric_limits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;digits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             &lt;span class=&quot;s&quot;&gt;&quot;This algorithm has not been ported to a 16 bit signed integer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;era&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;399&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;400&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yoe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;era&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;400&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;// [0, 399]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;153&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// [0, 365]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yoe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;365&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yoe&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yoe&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;         &lt;span class=&quot;c1&quot;&gt;// [0, 146096]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;era&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;146097&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;719468&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This algorithm replaces the month lookup table with the clever
expression:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  DayOfYear(adjusted_month) = (153 * adjusted_month + 2) / 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This formula describes a line with slope of 153/5, or 30.6, which
is between 30 and 31 days in each month. But it is cleverly chosen to
exploit truncating integer division to step in the same pattern that
months do.&lt;/p&gt;
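
&lt;p&gt;To see the stepping concretely, here is a minimal sketch of the formula as a C function. The function names are hypothetical and chosen for illustration, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m_adj&lt;/code&gt; is assumed to be the March-based month (0 for March through 11 for February):&lt;/p&gt;

```c
/* Illustrative helper (name is hypothetical, not from the article).
 * m_adj is the March-based month: 0 = March, 1 = April, ..., 11 = February.
 * Returns the number of days from March 1 to the first day of that month. */
unsigned days_before_month_div5(unsigned m_adj) {
  return (153 * m_adj + 2) / 5;
}

/* Length of one March-based month, recovered as the difference between
 * consecutive steps of the formula. Only meaningful for m_adj in [0, 10],
 * because February (11) has no following step within the same year. */
unsigned month_length_div5(unsigned m_adj) {
  return days_before_month_div5(m_adj + 1) - days_before_month_div5(m_adj);
}
```

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;days_before_month_div5(1)&lt;/code&gt; is 31 (all of March) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;days_before_month_div5(2)&lt;/code&gt; is 61 (March plus April), so the truncation yields the 31, 30, 31 pattern without a table.&lt;/p&gt;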

&lt;p&gt;My colleague Gerben Stavenga realized that there are many such
expressions of this form that will calculate the same value. If we use
a variant where the divisor is a power of two, the compiler can
optimize the division to a right shift. Specifically we can use:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  DayOfYear(adjusted_month) = (62719 * adjusted_month + 769) / 2048
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
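
&lt;p&gt;The two lines have slightly different slopes (62719/2048 is roughly 30.62, versus 30.6), so the equivalence is only claimed for the twelve month values. A small sketch that checks exactly that range, using hypothetical function names:&lt;/p&gt;

```c
/* Hypothetical names, for illustration only. */
unsigned month_days_div5(unsigned m_adj) { return (153 * m_adj + 2) / 5; }
unsigned month_days_pow2(unsigned m_adj) { return (62719 * m_adj + 769) / 2048; }

/* Returns 1 if both formulas agree for every March-based month in [0, 11]. */
int month_formulas_agree(void) {
  unsigned m_adj;
  for (m_adj = 0; m_adj != 12; m_adj++) {
    if (month_days_div5(m_adj) != month_days_pow2(m_adj)) return 0;
  }
  return 1;
}
```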

&lt;p&gt;With some help from Gerben I arrived at &lt;a href=&quot;https://github.com/protocolbuffers/upb/blob/22182e6e/upb/json_decode.c#L982-L992&quot;&gt;this formulation for the
current version of
upb&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* https://github.com/protocolbuffers/upb/blob/22182e6e/upb/json_decode.c#L982
 * epoch_days_fast(1970, 1, 1) == 1970-01-01 == 0. */&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;epoch_days_fast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year_base&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4800&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    &lt;span class=&quot;cm&quot;&gt;/* Before min year, multiple of 400. */&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;       &lt;span class=&quot;cm&quot;&gt;/* March-based month. */&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;carry&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjust&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;carry&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;year_base&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;carry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;month_days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjust&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;62719&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;769&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leap_days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;400&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_adj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;365&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leap_days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;month_days&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2472632&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that although this algorithm has some ternary operators, it
&lt;a href=&quot;https://godbolt.org/z/bSv4-e&quot;&gt;compiles into fully branchless code&lt;/a&gt;
with one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmov&lt;/code&gt; in it. In fact, all the algorithms given in this
article compile to branchless code (on x86-64 using recent versions of
Clang).&lt;/p&gt;

&lt;h1 id=&quot;semantics-and-correctness&quot;&gt;Semantics and Correctness&lt;/h1&gt;

&lt;p&gt;Before we compare these algorithms for size and speed, a few notes
about their semantics and correctness.&lt;/p&gt;

&lt;h2 id=&quot;supported-input-range&quot;&gt;Supported Input Range&lt;/h2&gt;

&lt;p&gt;I verified all the algorithms in this article across a wide range of
dates (-4712-01-01 to 22666-12-20, or Julian Date 1 to 10,000,000).
For every day in this range I checked that the outputs match each
other, and that they match the results of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt;. There was a snag
on macOS, as its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt; function fails for all years prior to
1900, so I disabled it for those years.&lt;/p&gt;

&lt;p&gt;The interval [-4712, 22666] is a very wide range of years, and likely
to be far beyond what is needed in practice. That said, these
algorithms effectively support much wider bounds than this.  All of
them support dates up to at least 1000000-01-01 (year 1 million).&lt;/p&gt;

&lt;p&gt;Most of the algorithms given here use the year -4800 as their base,
and will return incorrect results for years before that. However the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;days_from_civil()&lt;/code&gt; algorithm is coded to support over 10 million years
into the past (assuming 32-bit integers). I think any of these
algorithms could likely be adjusted to support whatever window of
years is desired. The year -4800 is frequently chosen as a base out of
convention, as it is before the Julian Day epoch, which is before any
recorded historical events.&lt;/p&gt;

&lt;h2 id=&quot;input-normalization&quot;&gt;Input Normalization&lt;/h2&gt;

&lt;p&gt;Unlike the algorithms in this article, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt; function
normalizes its input. For example, if you pass a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;month&lt;/code&gt; of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-1&lt;/code&gt; to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt;, that indicates December of the previous year. The
algorithms given here can return incorrect results if the month is out
of range, or can even crash with out-of-bounds array access if you are
using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;epoch_days_table()&lt;/code&gt; function. So any normalization/checking
of the month value must happen prior to calling the function.&lt;/p&gt;
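
&lt;p&gt;As a rough sketch of what such a pre-processing step could look like, the hypothetical helper below folds an out-of-range month into the year. It assumes the 1-based months that the functions in this article take, so the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt; month of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-1&lt;/code&gt; (which is 0-based) corresponds to month 0 here. It does not validate the day:&lt;/p&gt;

```c
/* Hypothetical pre-processing step, not part of the article's code.
 * Folds an out-of-range 1-based month into [1, 12] and carries the
 * overflow into the year, so that month 0 becomes December of the
 * previous year and month 13 becomes January of the next year. */
void normalize_year_month(int *y, int *m) {
  int m0 = *m - 1;        /* Zero-based month, possibly out of range. */
  int carry = m0 / 12;    /* C99 division truncates toward zero. */
  m0 -= carry * 12;       /* Remainder, in [-11, 11]. */
  if (0 > m0) {           /* Fix up the negative remainder case. */
    m0 += 12;
    carry -= 1;
  }
  *y += carry;
  *m = m0 + 1;
}
```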

&lt;p&gt;Similarly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt; will write the normalized values back to the
input &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct tm&lt;/code&gt;, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tm_yday&lt;/code&gt; (day of year) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tm_wday&lt;/code&gt;
(day of week) values that it calculated from the day, month, and year.
The algorithms in this article do not calculate day of week or day of
year.&lt;/p&gt;

&lt;p&gt;I did not need these features for my use case, so for me these
represent unnecessary overhead. But it is worth noting that this is
not a completely apples-to-apples comparison. It could be an
interesting exercise to add these features so that the algorithms
given here could be a drop-in replacement for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;leap-seconds&quot;&gt;Leap Seconds&lt;/h2&gt;

&lt;p&gt;When we use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hms_to_time()&lt;/code&gt; function above, these algorithms
return correct results around leap seconds. The primary test case for
this is to verify:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  UnixTime(1998, 12, 31, 23, 59, 60) -&amp;gt; 915148800
  UnixTime(1999,  1,  1,  0,  0,  0) -&amp;gt; 915148800
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These values follow the table given in the &lt;a href=&quot;https://en.wikipedia.org/wiki/Unix_time#Leap_seconds&quot;&gt;Wikipedia Article about
Unix Time around leap
seconds&lt;/a&gt;. These
algorithms handle leap seconds correctly without any special code.
The correct answers for seconds=60 fall out of the algorithm “for
free”, as we are simply summing seconds.&lt;/p&gt;
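
&lt;p&gt;The arithmetic can be sketched as follows. The helper name and signature are assumptions made for illustration (the article’s own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hms_to_time()&lt;/code&gt; may differ), and it takes a precomputed day count since 1970-01-01:&lt;/p&gt;

```c
/* Hypothetical helper; the article's own hms_to_time() may differ in
 * signature. days is the day count since 1970-01-01. */
long long unix_time_from_days(long long days, int h, int m, int s) {
  return days * 86400 + h * 3600 + m * 60 + s;
}
```

&lt;p&gt;1998-12-31 is day 10591 and 1999-01-01 is day 10592, so 23:59:60 on the former and 00:00:00 on the latter both sum to 10592 * 86400 = 915148800.&lt;/p&gt;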

&lt;h1 id=&quot;size-and-speed-benchmarks&quot;&gt;Size and Speed Benchmarks&lt;/h1&gt;

&lt;p&gt;Here are microbenchmarks of the algorithms above, using the
&lt;a href=&quot;https://github.com/google/benchmark&quot;&gt;google/benchmark framework&lt;/a&gt;.
Code for these benchmarks is available
&lt;a href=&quot;https://github.com/haberman/blog-code/tree/master/date-algorithms&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I added the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hms_to_time()&lt;/code&gt; code to each algorithm so they would be
more comparable to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt; function. I also put the functions
in a separate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cc&lt;/code&gt; file from the benchmark loop so that they could
not be inlined, to make the comparison with libc more fair.  I also
translated the 1968 algorithm from Fortran into C++.&lt;/p&gt;

&lt;p&gt;On my Linux Desktop i7-8700K with glibc 2.30, I get:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Running bazel-bin/date-algorithms
Run on (12 X 4700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_YMDToUnix_Fortran                5 ns          5 ns  148069867
BM_YMDToUnix_Table                  2 ns          2 ns  288071123
BM_YMDToUnix_DaysFromCivil          4 ns          4 ns  176474060
BM_YMDToUnix_Fast                   3 ns          3 ns  269155478
BM_timegm_libc                     46 ns         46 ns   15360235
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All of the algorithms from this article are in the same general
ballpark, at 2 to 5 ns per call.  The algorithm in libc is noticeably slower, at 46 ns.&lt;/p&gt;

&lt;p&gt;On my Mac Laptop running macOS 10.15.4, the disparity is even more
stark:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Running bazel-bin/date-algorithms
Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 9437K (x1)
------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_YMDToUnix_Fortran                6 ns          6 ns  116857534
BM_YMDToUnix_Table                  3 ns          3 ns  213395767
BM_YMDToUnix_DaysFromCivil          5 ns          5 ns  147238851
BM_YMDToUnix_Fast                   3 ns          3 ns  211537241
BM_timegm_libc                   5421 ns       5413 ns     122749
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I looked at a profile and cross-referenced the Darwin code in
&lt;a href=&quot;https://opensource.apple.com/source/Libc/Libc-594.1.4/stdtime/FreeBSD/localtime.c&quot;&gt;localtime.c&lt;/a&gt;.
It appears that almost all of the time is attributable to the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gmtsub()&lt;/code&gt; function, of which about 70% is mutex lock/unlock to lazily
load time zone data for the “GMT” time zone.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timesub()&lt;/code&gt; function
appears to take most of the remainder.&lt;/p&gt;

&lt;p&gt;It is surprising that a lock would add this much overhead, as &lt;a href=&quot;https://gist.github.com/jboner/2841832&quot;&gt;Jeff
Dean’s Latency Numbers Every Programmer Should
Know&lt;/a&gt; tells us to expect only
25ns for a mutex lock/unlock. Note that the benchmarks above are
single-threaded; under contention &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timegm()&lt;/code&gt; on macOS degrades even
more, while the others are unaffected (as they do not access any
global state):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Running bazel-bin/date-algorithms
Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 9437K (x1)
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_YMDToUnix_Fortran/threads:6                1 ns          6 ns  119878410
BM_YMDToUnix_Table/threads:6                  1 ns          3 ns  206599374
BM_YMDToUnix_DaysFromCivil/threads:6          1 ns          5 ns  139911858
BM_YMDToUnix_Fast/threads:6                   1 ns          3 ns  208116546
BM_timegm_libc/threads:6                  14795 ns      86855 ns       8178
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;YMDToUnix_Table&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;YMDToUnix_Fast&lt;/code&gt; are basically tied, though as
mentioned before the table-based algorithm can slow down drastically
(on the order of 100ns) if the table is not in cache, and is
vulnerable to crashes if the month is out of range.&lt;/p&gt;

&lt;p&gt;For size profiling I used my tool &lt;a href=&quot;https://github.com/google/bloaty&quot;&gt;Bloaty
McBloatface&lt;/a&gt;. I display VM size
only, so we see the parts of the binary that will actually be loaded
into memory:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ bloaty -d symbols,sections &apos;--source-filter=^YMD|yday&apos; --domain=vm \
  bazel-bin/date-algorithms
     VM SIZE
 -------------- 
  31.1%     236    YMDToUnix_Fortran()
    86.4%     204    .text
    10.2%      24    .eh_frame
     3.4%       8    .eh_frame_hdr
  27.7%     210    YMDToUnix_DaysFromCivil()
    81.0%     170    .text
    15.2%      32    .eh_frame
     3.8%       8    .eh_frame_hdr
  20.0%     152    YMDToUnix_Fast()
    78.9%     120    .text
    15.8%      24    .eh_frame
     5.3%       8    .eh_frame_hdr
  18.1%     137    YMDToUnix_Table()
    76.6%     105    .text
    17.5%      24    .eh_frame
     5.8%       8    .eh_frame_hdr
   3.2%      24    epoch_days_table()::month_yday
   100.0%      24    .rodata
 100.0%     759    TOTAL
Filtering enabled (source_filter); omitted  128Ki of entries
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All these functions have nearly the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.eh_frame&lt;/code&gt; overhead (this is
unwind information, used for debugging and stack traces), so we can
set that aside.  When we account for the table size, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;YMDToUnix_Fast&lt;/code&gt;
ends up smallest, narrowly beating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;YMDToUnix_Table&lt;/code&gt;. But these
differences are very minor.  All of these algorithms are very small
and fast.&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A UTC day is 86,401 seconds long whenever a leap second occurs.
  Maintaining the fib that every day is 86,400 seconds long
  &lt;a href=&quot;https://en.wikipedia.org/wiki/Unix_time#Leap_seconds&quot;&gt;requires Unix Time to be discontinuous around a leap
  second&lt;/a&gt;.
  To avoid this complication, &lt;a href=&quot;https://developers.google.com/time/smear&quot;&gt;some systems have started
  “smearing” seconds around the leap second
  instead&lt;/a&gt;, so that
  every day is still exactly 86,400 seconds long, and it’s merely
  the &lt;em&gt;length&lt;/em&gt; of each second that changes. This deviates from
  UTC, but solves lots of practical problems. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Round-to-zero was standardized in C99. &lt;a href=&quot;https://stackoverflow.com/a/3604984/77070&quot;&gt;Before that the
  results for negative numbers were
  implementation-defined&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 12 May 2020 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2020/05/12/optimizing-date-algorithms.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2020/05/12/optimizing-date-algorithms.html</guid>
        
        
      </item>
    
    
      <item>
        <title>Bloaty McBloatface 1.0</title>
        <description>&lt;p&gt;Today I am releasing &lt;a href=&quot;https://github.com/google/bloaty/releases/tag/v1.0&quot;&gt;Bloaty McBloatface 1.0&lt;/a&gt;.
Bloaty is a size profiler for binaries.  It helps you peek into
ELF/Mach-O binaries to see what is taking up space inside.&lt;/p&gt;

&lt;p&gt;Bloaty has gotten lots of new features, bugfixes, and overall improvements since
&lt;a href=&quot;http://blog.reverberate.org/2016/11/07/introducing-bloaty-mcbloatface.html&quot;&gt;I announced it in
2016&lt;/a&gt;.
I listed these changes briefly on &lt;a href=&quot;https://github.com/google/bloaty/releases/tag/v1.0&quot;&gt;the release
page&lt;/a&gt;, but I wanted to go
into a bit more detail here.&lt;/p&gt;

&lt;h1 id=&quot;improving-data-quality&quot;&gt;Improving Data Quality&lt;/h1&gt;

&lt;p&gt;Perhaps the biggest overall improvement to Bloaty is its data quality.
When I first announced Bloaty, I got very understandable complaints
&lt;a href=&quot;https://news.ycombinator.com/item?id=13917242&quot;&gt;like this one&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I ran it and it gives an awful lot of “[None]”:&lt;/p&gt;
  &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    $ ~/d/bloaty/bloaty builder/virt-builder -d compileunits
         VM SIZE                            FILE SIZE
     --------------                      --------------
      75.5%  1.96Mi [None]                3.67Mi  85.2%
       8.7%   232Ki guestfs-c-actions.c    232Ki   5.3%
       8.2%   219Ki guestfs.ml             219Ki   5.0%
       2.0%  52.4Ki [Other]               52.4Ki   1.2%
       1.3%  33.7Ki _none_                33.7Ki   0.8%
       0.7%  17.5Ki customize_cmdline.ml  17.5Ki   0.4%
       0.6%  17.3Ki builder.ml            17.3Ki   0.4%
       0.4%  11.8Ki customize_run.ml      11.8Ki   0.3%
       0.4%  10.4Ki cmdline.ml            10.4Ki   0.2%
       0.3%  7.08Ki firstboot.ml          7.08Ki   0.2%
       0.2%  6.21Ki index-scan.c          6.21Ki   0.1%
       0.2%  5.90Ki index_parser.ml       5.90Ki   0.1%
       0.2%  5.15Ki sigchecker.ml         5.15Ki   0.1%
       0.2%  4.87Ki getopt-c.c            4.87Ki   0.1%
    [...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
  &lt;p&gt;It’s a mixed OCaml/C executable, but I ran it on a build from the local directory and all debug symbols are still available.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Indeed, a profiler tool that has no idea what to say about 85.2%
of the binary is not going to be very useful.  This was Bloaty’s
biggest weakness when I first released it.&lt;/p&gt;

&lt;p&gt;At first I misunderstood the nature of this problem.
Bloaty’s design at the time was simple: it was reading
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_aranges&lt;/code&gt; to assign ranges of the binary to
compilation units.  DWARF’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_aranges&lt;/code&gt; section is an
{address range -&amp;gt; compileunit} map that debuggers use to
decide what compile unit a given function or data variable
is from, given its address.&lt;/p&gt;
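
&lt;p&gt;As a toy model of that kind of map (the types and names below are hypothetical, and this is neither Bloaty’s code nor the actual DWARF encoding), a lookup might be sketched like this:&lt;/p&gt;

```c
/* Toy model only: these types and names are hypothetical and are not
 * Bloaty's or DWARF's actual data structures. */
struct addr_range {
  unsigned long start;      /* First address covered. */
  unsigned long end;        /* One past the last address covered. */
  const char *compile_unit; /* Name reported for addresses in the range. */
};

/* Returns the compile unit covering addr, or "[None]" if no range does. */
const char *lookup_compile_unit(const struct addr_range *ranges, int count,
                                unsigned long addr) {
  int i;
  for (i = 0; i != count; i++) {
    if (addr >= ranges[i].start) {
      if (ranges[i].end > addr) return ranges[i].compile_unit;
    }
  }
  return "[None]";
}
```

&lt;p&gt;In this model, any address that no range covers falls through to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[None]&lt;/code&gt;.&lt;/p&gt;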

&lt;p&gt;The output above indicates that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_aranges&lt;/code&gt; was only
covering about 15% of the binary.  What gives?  My theory at
the time was that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_aranges&lt;/code&gt; should be
covering the whole binary, but was incomplete for
some reason.  It seemed like a compiler problem that I was
going to have to work around somehow.&lt;/p&gt;

&lt;p&gt;Later I realized that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_aranges&lt;/code&gt; is only meant for
identifying addresses of &lt;em&gt;functions or data&lt;/em&gt;.  Large portions
of the binary are not functions or program data!  For example,
ELF/Mach-O binaries have all sorts of stuff in them, like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;symbol tables&lt;/li&gt;
  &lt;li&gt;relocations&lt;/li&gt;
  &lt;li&gt;debug information&lt;/li&gt;
  &lt;li&gt;unwind information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get better results, I needed to find a way to break down
these sections.  I needed to determine which parts of the
unwind information, for example, I could attribute to each
function.&lt;/p&gt;

&lt;p&gt;To achieve this, I had to parse the binary more thoroughly
than I had before.  I had to learn to parse unwind
information (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.eh_frame&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.eh_frame_hdr&lt;/code&gt; sections),
which is a really esoteric and low-level thing to be doing.
I’ll quote &lt;a href=&quot;https://github.com/google/bloaty/blob/a03e47a05db1815f1c98a22523d4373bbcb1d08e/src/dwarf.cc#L1738-L1760&quot;&gt;my comment in the code&lt;/a&gt; about how tricky this
is to do correctly:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Code to read the .eh_frame section.  This is not technically DWARF, but it
// is similar to .debug_frame (which is DWARF) so it&apos;s convenient to put it
// here.
//
// The best documentation I can find for this format comes from:
//
// * http://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/ehframechpt.html
// * https://www.airs.com/blog/archives/460
//
// However these are both under-specified.  Some details are not mentioned in
// either of these (for example, the fact that the function length uses the FDE
// encoding, but always absolute).  libdwarf&apos;s implementation contains a comment
// saying &quot;It is not clear if this is entirely correct&quot;.  Basically the only
// thing you can trust for some of these details is the code that actually
// implements unwinding in production:
//
// * libunwind http://www.nongnu.org/libunwind/
//   https://github.com/pathscale/libunwind/blob/master/src/dwarf/Gfde.c
// * LLVM libunwind (a different project!!)
//   https://github.com/llvm-mirror/libunwind/blob/master/src/DwarfParser.hpp
// * libgcc
//   https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-fde.c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once I implemented this parser, I could attribute the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.eh_frame&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.eh_frame_hdr&lt;/code&gt; sections properly, and they
would no longer show up as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[None]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I did the same thing for all the different kinds of DWARF
debug info, as well as for the symbol/string tables and relocations.
All of these are somewhat easier since they at least have
clear standards that describe them.&lt;/p&gt;

&lt;p&gt;But even that wasn’t enough.  After implementing all of the
above, I still found that some parts of the data section
don’t have symbol table entries or debug info at all.  Data
like string constants or other anonymous data can resist
being properly analyzed and attributed.  To combat this,
Bloaty will actually disassemble the binary looking for
references to the data section.  If a function references
part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.data&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.rodata&lt;/code&gt;, then we can attribute that
part of the binary to the function that references it.&lt;/p&gt;

&lt;p&gt;This was hard and detailed work, but it paid off.  We can
see the fruits of this labor if we do a hierarchical
profile:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ./bloaty bloaty -d compileunits,sections
     VM SIZE                                                      FILE SIZE
 --------------                                                --------------
  44.9%  2.07Mi [136 Others]                                    8.72Mi  33.7%
   6.0%   281Ki protobuf/src/google/protobuf/descriptor.cc      4.07Mi  15.7%
       0.0%       0 .debug_str                                      1.16Mi  28.4%
       0.0%       0 .debug_info                                     1.01Mi  24.8%
       0.0%       0 .debug_loc                                       766Ki  18.4%
       0.0%       0 .debug_pubnames                                  383Ki   9.2%
      69.6%   195Ki .text                                            195Ki   4.7%
       0.0%       0 .debug_line                                      177Ki   4.3%
       0.0%       0 .debug_pubtypes                                  158Ki   3.8%
       0.0%       0 .debug_ranges                                    131Ki   3.2%
       0.0%       0 .strtab                                         44.6Ki   1.1%
      14.6%  41.2Ki .dynstr                                         41.2Ki   1.0%
       7.1%  19.8Ki .eh_frame                                       19.8Ki   0.5%
       4.6%  12.8Ki .rodata                                         12.8Ki   0.3%
       0.0%       0 .symtab                                         9.45Ki   0.2%
       3.1%  8.62Ki .dynsym                                         8.62Ki   0.2%
       1.0%  2.79Ki .eh_frame_hdr                                   2.79Ki   0.1%
       0.0%      88 .bss                                                 0   0.0%
   6.5%   306Ki protobuf/src/google/protobuf/descriptor.pb.cc   2.38Mi   9.2%
       0.0%       0 .debug_info                                      660Ki  27.1%
       0.0%       0 .debug_loc                                       620Ki  25.4%
       0.0%       0 .debug_str                                       256Ki  10.5%
       0.0%       0 .debug_pubnames                                  166Ki   6.8%
       0.0%       0 .debug_line                                      163Ki   6.7%
      53.2%   163Ki .text                                            163Ki   6.7%
       0.0%       0 .debug_ranges                                    154Ki   6.3%
       0.0%       0 .strtab                                         71.1Ki   2.9%
      22.3%  68.3Ki .dynstr                                         68.3Ki   2.8%
      10.0%  30.8Ki .eh_frame                                       30.8Ki   1.3%
       0.0%       0 .symtab                                         27.2Ki   1.1%
       8.6%  26.4Ki .dynsym                                         26.4Ki   1.1%
       0.0%       0 .debug_pubtypes                                 17.6Ki   0.7%
       2.5%  7.63Ki .eh_frame_hdr                                   7.63Ki   0.3%
       2.3%  6.91Ki .rodata                                         6.91Ki   0.3%
       1.0%  3.13Ki .bss
  [...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here we can see that Bloaty has figured out what part
of each section (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_*&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.text&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.eh_frame&lt;/code&gt;, etc.)
it can attribute to each source file.  Bloaty has
constructed a very granular look into this binary, where
each part of the file is attributed to the code that
produced it.&lt;/p&gt;

&lt;p&gt;I generally see 2% or less of the binary attributed to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[None]&lt;/code&gt; now.  Actually Bloaty never spits out a literal
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[None]&lt;/code&gt; anymore, because if we can’t figure out what
function/compileunit/etc. some part of the binary comes
from, we at least report its section.  So if we’re stumped
by some file range, we’ll report something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[section
.rodata]&lt;/code&gt; instead of the very unhelpful &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[None]&lt;/code&gt;.&lt;/p&gt;
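
&lt;p&gt;The fallback rule can be sketched in a few lines of Python.
This is only an illustration of the labeling idea, not how Bloaty
implements it: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;label_for&lt;/code&gt; helper and the sample
file ranges below are hypothetical.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import Counter


def label_for(compile_unit, section):
    """Prefer the precise answer; otherwise name the containing section."""
    if compile_unit is not None:
        return compile_unit
    return "[section {}]".format(section)


# (size in bytes, compile unit or None, containing section): made-up data.
file_ranges = [
    (4096, "descriptor.cc", ".text"),
    (512, None, ".rodata"),
    (256, "descriptor.cc", ".rodata"),
    (128, None, ".rodata"),
]

# Sum the bytes attributed to each label, largest first.
totals = Counter()
for size, compile_unit, section in file_ranges:
    totals[label_for(compile_unit, section)] += size

for label, size in totals.most_common():
    print(label, size)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The two unattributed ranges are grouped under a single
section label, so the report still says where those bytes live.&lt;/p&gt;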

&lt;h1 id=&quot;debugging-stripped-binaries&quot;&gt;Debugging Stripped Binaries&lt;/h1&gt;

&lt;p&gt;People often want to profile stripped binaries.  Very often
the binaries you ship to customers don’t have full debug
info in them, and you want to profile what you are shipping.
But some of Bloaty’s more useful data sources
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compileunits&lt;/code&gt; especially) require debug information.  What
to do?&lt;/p&gt;

&lt;p&gt;Bloaty now supports &lt;a href=&quot;https://github.com/google/bloaty#debugging-stripped-binaries&quot;&gt;reading symbols and debug info from
separate
files&lt;/a&gt;.
That way you can profile the thing you’re actually trying to
shrink, instead of having your results skewed with the
overhead of debugging information.&lt;/p&gt;

&lt;p&gt;Bloaty uses build IDs to make sure that the debug
information always exactly matches the file you are
profiling.&lt;/p&gt;

&lt;h1 id=&quot;first-class-mach-o-support&quot;&gt;First-class Mach-O Support&lt;/h1&gt;

&lt;p&gt;When Bloaty was first released, it parsed ELF and DWARF
directly, but shelled out to command-line programs to parse
Mach-O.  This was slow and didn’t give us as much info as we
would have liked.  As of Bloaty 1.0, we now have first-class
Mach-O support.  Both fat and single-arch binaries are
supported.&lt;/p&gt;

&lt;p&gt;DWARF is fortunately a cross-platform standard, which means
that Mach-O and ELF can share all of the code that parses
DWARF.  The code to parse DWARF is about the same size as
the ELF and Mach-O parsers combined, so it’s great that so
much of this code can be shared.&lt;/p&gt;

&lt;h1 id=&quot;experimental-webassembly-support&quot;&gt;Experimental WebAssembly Support&lt;/h1&gt;

&lt;p&gt;I am really excited about
&lt;a href=&quot;https://webassembly.org/&quot;&gt;WebAssembly&lt;/a&gt;.  I wanted to learn
more about it, so I wrote a basic parser for Bloaty.  It can
handle sections and functions so far.&lt;/p&gt;

&lt;p&gt;I am excited to see that this has been &lt;a href=&quot;https://twitter.com/FlohOfWoe/status/1021804966871224321&quot;&gt;getting some use
already&lt;/a&gt;!&lt;/p&gt;

&lt;h1 id=&quot;using-bloaty-as-a-presubmit&quot;&gt;Using Bloaty as a Presubmit&lt;/h1&gt;

&lt;p&gt;Some people might wonder how to integrate Bloaty into their
workflow.  One thing I’ve seen that’s very cool is the way
some projects like gRPC integrate Bloaty with their pull
requests.  &lt;a href=&quot;https://github.com/grpc/grpc/pull/16139#issuecomment-407849907&quot;&gt;Here is an
example&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This gives quick and useful feedback about how a given
PR will affect the binary size of your artifacts.  For
size-sensitive projects, this is a nice way of keeping tabs
and making sure PRs don’t cause unexpected or
disproportionate growth.&lt;/p&gt;

&lt;h1 id=&quot;post-10&quot;&gt;Post 1.0&lt;/h1&gt;

&lt;p&gt;Bloaty has become quite capable, but there is always more to
do.  Maybe the biggest thing on my wishlist is PE/COFF
support so people on Windows can benefit.&lt;/p&gt;

&lt;p&gt;I would also like to make Bloaty understand references
between symbols.  This would make it easier to answer
questions like “could I shrink the binary a lot by avoiding
calls to this one particular function?”  It could also show
you the benefit you could get by compiling with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ffunction-sections&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fdata-sections&lt;/code&gt; if you’re not
doing that already.  These are options that let the linker
strip individual functions if they are unreachable.&lt;/p&gt;

&lt;p&gt;I’d also like to do a better job of mapping inlines.  The
idea of the “inlines” data source is to know if the inlining
of a particular function is bloating your binary a lot.
If it is, maybe it would be helpful to un-inline it.  Right
now the “inlines” data source uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_line&lt;/code&gt;
section, which is what a debugger uses to decide what source
file:line to place the cursor on when your program is
stopped at a given address.  It would be more convenient to
report inlines by function name, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_line&lt;/code&gt;
doesn’t know anything about functions.  If I get my inlining
info from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.debug_info&lt;/code&gt; instead, I should be able to
report inlines by function.&lt;/p&gt;

&lt;p&gt;I’m happy with Bloaty 1.0 and look forward to improving it
further!&lt;/p&gt;
</description>
        <pubDate>Tue, 07 Aug 2018 00:00:00 +0000</pubDate>
        <link>https://blog.reverberate.org/2018/08/07/bloaty-1.0.html</link>
        <guid isPermaLink="true">https://blog.reverberate.org/2018/08/07/bloaty-1.0.html</guid>
        
        
      </item>
    
  </channel>
</rss>
