Today I am releasing Bloaty McBloatface 1.0. Bloaty is a size profiler for binaries. It helps you peek into ELF/Mach-O binaries to see what is taking up space inside.

Bloaty has gotten lots of new features, bugfixes, and overall improvements since I announced it in 2016. I listed these changes briefly on the release page, but I wanted to go into a bit more detail here.

Improving Data Quality

Perhaps the biggest overall improvement to Bloaty is its data quality. When I first announced Bloaty, I got very understandable complaints like this one:

I ran it and it gives an awful lot of “[None]”:

    $ ~/d/bloaty/bloaty builder/virt-builder -d compileunits
         VM SIZE                            FILE SIZE
     --------------                      --------------
      75.5%  1.96Mi [None]                3.67Mi  85.2%
       8.7%   232Ki guestfs-c-actions.c    232Ki   5.3%
       8.2%   219Ki             219Ki   5.0%
       2.0%  52.4Ki [Other]               52.4Ki   1.2%
       1.3%  33.7Ki _none_                33.7Ki   0.8%
       0.7%  17.5Ki  17.5Ki   0.4%
       0.6%  17.3Ki            17.3Ki   0.4%
       0.4%  11.8Ki      11.8Ki   0.3%
       0.4%  10.4Ki            10.4Ki   0.2%
       0.3%  7.08Ki          7.08Ki   0.2%
       0.2%  6.21Ki index-scan.c          6.21Ki   0.1%
       0.2%  5.90Ki       5.90Ki   0.1%
       0.2%  5.15Ki         5.15Ki   0.1%
       0.2%  4.87Ki getopt-c.c            4.87Ki   0.1%

It’s a mixed OCaml/C executable, but I ran it on a build from the local directory and all debug symbols are still available.

Indeed, a profiler tool that has no idea what to say about 85.2% of the binary is not going to be very useful. This was Bloaty’s biggest weakness when I first released it.

At first I misunderstood the nature of this problem. Bloaty’s design at the time was simple: it was reading .debug_aranges to assign ranges of the binary to compilation units. DWARF’s .debug_aranges section is an {address range -> compileunit} map that debuggers use to decide what compile unit a given function or data variable is from, given its address.
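
To make the idea of an {address range -> compileunit} map concrete, here is a minimal Python sketch. It is not Bloaty's actual code (Bloaty is written in C++), and the ranges and file names are invented for illustration; real entries would come from parsing .debug_aranges.

```python
import bisect

# Hypothetical (start, end, compile_unit) ranges, sorted by start address and
# non-overlapping. Real entries would come from parsing .debug_aranges.
ARANGES = [
    (0x1000, 0x1800, "main.c"),
    (0x1800, 0x2400, "parser.c"),
    (0x3000, 0x3100, "util.c"),
]
STARTS = [start for start, _, _ in ARANGES]


def compile_unit_for(addr):
    """Return the compile unit whose range contains addr, or None."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return None
    start, end, unit = ARANGES[i]
    return unit if addr < end else None


def covered_fraction(total_size):
    """Fraction of total_size bytes that some range accounts for."""
    return sum(end - start for start, end, _ in ARANGES) / total_size
```

With this table, compile_unit_for(0x1800) returns "parser.c", while an address such as 0x2800 that falls in a gap between ranges returns None. A profiler built only on such a map has nothing to say about the gaps.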

The output above indicates that .debug_aranges was only covering about 15% of the binary. What gives? My theory at the time was that .debug_aranges should theoretically be covering the whole binary, but was pretty incomplete for some reason. It seemed like a compiler problem that I was going to have to work around somehow.

Later I realized that .debug_aranges is only meant for identifying addresses of functions or data. Large portions of the binary are not functions or program data! For example, ELF/Mach-O binaries have all sorts of stuff in them like:

  • symbol tables
  • relocations
  • debug information
  • unwind information

To get better results, I needed to find a way to break down these sections. I needed to determine which parts of the unwind information, for example, I could attribute to each function.

To achieve this, I had to parse the binary more thoroughly than I had before. I had to learn to parse unwind information (.eh_frame and .eh_frame_hdr sections), which is a really esoteric and low-level thing to be doing. I’ll quote my comment in the code about how tricky this is to do correctly:

// Code to read the .eh_frame section.  This is not technically DWARF, but it
// is similar to .debug_frame (which is DWARF) so it's convenient to put it
// here.
// The best documentation I can find for this format comes from:
// *
// *
// However these are both under-specified.  Some details are not mentioned in
// either of these (for example, the fact that the function length uses the FDE
// encoding, but always absolute).  libdwarf's implementation contains a comment
// saying "It is not clear if this is entirely correct".  Basically the only
// thing you can trust for some of these details is the code that actually
// implements unwinding in production:
// * libunwind
// * LLVM libunwind (a different project!!)
// * libgcc

Once I implemented this parser, I could attribute the .eh_frame and .eh_frame_hdr sections properly, and they would no longer show up as [None].
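
The attribution step itself is simple once the records are decoded. The Python sketch below assumes the hard part is already done: the FDE records, their offsets, and the function ranges are all made-up values, and the helper names are hypothetical rather than anything from Bloaty's source. Each unwind record is charged to whichever function contains the first code address it describes.

```python
from collections import defaultdict

# Hypothetical, already-decoded FDE records: where each record sits in the
# file, and the first code address it describes. Decoding these from real
# .eh_frame bytes is the hard part and is not shown here.
FDES = [
    {"file_offset": 0x5000, "file_size": 0x20, "pc_begin": 0x1000},
    {"file_offset": 0x5020, "file_size": 0x28, "pc_begin": 0x1040},
    {"file_offset": 0x5048, "file_size": 0x18, "pc_begin": 0x9000},
]

# Hypothetical (start, end, name) code ranges, e.g. from the symbol table.
FUNCTIONS = [(0x1000, 0x1040, "parse"), (0x1040, 0x1100, "emit")]


def owner_of(addr):
    """Return the function containing addr, or a section-level fallback."""
    for start, end, name in FUNCTIONS:
        if start <= addr < end:
            return name
    return "[section .eh_frame]"


def attribute_unwind_bytes():
    """Sum the file size of each unwind record under its owning function."""
    sizes = defaultdict(int)
    for fde in FDES:
        sizes[owner_of(fde["pc_begin"])] += fde["file_size"]
    return dict(sizes)
```

Here the third record describes an address that no known function covers, so its 0x18 bytes stay labeled with the section name instead of a function.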

I did the same thing for all the different kinds of DWARF debug info, and for the symbol/string tables and relocations. All of these are somewhat easier since they at least have clear standards that describe them.

But even that wasn’t enough. After implementing all of the above, I still found that some parts of the data section don’t have symbol table entries or debug info at all. Data like string constants or other anonymous data can resist being properly analyzed and attributed. To combat this, Bloaty will actually disassemble the binary looking for references to the data section. If a function references part of .data or .rodata, then we can attribute that part of the binary to the function that references it.
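
A rough Python sketch of that last idea follows. It takes the disassembler's output as given, in the form of (function, referenced address) pairs, and makes two simplifying assumptions that are mine rather than anything stated about Bloaty: a blob of anonymous data runs from one referenced address to the next (or to the end of the section), and the first function seen referencing an address owns it.

```python
def attribute_anonymous_data(section_start, section_end, references):
    """Split [section_start, section_end) at each referenced address.

    references: iterable of (function_name, address) pairs, as a
    disassembler might report them. Assumption for this sketch: a blob runs
    from a referenced address to the next referenced address (or the end of
    the section), and the first function seen referencing it owns it.
    """
    first_owner = {}
    for func, addr in references:
        if section_start <= addr < section_end:
            first_owner.setdefault(addr, func)

    boundaries = sorted(first_owner) + [section_end]
    sizes = {}
    for addr, next_addr in zip(boundaries, boundaries[1:]):
        owner = first_owner[addr]
        sizes[owner] = sizes.get(owner, 0) + (next_addr - addr)

    # Bytes before the first referenced address stay unattributed.
    unattributed = boundaries[0] - section_start
    return sizes, unattributed
```

For a section spanning 0x2000 to 0x2080 with references from "main" to 0x2010 and from "log" to 0x2040, this charges 48 bytes to "main" and 64 bytes to "log", and leaves the first 16 bytes unattributed.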

This was hard and detailed work, but it paid off. We can see the fruits of this labor if we do a hierarchical profile:

$ ./bloaty bloaty -d compileunits,sections
     VM SIZE                                                      FILE SIZE
 --------------                                                --------------
  44.9%  2.07Mi [136 Others]                                    8.72Mi  33.7%
   6.0%   281Ki protobuf/src/google/protobuf/      4.07Mi  15.7%
       0.0%       0 .debug_str                                      1.16Mi  28.4%
       0.0%       0 .debug_info                                     1.01Mi  24.8%
       0.0%       0 .debug_loc                                       766Ki  18.4%
       0.0%       0 .debug_pubnames                                  383Ki   9.2%
      69.6%   195Ki .text                                            195Ki   4.7%
       0.0%       0 .debug_line                                      177Ki   4.3%
       0.0%       0 .debug_pubtypes                                  158Ki   3.8%
       0.0%       0 .debug_ranges                                    131Ki   3.2%
       0.0%       0 .strtab                                         44.6Ki   1.1%
      14.6%  41.2Ki .dynstr                                         41.2Ki   1.0%
       7.1%  19.8Ki .eh_frame                                       19.8Ki   0.5%
       4.6%  12.8Ki .rodata                                         12.8Ki   0.3%
       0.0%       0 .symtab                                         9.45Ki   0.2%
       3.1%  8.62Ki .dynsym                                         8.62Ki   0.2%
       1.0%  2.79Ki .eh_frame_hdr                                   2.79Ki   0.1%
       0.0%      88 .bss                                                 0   0.0%
   6.5%   306Ki protobuf/src/google/protobuf/   2.38Mi   9.2%
       0.0%       0 .debug_info                                      660Ki  27.1%
       0.0%       0 .debug_loc                                       620Ki  25.4%
       0.0%       0 .debug_str                                       256Ki  10.5%
       0.0%       0 .debug_pubnames                                  166Ki   6.8%
       0.0%       0 .debug_line                                      163Ki   6.7%
      53.2%   163Ki .text                                            163Ki   6.7%
       0.0%       0 .debug_ranges                                    154Ki   6.3%
       0.0%       0 .strtab                                         71.1Ki   2.9%
      22.3%  68.3Ki .dynstr                                         68.3Ki   2.8%
      10.0%  30.8Ki .eh_frame                                       30.8Ki   1.3%
       0.0%       0 .symtab                                         27.2Ki   1.1%
       8.6%  26.4Ki .dynsym                                         26.4Ki   1.1%
       0.0%       0 .debug_pubtypes                                 17.6Ki   0.7%
       2.5%  7.63Ki .eh_frame_hdr                                   7.63Ki   0.3%
       2.3%  6.91Ki .rodata                                         6.91Ki   0.3%
       1.0%  3.13Ki .bss

Here we can see that Bloaty has figured out what part of each section (.debug_*, .text, .eh_frame, etc.) it can attribute to each source file. Bloaty has constructed a very granular look into this binary, where each part of the file is attributed to the code that produced it.

I generally see 2% or less of the binary attributed to [None] now. Actually Bloaty never spits out a literal [None] anymore, because if we can’t figure out what function/compileunit/etc. some part of the binary comes from, we at least report its section. So if we’re stumped by some file range, we’ll report something like [section .rodata] instead of the very unhelpful [None].
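
The fallback can be pictured as a two-level lookup. This Python sketch uses invented ranges and hypothetical helper names; the "[Unmapped]" label for bytes outside every section is just a placeholder chosen for this example.

```python
# Hypothetical (start, end, label) file ranges.
ATTRIBUTED = [(0x400, 0x480, "main.c")]
SECTIONS = [(0x400, 0x600, ".text"), (0x600, 0x700, ".rodata")]


def find(ranges, offset):
    """Return the label of the range containing offset, or None."""
    for start, end, label in ranges:
        if start <= offset < end:
            return label
    return None


def label_for(offset):
    """Prefer a precise attribution; otherwise fall back to the section."""
    label = find(ATTRIBUTED, offset)
    if label is not None:
        return label
    section = find(SECTIONS, offset)
    if section is not None:
        return f"[section {section}]"
    return "[Unmapped]"  # placeholder label for this sketch
```

An offset inside .text that no compile unit claims comes back as "[section .text]", which still tells you where to start looking.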

Debugging Stripped Binaries

People often want to profile stripped binaries. Very often the binaries you ship to customers don’t have full debug info in them, and you want to profile what you are shipping. But some of Bloaty’s more useful data sources (compileunits especially) require debug information. What to do?

Bloaty now supports reading symbols and debug info from separate files. That way you can profile the thing you’re actually trying to shrink, instead of having your results skewed with the overhead of debugging information.

Bloaty uses build IDs to make sure that the debug information always exactly matches the file you are profiling.
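
On ELF, the GNU build ID is stored as a note, commonly in a section named .note.gnu.build-id. The Python sketch below shows what matching on it could look like. It assumes little-endian notes with 4-byte alignment and takes the raw note bytes as input; the function names are hypothetical, and how Bloaty itself reads and compares build IDs is not covered in this post.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build IDs


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def build_id_from_notes(data):
    """Scan the raw bytes of an ELF note section for a GNU build ID.

    Assumes little-endian notes with 4-byte alignment. Returns the ID as a
    hex string, or None if no build ID note is present.
    """
    pos = 0
    while pos + 12 <= len(data):
        namesz, descsz, note_type = struct.unpack_from("<III", data, pos)
        pos += 12
        name = data[pos:pos + namesz]
        pos += align4(namesz)
        desc = data[pos:pos + descsz]
        pos += align4(descsz)
        if name == b"GNU\0" and note_type == NT_GNU_BUILD_ID:
            return desc.hex()
    return None


def same_build(binary_notes, debug_notes):
    """True only if both files carry a build ID and the IDs are identical."""
    build_id = build_id_from_notes(binary_notes)
    return build_id is not None and build_id == build_id_from_notes(debug_notes)
```

Requiring an exact match means a debug file from a slightly different build is rejected rather than silently producing misleading attributions.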

First-class Mach-O Support

When Bloaty was first released, it parsed ELF and DWARF directly, but shelled out to command-line programs to parse Mach-O. This was slow and didn’t give us as much info as we would have liked. As of Bloaty 1.0, we now have first-class Mach-O support. Both fat and single-arch binaries are supported.
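
Telling these file types apart starts with the first four bytes. Here is a small Python sketch using the magic numbers from Apple's Mach-O headers and the ELF specification. The classify function and its return strings are made up for this example, and the list is not exhaustive (it ignores the 64-bit fat variant, and Java class files happen to share the fat magic).

```python
import struct

MH_MAGIC, MH_MAGIC_64 = 0xFEEDFACE, 0xFEEDFACF  # single-arch Mach-O
FAT_MAGIC = 0xCAFEBABE  # fat (multi-arch) header, stored big-endian


def classify(header):
    """Classify a file from its first four bytes."""
    if len(header) < 4:
        return "unknown"
    if header[:4] == b"\x7fELF":
        return "ELF"
    big, = struct.unpack(">I", header[:4])
    little, = struct.unpack("<I", header[:4])
    if big == FAT_MAGIC:
        return "Mach-O (fat)"
    # A single-arch file may be written in either byte order.
    if MH_MAGIC in (big, little):
        return "Mach-O (32-bit)"
    if MH_MAGIC_64 in (big, little):
        return "Mach-O (64-bit)"
    return "unknown"
```

A fat file is a container of single-arch images, so a real parser would go on to read the fat header's architecture table and then handle each embedded image as its own Mach-O file.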

DWARF is fortunately a cross-platform standard, which means that Mach-O and ELF can share all of the code that parses DWARF. The code to parse DWARF is about the same size as the ELF and Mach-O parsers combined, so it’s great that so much of this code can be shared.

Experimental WebAssembly Support

I am really excited about WebAssembly. I wanted to learn more about it, so I wrote a basic parser for Bloaty. It can handle sections and functions so far.
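
The WebAssembly binary format makes the section level pleasantly easy to walk: an 8-byte header (magic plus version), then a sequence of sections, each with a one-byte ID and a LEB128-encoded size. This Python sketch illustrates that layout; it is not Bloaty's parser, and the function names are my own.

```python
def read_uleb128(data, pos):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def wasm_sections(data):
    """Yield (section_id, payload_size) for each section in a module."""
    if data[:4] != b"\0asm":
        raise ValueError("not a WebAssembly binary")
    pos = 8  # skip the 4-byte magic and 4-byte version
    while pos < len(data):
        section_id = data[pos]
        size, pos = read_uleb128(data, pos + 1)
        yield section_id, size
        pos += size  # jump over the payload to the next section
```

Summing the sizes per section ID already gives a coarse size profile. Attributing bytes to individual functions requires decoding the payload of the code section as well.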

I am excited to see that this has been getting some use already!

Using Bloaty as a Presubmit

Some people might wonder how to integrate Bloaty into their workflow. One thing I’ve seen that’s very cool is the way some projects like grpc integrate Bloaty with their pull requests. Here is an example.

This gives quick and useful feedback about how a given PR will affect the binary size of your artifacts. For size-sensitive projects, this is a nice way of keeping tabs and making sure PRs don’t cause unexpected or disproportionate growth.
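
The core of such a check can be very small. The Python sketch below is a hypothetical example, not what grpc or any other project actually runs: it assumes you have already collected per-section (or per-symbol) sizes for the base and PR builds into dictionaries, and the 1024-byte budget is an arbitrary default.

```python
def size_regressions(old_sizes, new_sizes, max_growth_bytes=1024):
    """Return {name: delta} for entries that grew by more than the budget.

    old_sizes / new_sizes map a name (section, symbol, ...) to a byte count.
    Entries missing from one side are treated as having size 0.
    """
    deltas = {}
    for name in sorted(set(old_sizes) | set(new_sizes)):
        delta = new_sizes.get(name, 0) - old_sizes.get(name, 0)
        if delta > max_growth_bytes:
            deltas[name] = delta
    return deltas
```

A presubmit could post the returned dictionary as a PR comment, or fail the check whenever it is non-empty.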

Post 1.0

Bloaty has become quite capable, but there is always more to do. Maybe the biggest thing on my wishlist is PE/COFF support so people on Windows can benefit.

I would also like to make Bloaty understand references between symbols. This would make it easier to answer questions like “could I shrink the binary a lot by avoiding calls to this one particular function?” It could also show you the benefit you could get by compiling with -ffunction-sections and -fdata-sections if you’re not doing that already. These are options that let the linker strip individual functions if they are unreachable.
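
To illustrate the kind of question a reference graph could answer, here is a Python sketch. Bloaty does not do this today, and everything here is hypothetical: the graph maps each symbol to the symbols it references, and the estimate assumes the linker is able to drop every symbol that becomes unreachable.

```python
def reachable(graph, roots):
    """Return the set of symbols reachable from roots via references."""
    seen = set()
    stack = list(roots)
    while stack:
        sym = stack.pop()
        if sym in seen:
            continue
        seen.add(sym)
        stack.extend(graph.get(sym, ()))
    return seen


def savings_if_removed(graph, sizes, roots, victim):
    """Bytes that become unreachable if nothing references victim any more."""
    before = reachable(graph, roots)
    # Drop victim and every edge pointing at it, then recompute reachability.
    pruned = {s: [t for t in targets if t != victim]
              for s, targets in graph.items() if s != victim}
    after = reachable(pruned, [r for r in roots if r != victim])
    return sum(sizes.get(s, 0) for s in before - after)
```

The interesting cases are the ones where a small function drags in something large: removing a 4000-byte function that is the only user of a 16000-byte table would save 20000 bytes, not 4000.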

I’d also like to do a better job of mapping inlines. The idea of the “inlines” data source is to know if the inlining of a particular function is bloating your binary a lot. If it is, maybe it would be helpful to un-inline it. Right now the “inlines” data source uses the .debug_line section, which is what a debugger uses to decide what source file:line to place the cursor on when your program is stopped at a given address. It would be more convenient to report inlines by function name, but .debug_line doesn’t know anything about functions. If I get my inlining info from .debug_info instead, I should be able to report inlines by function.

I’m happy with Bloaty 1.0 and look forward to improving it further!