Friday, February 20, 2009

one malloc to rule them all

Came across this on Reddit today:
TLSF is a general purpose dynamic memory allocator specifically designed to meet real-time requirements:



  • Bounded Response Time . The worst-case execution time (WCET) of memory allocation and deallocation has got to be known in advance and be independent of application data. TLSF has a constant cost O(1).
  • Fast. Additionally to a bounded cost, the allocator has to be efficient and fast enough. TLSF executes a maximum of 168 processor instructions in a x86 architecture. Depending on the compiler version and optimisation flags, it can be slightly lower or higher.
  • Efficient Memory Use . Traditionally, real-time systems run for long periods of time and some (embedded applications), have strong constraints of memory size. Fragmentation can have a significant impact on such systems. It can increase dramatically, and degrade the system performance. A way to measure this efficiency is the memory fragmentation incurred by the allocator. TLSF has been tested in hundreds of different loads (real-time tasks, general purpose applications, etc.) obtaining an average fragmentation lower than 15 %. The maximum fragmentation measured is lower than 25%.


So put another way, this malloc is fast, bounded in its response, and memory-efficient! Apparently it has no downsides. Just what I like to hear!

That's the crazy thing about malloc implementations. They all claim to be awesome in every way. So the burning question is: is this better than what I'm using today? I don't know, but I'd sure like to!

I think that malloc() is an under-appreciated source of slowness in many applications these days. Apple's fantastic profiler Shark even has a malloc() tracing mode to help discover malloc-related sources of slowness in a program. From the link above:
In todayĆ¢€™s large and complex software applications, it is often informative to understand the scope and pattern of memory allocations and deallocations. If you understand how your software objects are being created and destroyed, you can often improve the overall performance of your application significantly, since memory allocation is a very expensive operation in Mac OS X, especially for large blocks of memory. If your program suffers from memory leaks, Malloc Trace is also a good way to identify locations in your program that may be the culprits.


I wish I had real data about this, but I strongly suspect that the reason that Firefox (for example) slows down the longer it's open is because malloc() calls start taking longer and longer. Have you ever noticed that after you've had Firefox open for a long time, even just opening a new blank tab takes a second or two? What does opening a new tab have to do but allocate memory?

So personally I'd love to be running a real-time and O(1) malloc() on my desktop. And the web page claims that it's not only real-time but "real fast," which is even better.

This made me think of Jason Evans, author of jemalloc, which he wrote for FreeBSD to scale better across multiprocessors (since malloc() can be a source of contention). He was also working with the Firefox guys to have Firefox 3 use jemalloc, though I'm not sure if that happened or not.

In any case, Jason Evans seems to have come to the conclusion mid-last-year that using red-black trees (an O(lg n) data structure) in jemalloc was a mistake and that he should have gone with O(1) (as TLSF above does). And it sounds like he eventually changed jemalloc to use O(1) algorithms instead. From his blog entry Overzealous use of my red-black tree hammer:
Once Firefox was successfully using jemalloc for all memory allocation, I started doing performance tests. Memory usage differences were minor, but jemalloc was consistently slower than OS X's allocator. It took a lot of profiling for me to finally accept the hard truth: jemalloc was spending way too much time manipulating red-black trees. My first experimental solution was to replace red-black trees with treaps. However, this made little overall difference. So, the problem was too many tree operations, not slow tree operations.

After a bit of code review, it became clear that when I fixed a page allocation bottleneck earlier this year, I was overzealous with the application of red-black trees. It is possible to use constant-time algorithms based on linear page map data structures for splitting/coalescing sequential runs of pages, but I had re-coded these operations entirely using red-black trees. So, I enhanced the page map data structures to support splitting/coalescing, and jemalloc became markedly faster. For example, Firefox sped up by as much as ~10% on JavaScript-heavy benchmarks. (As a side benefit, memory usage went down by 1-2%).

In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.


I'm not sure if the changes he made mean that jemalloc is now real-time, but I would be interested to know. I would also be quite interested to see benchmarks between all the mallocs floating around. The ones I know of are:

  • TLSF

  • jemalloc

  • dlmalloc (Doug Lea's malloc, been around forever)

  • ptmalloc2 (Doug Lea's malloc, extended to support per-thread arenas. Default allocator for glibc2.3?)

  • TCMalloc (Google's malloc, claims to be 6x faster than ptmalloc2)

  • nedmalloc (claims to be faster than tcmalloc)

  • Hoard, also claims to be very fast



I think an open-source malloc benchmark that compared all of these under tons of different scenarios would be so so awesome to have. Everyone claims their malloc is the fastest. What is the truth? It could be that different implementations excel in different scenarios. After all, malloc performance has several dimensions:


  1. size of the allocations

  2. how the allocations are distributed across threads (for thread-safe allocators)

  3. current size of the heap

  4. max size the heap has had

  5. number of operations that have been performed (eg. how long has the app been open?)



And consequently there are several factors to measure:

  1. speed of each operation (obviously)

  2. worst-case performance

  3. memory fragmentation

  4. locality (if allocations that happen closely in time are also close in space, they will be more cache-friendly)



So dear college students who are interested in systems and want to do something awesome: create an awesome, open-source malloc benchmark suite!

4 comments:

  1. You forgot about overhead per allocation. To quote google's page:

    Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.

    I didn't find any info on TSLF's page on overhead, but I think its important, especially for dynamic languages where the VM makes lots of small allocations

    ReplyDelete
  2. Good point Matt, though you make me realize I also forgot up-front overhead (6MB for tcmalloc) and how fast memory is returned to the OS (tcmalloc is not very good about this).

    ReplyDelete
  3. You are assuming that the faults are not in Firefox. Firefox has lots of problems, including managing memory, unnecessary IO, etc., etc.

    ReplyDelete
  4. [...] a fascinating discussion regarding different types of malloc()s—and how they all effectively promise the same [...]

    ReplyDelete