Thursday, January 3, 2013

Reader Challenge: Optimize the BF JIT

In my previous entry, Hello, JIT World: The Joy of Simple JITs, the final example was a JIT for BF. This JIT beat an optimizing interpreter, but lost to a translation to C (which I compiled with gcc). I speculated that the reason was that my JIT didn't have a register allocator (or perform any optimizations at all).

When I asked Mike Pall, author of LuaJIT and DynASM, for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I've implemented a few of these myself and was already able to increase performance by 2-3x. Implement all of them, and the JIT would likely beat the compiler.

The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.

Here are the optimizations Mike suggested (I'll try to give the idea without giving away what you'd optimize these patterns into):
  1. implement a peephole optimizer that optimizes runs of <<<, >>>, +++, and --- (each run can be optimized into a single instruction).
  2. fold PTR movements into the inc/dec/add/sub arguments themselves (use x86 addressing to avoid having to move the pointer between every +/- instruction).
  3. optimize patterns like [-] and [->>>+<<<], etc.
  4. take advantage of flag propagation from add/sub to optimize away cmp instructions when possible (but beware of inc/dec and partial flag stalls; see section 5.8, p. 58 of Agner Fog's The microarchitecture of Intel, AMD and VIA CPUs).
These are just ideas -- not all of them are necessary for good performance and there are almost certainly other ideas I haven't listed here. Try it out -- it's fun!
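To make idea (1) concrete, here is a minimal sketch in C of the run-folding step. The Op struct and fold_runs() are my own illustrative names, not code from the jitdemo repository:

```c
#include <assert.h>
#include <stddef.h>

/* One folded operation: the BF opcode plus how many times it repeats. */
typedef struct { char op; int count; } Op;

/* Collapse runs of +, -, < and > into a single counted op; every other
 * byte passes through with count 1. Returns the number of ops written. */
size_t fold_runs(const char *src, Op *out) {
  size_t n = 0;
  for (size_t i = 0; src[i]; ) {
    char c = src[i];
    if (c == '+' || c == '-' || c == '<' || c == '>') {
      int count = 0;
      while (src[i] == c) { count++; i++; }
      out[n++] = (Op){c, count};
    } else {
      out[n++] = (Op){c, 1};
      i++;
    }
  }
  return n;
}
```

In the JIT backend, a folded ('+', 3) then becomes a single add byte [PTR], 3 instead of three separate inc byte [PTR] instructions.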


  1. Fun exercise. :)

    Just by implementing (1) and (3) I can have jitdemo run faster than bf2c/gcc -- jitdemo takes 1.06s while bf2c/gcc takes 1.18s on my machine.

    1. Very nice! :)

      By the way, DynASM lets you use arbitrary C expressions as assembly operands. So you don't have to create a temporary variable "total" (if you don't want to), you can put count_prefix() directly into the add/sub instructions.
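      A hedged sketch of what that could look like in the .dasc source (count_prefix() and its signature here are hypothetical stand-ins for whatever helper counts the run):

```
case '+':
  /* DynASM accepts an arbitrary C expression as the immediate operand;
   * it is pasted into the generated dasm_put() call, so no temporary
   * "total" variable is needed. count_prefix() is a hypothetical helper
   * that consumes the run of '+' and returns its length. */
  | add byte [PTR], count_prefix(&p, '+')
  break;
```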

    2. I think I've discovered why one of your optimizations was causing crashes. Remember that in BF a loop is skipped entirely if the cell starts as "0", so when you're optimizing [-<<<+>>>] the cell three to the left may actually be out-of-bounds if the current cell is zero!

      Unfortunately I found that the fix here (inserting a test/jump first) degraded performance a bit.
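      The fix described above, written as plain C instead of emitted machine code (the function names are mine; this is a sketch of the semantics, not the actual JIT output):

```c
#include <assert.h>

/* Naive "move" for the pattern [-<<<+>>>]: touches p[-3]
 * unconditionally, which is out of bounds when p sits near the start of
 * the tape and the cell is 0 (BF would have skipped the loop entirely). */
void move_left3_naive(unsigned char *p) {
  p[-3] += *p;
  *p = 0;
}

/* Guarded form: test the cell first, mirroring the test/jump the JIT
 * has to emit before the out-of-loop access. */
void move_left3_guarded(unsigned char *p) {
  if (*p) {
    p[-3] += *p;
    *p = 0;
  }
}
```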

  2. Okay, I had a lot of fun with this one:

    In a nutshell:

    BF => generate optimized bytecode => interpret bytecode for a while => detect hot loops => compile hot loops

    1. Whoa, I must have missed this before. Way to go all out on this. :) What advantage do you gain by waiting to compile hot loops? Can you somehow optimize them better than if you compiled them up front?

    2. No specific advantage for compiling only the hot loops -- just did it for fun. :)
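    The "detect hot loops" stage above can be sketched as a per-loop counter. LoopInfo, note_loop_entry(), and HOT_THRESHOLD are illustrative names and values, not the commenter's actual code:

```c
#include <assert.h>

enum { HOT_THRESHOLD = 50 };  /* illustrative; real thresholds vary */

/* Per-loop profiling state kept alongside the bytecode. */
typedef struct {
  int head;       /* bytecode index of the loop's opening bracket */
  int hit_count;  /* how many times the interpreter reached the head */
  int compiled;   /* set once the loop has been handed to the JIT */
} LoopInfo;

/* Called by the interpreter at each loop entry; returns 1 exactly once,
 * when the loop crosses the threshold and should be compiled. */
int note_loop_entry(LoopInfo *loop) {
  if (loop->compiled) return 0;
  if (++loop->hit_count >= HOT_THRESHOLD) {
    loop->compiled = 1;  /* a real implementation would JIT the loop here */
    return 1;
  }
  return 0;
}
```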

  3. I added some optimizations using Flex to identify patterns, and got a 3.8x speedup on the mandelbrot benchmark --

    1. Very nice! I wouldn't have thought to use flex to recognize optimizable patterns. :) It crashes for me on OS X but I'll have to try it on Linux.
      You may have tried this already, but where you have:

      mov al, byte [PTR]

      Usually you want to write the following instead, to avoid a partial register stall (and in case there's junk sitting in the upper bits of the register).

      movzx eax, byte [PTR]