In my previous entry Hello, JIT World: The Joy of Simple JITs, the final example was a JIT for BF. The performance of this JIT beat an optimizing interpreter, but lost to a translation to C (which I compiled with gcc). I speculated that the reason for this was that my JIT didn’t have a register allocator (or perform any optimizations at all).

When I asked Mike Pall, author of LuaJIT and DynASM for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I’ve implemented a few of these already on my own, and already was able to increase performance by 2-3x. Implement all of them, and the JIT would likely beat the compiler.

The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.

Here are the optimizations Mike suggested (I’ll try to give the idea without giving away what you’d optimize these patterns into):

  1. implement a peephole optimizer that optimizes runs of <<<, >>>, +++, and --- (each run can be optimized into a single instruction).
  2. fold PTR movements into the inc/dec/add/sub arguments themselves (use x86 addressing to avoid having to move the pointer between every +/- instruction).
  3. optimize patterns like [-] and [->>>+<<<], etc.
  4. take advantage of flag propagation from add/sub to optimize away cmp instructions when possible (but beware of inc/dec and partial flag stalls (see section 5.8, p.58 in The microarchitecture of Intel, AMD and VIA CPUs)).

These are just ideas – not all of them are necessary for good performance and there are almost certainly other ideas I haven’t listed here. Try it out – it’s fun!