In my previous entry, *Hello, JIT World: The Joy of Simple JITs*, the final example was a JIT for BF. The performance of this JIT beat an optimizing interpreter, but lost to a translation to C (which I compiled with gcc). I speculated that the reason for this was that my JIT didn’t have a register allocator (or perform any optimizations at all).
When I asked Mike Pall, author of LuaJIT and DynASM, for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I’ve implemented a few of these on my own and was already able to increase performance by 2–3x. Implement all of them, and the JIT would likely beat the compiler.
The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.
Here are the optimizations Mike suggested (I’ll try to give the idea without giving away what you’d optimize these patterns into):
- implement a peephole optimizer that coalesces runs of `+`/`-`/`<`/`>` (each run can be optimized into a single instruction)
- fold `PTR` movements into the `inc`/`dec`/`add`/`sub` arguments themselves (use x86 addressing to avoid having to move the pointer between every operation)
- optimize patterns like `[-]`
- take advantage of flag propagation from `add`/`sub` to optimize away `cmp` instructions when possible (but beware of `inc`/`dec` and partial flag stalls; see section 5.8, p. 58 in *The microarchitecture of Intel, AMD and VIA CPUs*)
These are just ideas – not all of them are necessary for good performance and there are almost certainly other ideas I haven’t listed here. Try it out – it’s fun!