When I asked Mike Pall, author of LuaJIT and DynASM for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I've implemented a few of these already on my own, and already was able to increase performance by 2-3x. Implement all of them, and the JIT would likely beat the compiler.
The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.
Here are the optimizations Mike suggested (I'll try to give the idea without giving away what you'd optimize these patterns into):
- implement a peephole optimizer that optimizes runs of
---(each run can be optimized into a single instruction).
PTRmovements into the
inc/dec/add/subarguments themselves (use x86 addressing to avoid having to move the pointer between every
- optimize patterns like
- take advantage of flag propagation from
add/subto optimize away
cmpinstructions when possible (but beware of
inc/decand partial flag stalls (see section 5.8, p.58 in The microarchitecture of Intel, AMD and VIA CPUs)).