When I asked Mike Pall, author of LuaJIT and DynASM for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I've implemented a few of these already on my own, and already was able to increase performance by 2-3x. Implement all of them, and the JIT would likely beat the compiler.
The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.
Here are the optimizations Mike suggested (I'll try to give the idea without giving away what you'd optimize these patterns into):
- implement a peephole optimizer that optimizes runs of
<<<,>>>,+++, and---(each run can be optimized into a single instruction). - fold
PTRmovements into theinc/dec/add/subarguments themselves (use x86 addressing to avoid having to move the pointer between every+/-instruction). - optimize patterns like
[-]and[->>>+<<<], etc. - take advantage of flag propagation from
add/subto optimize awaycmpinstructions when possible (but beware ofinc/decand partial flag stalls (see section 5.8, p.58 in The microarchitecture of Intel, AMD and VIA CPUs)).
Fun exercise. :)
ReplyDeleteJust by implementing (1) and (3) I can have jitdemo run faster than bf2c/gcc -- jitdemo takes 1.06s while bf2c/gcc takes 1.18s on my machine.
https://github.com/sanjoy/jitdemo/commits/master
Very nice! :)
DeleteBy the way, DynASM lets you use arbitrary C expressions as assembly operands. So you don't have to create a temporary variable "total" (if you don't want to), you can put count_prefix() directly into the add/sub instructions.
I think I've discovered why one of your optimizations was causing crashes. Remember that in BF a loop is skipped entirely if the cell starts as "0", so when you're optimizing [-<<<+>>>] the cell three to the left may actually be out-of-bounds if the current cell is zero!
DeleteUnfortunately I found that the fix here (inserting a test/jump first) degraded performance a bit.
Okay, I had a lot of fun with this one: https://github.com/sanjoy/bfjit
ReplyDeleteIn a nutshell:
BF => generate optimized bytecode => interpret bytecode for a while => detect hot loops => compile hot loops