Thursday, January 3, 2013

Reader Challenge: Optimize the BF JIT

In my previous entry Hello, JIT World: The Joy of Simple JITs, the final example was a JIT for BF. The performance of this JIT beat an optimizing interpreter, but lost to a translation to C (which I compiled with gcc). I speculated that the reason for this was that my JIT didn't have a register allocator (or perform any optimizations at all).

When I asked Mike Pall, author of LuaJIT and DynASM for feedback on the article, he suggested several optimizations that would be much simpler than register allocation. I've implemented a few of these already on my own, and already was able to increase performance by 2-3x. Implement all of them, and the JIT would likely beat the compiler.

The process of implementing these optimizations was extremely fun, so I want to give you, my readers, a chance to try it out before I reveal my answers. I encourage you to start by forking my GitHub repository.

Here are the optimizations Mike suggested (I'll try to give the idea without giving away what you'd optimize these patterns into):
  1. implement a peephole optimizer that optimizes runs of <<<, >>>, +++, and --- (each run can be optimized into a single instruction).
  2. fold PTR movements into the inc/dec/add/sub arguments themselves (use x86 addressing to avoid having to move the pointer between every +/- instruction).
  3. optimize patterns like [-] and [->>>+<<<], etc.
  4. take advantage of flag propagation from add/sub to optimize away cmp instructions when possible (but beware of inc/dec and partial flag stalls (see section 5.8, p.58 in The microarchitecture of Intel, AMD and VIA CPUs)).
These are just ideas -- not all of them are necessary for good performance and there are almost certainly other ideas I haven't listed here. Try it out -- it's fun!

4 comments:

  1. Fun exercise. :)

    Just by implementing (1) and (3) I can have jitdemo run faster than bf2c/gcc -- jitdemo takes 1.06s while bf2c/gcc takes 1.18s on my machine.

    https://github.com/sanjoy/jitdemo/commits/master

    ReplyDelete
    Replies
    1. Very nice! :)

      By the way, DynASM lets you use arbitrary C expressions as assembly operands. So you don't have to create a temporary variable "total" (if you don't want to), you can put count_prefix() directly into the add/sub instructions.

      Delete
    2. I think I've discovered why one of your optimizations was causing crashes. Remember that in BF a loop is skipped entirely if the cell starts as "0", so when you're optimizing [-<<<+>>>] the cell three to the left may actually be out-of-bounds if the current cell is zero!

      Unfortunately I found that the fix here (inserting a test/jump first) degraded performance a bit.

      Delete
  2. Okay, I had a lot of fun with this one: https://github.com/sanjoy/bfjit

    In a nutshell:

    BF => generate optimized bytecode => interpret bytecode for a while => detect hot loops => compile hot loops

    ReplyDelete