Just showed up at the bytecode party
Posted by josh at September 5th, 2007
I know you’re all worried, but let me reassure you: my parsing framework isn’t dead.
I had a goal of releasing a v0.1 by Aug 10. That obviously didn’t happen, and the main culprit is that I found another hobby: sailing. As much as I love sitting alone in my apartment with a computer in my lap (and I actually do love that), sailing is pretty fun too. So there goes all that free time I might have otherwise used to blaze trails in parsing.
But the wind has been crap lately, so this labor day weekend I made some strides and came closer to having something ready to release. To put this news in context, let me show the slightly expanded architecture of the parsing engine:

What’s been added is the “bytecode” step. Let me explain.
The “Grammar Analyzer” stage is what I’ve spend the last several months writing in Lua. It does a series of things to take the input grammar and turn it into something useful:
- it parses the input grammar
- it builds NFAs (recursive ones for grammar rules, nonrecursive ones for any regular expressions)
- it turns those NFAs into DFAs
- it minimizes those DFAs
- it calculates lookahead — lookahead answers the question “what grammar transitions should I take when I see a certain terminal?”
So after all this analysis, what you’re left with is a bunch of DFAs, and parsing is really just a matter of running those dfas on the input text. That’s why the actual parser runtime will be so simple, small, and fast — all the analysis happened beforehand.
The grammar analyzer stage is more or less ready for a rough v0.1 release. It can output those DFAs, and I wrote a little Lua runtime to interpret them, which can parse JSON successfully.
So then it was time to write the runtime. The runtime needs to be in C for speed, which means that I need a way to get those DFAs that Lua spits out into C data structures. Given the non-linear structure of DFAs (states that can have an arbitrary number of transitions to an arbitrary number of other states), this turns out to be kind of complicated.
It dawned on me that I could really use an intermediate form for these DFAs. That intermediate form — some kind of bytecode — can serve as a well-defined interface between the grammar analysis and the parsing runtime. If I can dump the results of grammar analysis to a file, then I can feed it to the parsing runtime later without even needing the grammar analysis. This could also be the key to bootstrapping: if I can create bytecode for the grammar language itself, I can use the parsing engine to parse the input files to the parsing engine.
I keep calling this intermediate representation “bytecode,” but it’s different than most bytecode in an important way: it is declarative, not imperative. It is only describing a data structure (DFAs), not a series of instructions like most bytecode does. The data is high-level enough that it could even be fed into a program that just dumps the DFAs visually.
This is all to say that bytecode seems like a really good idea. On the other hand, I didn’t want to have to engineer a file format, so I looked around and found Bitcode, which is the format LLVM uses to store its bytecode. It is wonderfully application-neutral (not tied to LLVM) and space-efficient (my friends know how OCD I can get about binary sizes). So it looks great. I started writing a pure C implementation of it, since LLVM’s code is STL-using C++. I also posted to the LLVM list asking clarification on some questions, and for permission to add to the good-but-incomplete documentation.
So that’s the current status: I’m working on writing bytecode in a stable, efficient format. Once I have that, I can load it into my C runtime (which is small and simple) and start doing some fast parsing.