Parsing framework status
It’s time for me to post an update about how my parsing engine is going. In a word: very well!
I’ve decided to use Lua to implement the parts that aren’t on the critical path. All the up-front work of creating NFAs, DFAs, and parsing tables isn’t on the critical path; it’s one-time overhead. Since Lua is pretty fast, garbage-collected, and quite small (~150k compiled), I think implementing these parts in it is a smart choice. Also, I can compile the Lua to byte-code and stick that in a static character array in my C library, so my library will still be entirely self-contained. It won’t be obvious to the engine’s users that Lua is involved at all.
So what have I implemented? So far, I have a large part of the lexer complete. I have written:
- NFA construction. This lets you build up NFAs by composing primitives like concatenation, alternation, and kleene star. So for example, to build an NFA for the regex ab* you could call concat(char("a"), kleene(char("b")). (Check out the code with beautiful inline comments with ASCII-art diagrams).
- NFA-to-DFA conversion. This takes advantage of the well-known result that NFAs can be simulated with DFAs. There is a bit of complication in the fact that for a lexer, you want to match any of several NFAs simultaneously (because any token could come next), and you want to know what token matched (so the final states must carry extra information).
- DFA minimization: Given the DFA produced in the NFA-to-DFA conversion, create an equivalent DFA with the minimum number of states. It turns out that the best-known algorithm for doing this (Hopcroft’s algorithm, published in 1971, which is O(n lg n)) is not very well known, and not presented in any of the textbooks I’ve been using. Even the illustrious “Dragon Book” gives the more straightforward O(n^2) (at least — I haven’t analyzed it) algorithm. The textbook Engineering a Compiler claims to give Hopcroft’s algorithm, but also gives the more straightforward and expensive variant. To figure out how to implement this algorithm, I had to look in Hopcroft’s original 1971 paper. I also went to the UW library and photocopied a 1972 article by David Gries that explains the algorithm in a bit more digestible way.
This minimization algorithm was a bit of an adventure — I don’t know if I’ve ever had to reach back more than 30 years into the literature to find the state of the art.
- a recursive-descent parser for regular expressions: it handles most of the basic things you expect from regular expressions. For now I’m going with mostly Perl-compatible syntax. The one thing I’m intentionally diverging on (for now) is that spaces are insignificant; you use \s when you want to match a space. Regular expressions are already hard enough to read — authors should at least have whitespace as a tool to make them more readable.
The code is in my public Mercurial repository. Here’s my favorite part. Total code size so far: 1156 lines (not including unit tests or code that dumps the DFA into graphviz format).
Here is an example of it in action. Here are minimal DFAs for parsing JSON. There are two: one for when you’re inside a string, and one for the rest of the time.
I created these by feeding files with regexes into this program, which dumps a graph in dot format.
require "regex_parser" require "nfa_to_dfa" require "sketches/regex_debug" require "minimize" nfas = {} while true do local line = io.read() local label = io.read() if line == nil then break end nfa = parse_regex(TokenStream:new(line)) table.insert(nfas, {nfa, label}) end dfa = nfas_to_dfa(nfas) minimal_dfa = hopcroft_minimize(dfa) print(minimal_dfa)
Ever see this? It’s cool.
http://osteele.com/tools/reanimator/
That’s quite awesome. Thanks for Graphviz! It’s been really helpful.
I don’t really know Lua so I apologize if I am reading this completely wrongly, but in the concat function in nfa_construct.lua it looks like any existing epsilon transitions from the final state of nfa1 are overwritten when you append nfa2 to it.
I guess you are assuming that nfa1 has no e-transitions from the final state, but I am not sure if that is a valid assumption to make. You could have reached the final state but not yet have consumed all the input; an e-transition back to an earlier state might still get you to the final state at the end of the input.
Likewise for the alt and rep functions, although you do a table.insert in alt for new_nfa because you know it has multiple e-transitions from the start state.
Again, I may be completely off track here; it has been a while since I looked at NFAs and I don’t know anything about your project.
Cheers!
-K
Hey Kaushik, thanks for writing. While it’s true that an arbitrary NFA could have epsilon transitions out of the final state, it is an invariant of the NFA construction I used that no final state has any transitions out of it. So the algorithms in nfa_construct.lua will work for any NFA that was generated using that construction. Perhaps I should document this assumption.