I’m digging back into Gazelle after a several-month dormancy. If you happen to be someone who syndicates my commits, either directly or via your GitHub news feed (don’t laugh — I have 18 “watchers” on GitHub!) you’ll notice a flurry of commits from the last few days. I’ve been working on whitespace handling, and progress is refreshingly brisk: I have things pretty well working after only a few days of work.
Whitespace handling was for a long time the biggest thing preventing Gazelle from being actually useful. I had a whitespace-handling mechanism back in 0.1, but had to rip it out in 0.2 because it was naive and premature. It’s back now, and while it looks mostly the same from the user’s perspective, it’s totally different under the hood.
Here’s the current picture of a JSON grammar, using the new whitespace facility:
// Grammar for JSON, as defined at json.org.
@start object;
object -> "{" (string ":" value) *(,) "}";
array -> "[" value *(,) "]";
str_frag -> .chars=/[^\\"]+/ |
            .unicode_char=/\\u ([0-9A-Fa-f]{4})/ |
            .backslash_char=/\\[ntbr"\/\\]/;
string -> '"' str_frag* '"';
number -> .sign="-"?
          .integer=/ 0 | ([1-9][0-9]*) /
          .decimal=/ \. [0-9]+ /?
          .exponent=/ [eE] [+-]? [0-9]+ /? ;
value -> string | number | "true" | "false" | "null" | object | array;
whitespace -> .whitespace=/[\r\n\s\t]+/;
@allow whitespace object ... number, string;
It’s the @allow statement at the end that triggers the whitespace facility (an instance of what are more generally called “subparsers”). That statement means that any time the parser is inside an object (i.e. all the time), but not inside a number or a string, whitespace can appear anywhere. More specifically, for every state in such rules, the @allow statement defines a “whitespace” transition back to the same state.
As a result, the rule graphs start looking like this:
All the whitespace transitions are a little bit annoying to look at, and I’ll definitely have to tweak the visualization to show them in a nicer way. But this gives you an idea of what’s going on.
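To make the effect of the @allow statement concrete, here is a rough sketch in Python of the transformation described above. It is not Gazelle's actual code or API: the rule references are read off the JSON grammar, but the state counts, the function names, and the (rule, index) state representation are all assumptions made up for illustration.

```python
# Which rules each rule refers to, read off the JSON grammar above.
CALLS = {
    "object": {"string", "value"},
    "array": {"value"},
    "value": {"string", "number", "object", "array"},
    "string": {"str_frag"},
    "str_frag": set(),
    "number": set(),
}

# Hypothetical number of states in each rule's graph (illustrative only).
STATE_COUNTS = {
    "object": 5, "array": 4, "value": 2,
    "string": 3, "str_frag": 2, "number": 5,
}


def rules_allowing(start, excluded, calls):
    """Return the rules reachable from `start` without entering an excluded rule."""
    seen = set()
    stack = [start]
    while stack:
        rule = stack.pop()
        if rule in seen or rule in excluded:
            continue
        seen.add(rule)
        stack.extend(calls.get(rule, ()))
    return seen


def add_self_loops(state_counts, rules, label):
    """Give every state of every listed rule a `label` transition back to itself."""
    transitions = {}
    for rule in rules:
        for index in range(state_counts[rule]):
            state = (rule, index)
            transitions[state] = {label: state}
    return transitions


# Roughly what "@allow whitespace object ... number, string;" asks for.
allowed = rules_allowing("object", {"number", "string"}, CALLS)
loops = add_self_loops(STATE_COUNTS, allowed, "whitespace")
```

Note that str_frag is only reachable through string, so under this reading it never gets a whitespace transition even though the @allow statement does not name it.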
What’s nice about this scheme is that you get to specify the whitespace rules in a simple way (“whitespace is allowed everywhere except in these certain rules”), but then the whitespace becomes a part of the same grammar data structure and is analyzed by the same algorithms as everything else.
It’s worth mentioning that this design is significantly different from the traditional design, which is to have a separate lexer simply discard whitespace and comments before they ever hit the parser. The lovely thing about doing it my way is that the whitespace becomes a part of the parse tree (if you want it to; you’ll be able to turn this off). This means:
- round-trips from text -> parse tree -> text are lossless
- therefore programs can operate on the parse tree to do whitespace-preserving transformations
- programs can also analyze the parse tree with regard to how it follows style rules
- programs like doxygen that consider comments and whitespace significant could operate on a Gazelle parse tree
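The first two points can be illustrated with a small Python sketch. This is not Gazelle and does not use its parse tree format: a flat list of (kind, text) leaves stands in for the tree, and the token pattern, function names, and sample input are all assumptions invented for the example. The only idea it demonstrates is that keeping whitespace as leaves makes the round trip lossless.

```python
import re

# Hypothetical token pattern for a tiny JSON-like subset (no escapes, no
# decimals). Whitespace is matched as a token of its own, not skipped.
TOKEN = re.compile(
    r'(?P<whitespace>[ \t\r\n]+)'
    r'|(?P<string>"[^"\\]*")'
    r'|(?P<number>-?[0-9]+)'
    r'|(?P<punct>[{}\[\]:,])'
)


def parse_flat(text):
    """Split text into (kind, text) leaves, keeping whitespace as leaves."""
    leaves = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        leaves.append((match.lastgroup, match.group()))
        pos = match.end()
    return leaves


def unparse(leaves):
    """Concatenate the leaves back into text."""
    return "".join(text for _, text in leaves)


def rename_string(leaves, old, new):
    """Replace the string leaf "old" with "new", leaving all other leaves alone."""
    return [
        (kind, f'"{new}"') if kind == "string" and text == f'"{old}"' else (kind, text)
        for kind, text in leaves
    ]


source = '{ "a" :\n    [1,  2] }'
leaves = parse_flat(source)
round_tripped = unparse(leaves)
renamed = unparse(rename_string(leaves, "a", "b"))
```

Because nothing is discarded, `round_tripped` is identical to `source`, and the rename changes only the key while the odd spacing and the line break survive untouched.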