The grammars Gazelle can handle
Posted by josh at August 6th, 2008
I wanted to follow-up a bit more on my recent posts about the grammars Gazelle and competing parsing frameworks can handle. Here is an Euler diagram showing my current understanding of what grammars Gazelle can handle compared with the well-known Strong-LL language classes as well as competing parsing frameworks.

Gazelle can currently generate lookahead for any LL(k) grammar, all LL(*) grammars that ANTLR can handle, and a few LL(*) grammars (namely tail-recursive ones) that ANTLR cannot. ANTLR can handle all LL(k) grammars and many LL(*) ones. SLK can handle only LL(k).
Since my last entry, I gave Gazelle the capability to correctly generate lookahead for all LL(k) languages, as long as the user specifies a maximum k. This is in-line with what SLK and ANTLR do, since the problem is undecidable in general. By default I still use my heuristic that I think will detect in most cases whether a grammar will be LL(*) or LL(k) without making the user specify a maximum k, but this heuristic will be foiled in some cases. So if the user really thinks their grammar is LL(k) even though it failed the heuristic, they can specify a maximum “k” and Gazelle will keep searching until it hits that “k”.
Just in case I wasn’t clear enough in my last entry, I strongly dispute the claims on the SLK website that SLK is the only framework that does true Strong-LL(k) analysis. As you can see from the diagram, both ANTLR and Gazelle can handle not only LL(k) but a superset thereof. Current and potential SLK customers: don’t be fooled by SLK’s claims! They haven’t been true for years — ANTLR has been able to handle LL(k) for a long time!
Here are some examples of grammars that fall within the different regions of that Euler diagram:
Strong-LL(*) but can’t be handled by Gazelle or ANTLR:
s: a | b;
a -> "(" a ")" | "X";
b -> "(" b ")" | "Y";
Can be handled by Gazelle but not ANTLR:
s -> a "X" | a "Y";
a -> ("Z" a)?;
Can be handled by ANTLR but not SLK (and is not Strong-LL(k)):
s -> a "X" | a "Y";
a -> "Z"*;
A note about generalized parsing
When I have talked about what grammars ANTLR can “handle,” I’m specifically talking about the grammars it can generate LL(*) lookahead for, without the help of syntactic or semantic predicates. Put more plainly, these are the grammars that ANTLR can parse efficiently and without any hints from the user.
Why the caveats? Well you can put ANTLR and comparable tools in a mode where it parses by depth-first search, using backtracking to abandon search attempts that did not work out. This is equivalent to what most regexp implementations (Perl, Python, Ruby, etc) do, and as Russ Cox has demonstrated this technique has severe (exponential) worst case performance. If you want to trade space for time, you can memoize this depth-first search and get linear time at the cost of some space. While I think these techniques are totally legitimate as a way of expanding the grammars the tool can handle, I personally am singularly focused on parsing in linear time with very low space overhead. I am doing everything I possibly can to parse real languages (everything from JSON to C++) without going outside these bounds. That’s what’s going to make it practical to use Gazelle grammars for things like syntax highlighting in text editors.
Secondly, ANTLR has features that let the user help the parser out with its decisions, either by writing Java code (semantic predicates) or by specifying syntax fragments that the user promises will properly disambiguate alternatives (syntactic predicates). It’s always useful to have more tools of this sort, and Gazelle will certainly include some with time, but again this is an apples to oranges comparison.
Finally, Adrian Thurston (author of the popular Ragel state machine compiler) mentioned on one of my previous entries that there exist general (eg. can handle any grammar) algorithms for top-down parsing, which are automatically more powerful than Gazelle. While this is true, none of them are what I consider practical. Gazelle’s time/space complexity are:
Time: linear
(actually LL(*) could probably become quadratic in degenerate cases, but LL(k) is definitely linear and LL(*) is too, for all intents and practical purposes).
Space: max stack depth + max lookahead depth
Both can be capped. In a bit more detail:
- The maximum stack depth represents the nesting level (for example, number of nested parentheses).
- The maximum lookahead depth represents the number of tokens we have to look ahead in the worst case to disambiguate alternatives.
- Input text that doesn’t respect the maximum depth limits will be rejected (even if it is valid according to the grammar), but that’s the only way to prevent malicious input text from consuming arbitrary amounts of memory. Clients that don’t mind getting DoS’d can just set these limits to unlimited.
No general method can approach this space/time complexity. And I think it is for this reason that no tool (AFAIK) which is aimed at practical use implements a generalized top-down method.
I’d like to post a slightly more detailed run-down of the generalized methods Adrian mentioned when I have the time to take a closer look at them.
So to make a long story short, I stand by my previous claim that Gazelle is (at least briefly) the most powerful practical top-down parsing framework there is.