A syntactic dilemma (and an intro to Gazelle’s ambiguity resolution)
Posted by josh at August 10th, 2008
As I’m adding more capabilities to the Gazelle grammar description language, I’ve come up against a problem for which I can’t find any solution I like. So I’m putting it to you, my esteemed readers, to suggest a solution I will like more than the ones I’ve already brainstormed.
Here’s the problem. I’m at a point where I want to support prioritized choice in Gazelle grammars. Prioritized choice is a way of letting grammar writers explicitly resolve ambiguities that have crept into their grammar. Take the following grammar rule:
s -> "X" | "Y";
The two alternatives in this rule (X and Y) have equal priority. That doesn’t actually matter in this case, since no text can match them both, but now consider changing the rule a bit:
s -> "X" | "X";
Now there is a problem, because an X matches both alternatives. There are two legal parse trees that are both correct according to the grammar. As a result this grammar is ambiguous. A parser doesn’t have any reason to choose one over the other, because they are of equal priority.
The solution to the problem is to let the user specify which alternative should be taken when both match. If we allow users to do this by introducing prioritized choice, our solution looks very much like how Parsing Expression Grammars (also known as PEG’s) work. And the syntax that has become very standard in the PEG literature is to use “/” as the operator for prioritized choice. So our ambiguous grammar would be made unambiguous by using prioritized choice:
s -> "X" / "X";
Now the parser will always choose the first alternative, and the grammar is no longer ambiguous. You might wonder “what’s the point?”, since the second alternative will never be taken, but in many other cases of ambiguity, the choice of lesser priority will be taken. The classic example of ambiguity (which can be resolved in this way) is if-then-else:
stmt -> "if" expr "then" stmt |
"if" expr "then" stmt "else" stmt |
expr;
expr -> "true" | "false" | /[0-9]+/;
This grammar is ambiguous, because the fragment:
if true then if true then 1 else 2
…can be parsed in two different ways. The “else” could be assigned to either “if” — both are valid according to the grammar.
So I definitely know I will be introducing prioritized choice. The question is what syntax to use. Like I mentioned before, I would really really like to use “/” since there is a strong convention from the PEG world for it. But unfortunately this introduces a major ambiguity in Gazelle’s own grammar, because it conflicts with the syntax for regex literals. What does this mean?
s -> a / b;
a -> c / d;
Is that two rules that both use prioritized choice, or is it one rule that has a giant regular expression in it?
s -> a / b;
a -> c / d;
I really don’t want to give up using slashes to introduce regular expression literals, because that is a really strong convention that a lot of programmers are familiar with. But I also really don’t want to give up using the really strongly established convention of using slash for prioritized choice.
Anyone have an idea for a compromise I might like? The only one I can think of is adding a prefix to the regex literals sort of like Perl’s m{} form for regular expressions. It could be something like m/regex/. But I hate to add that uglification to the grammar, especially since regular expressions are going to be SO much more common than prioritized choice.
Any ideas, anyone?
Bonus: the other form of ambiguity resolution
Choice/alternation isn’t the only possible source of grammar ambiguity. Gazelle’s constructs for optional and/or repeating grammar fragments can also be ambiguous. The if-the-else example is perhaps more idiomatically written in Gazelle like so:
s -> "if" expr "then" stmt ("else" stmt)? | expr;
It’s written differently, but it’s just as ambiguous as before. So we need the same sort of tool to resolve this ambiguity — a way to say “prefer to match the optional part” or “prefer to NOT match the optional part.” This is equivalent to defining a “greedyness” to these operators like Perl lets you do in regexes. The syntax I am planning for these is:
s -> a b?; // no preference to match the b or not
s -> a b+; // no preference to keep matching b or not
s -> a b*; // no preference to keep matching b or not
non-greedy variants
s -> a b?-; // prefer to NOT match b (non-greedy)
s -> a b+-; // prefer to STOP matching b (non-greedy)
s -> a b*-; // prefer to STOP matching b (non-greedy)
greedy variants
s -> a b?+; // prefer to KEEP matching b (greedy)
s -> a b++; // prefer to KEEP matching b (greedy)
s -> a b*+; // prefer to KEEP matching b (greedy)
I’m not following the Perl convention here (Perl uses “?” instead of “-” to make the match non-greedy). But on the other hand:
- Perl doesn’t have a way to explicitly make the match greedy — matches are greedy by default. So it has no convention for a “make greedier” operator.
- not very many people know/use Perl’s greediness operator anyway
So to write the if-then-else example with explicit ambiguity resolution, you would write:
"if" expr "then" stmt ("else" stmt)?+ | expr;
…because when there is an ambiguity, we want to bind the else to the most recent if.
An aside: “great, so I can just start slapping greedy and non-greedy modifiers on everything, just to be safe, right?”
Not so fast there. Ambiguity resolution isn’t something you should take lightly, because a grammar with an ambiguity in it presents a user-facing irregularity in your language. For example, the if-then-else example is a user-facing issue: either “else” pairing makes logical sense, and it’s totally arbitrary how you decide to resolve the ambiguity. You have to tell your users about it. You should strive to absolutely minimize the numbers of such ambiguities in your language whenever you can!
The danger of using ambiguity resolution operators willy-nilly is that you don’t necessarily know if they’re actually resolving ambiguities. For example, if you wrote:
s -> "X" / "Y";
…the prioritized choice here is completely extraneous, since this grammar is unambiguous already. Because ambiguities are so important to keep track of and educate your users about, you don’t want to have to wonder whether a prioritized choice is resolving an ambiguity or not. You want to know that it is.
To this end, Gazelle will keep track of all ambiguity-resolution operators and error if any of them are never actually used to resolve an ambiguity. So you can’t just sprinkle them around “just to be safe,” you must only use them where they are effective. And this will make Gazelle grammars much more useful, because you will know exactly where all your ambiguities are.
By the way, Gazelle can’t detect any ambiguity in your grammar; calculating whether an arbitrary context-free grammar is ambiguous or not is undecidable. Basically, all grammars fall into one of several categories:
- Gazelle can handle the grammar (ie. the grammar is in the green region of the diagram in my previous entry).
- Gazelle can handle the grammar except for the fact that it’s ambiguous.
- Gazelle cannot handle the grammar, and would still not be able to even if you resolved the ambiguities.
If your grammar falls into category 3, Gazelle doesn’t know whether the grammar is ambiguous or not. But for grammars in categories 1 or 2, Gazelle knows exactly where the ambiguities are.
Bonus #2: a word about Parsing Expression Grammars
Something you might say (especially if you’re a PEG fan) is “gosh Josh, if you’re going to all this work to introduce ambiguity resolution, why not just use PEG’s flat-out and abandon all this LL(*) nonsense?” Because essentially what I’m doing is supporting both PEG-based constructs (like ordered choice) and context-free grammar constructs like non-ordered choice. And PEG’s support a larger set of grammars than Gazelle ever will (I believe they support any non-left-recursive PEG, though I don’t have a reference for this).
There are two really major reasons why I don’t think PEG’s are the way to go. The first one is related to the point I was making before. With PEG’s all choice is prioritized choice. However, with PEG-based tools you have no idea if there are real ambiguities in your grammar or not. PEG’s are defined such that ambiguity does not exist, but this does not make if-then-else style issues go away, it just sweeps them under the rug. Even if you don’t call it “ambiguity,” it is still a user-facing issue that you as a language designer definitely want to know about! If you have the following in a PEG:
s -> a / b;
…you have no way of knowing if this represents an ambiguity, which means that you don’t know if changing the grammar to:
s -> b / a;
…will make a substantive difference or not! You’re flying blind.
The second reason is that Gazelle’s parsing algorithm will be far, far faster than packrat parsing (which is the algorithm used to parse PEG’s). Both algorithms will have linear asymptotic complexity (though as I briefly mentioned before, I believe that LL(*) grammars can degrade to n^2 in degenerate cases; though I will argue that this will never happen in any sane language). But though both are linear, packrat parsing’s constant factor is far higher, and packrat parsing uses O(n) space where n is the size of input.
This might all sound like hot air, but check out these real, hard numbers. On the homepage of the Aurochs parser generator, which uses PEG’s and packrat parsing to parse JavaScript, they say:
How much memory does it use?
An awful lot.
Grammar Input size Memory consumption CPU time Memory overhead Javascript 140kb 380Mb 1s 2700 Javascript 71kb 180Mb 0.45s 2500 Javascript 1717 5.6Mb 0.01s 3200
I can’t vouch for whether this is a great implementation of packrat parsing, but the authors sure seem to know what they’re doing. I hope you’ll agree that 380Mb of memory just to parse 140kb of text is horrendous, and 1s is quite slow too. Gazelle can’t parse JavaScript yet, but here is a rough ballpark estimate of what the performance will be like to parse 140kb of text once it can:
- memory: Gazelle uses 24 bytes per entry on its stack, plus 12 bytes for every token of lookahead. These are the only non-fixed costs. If we assume a parse tree that goes 100 deep, and an input that requires 100 tokens of lookahead (both ridiculously large assumptions), Gazelle’s memory footprint is 3.6kb. This is 0.0009% of the packrat parser’s memory requirement.
- CPU: Gazelle currently parses JSON at 18.8Mb/s. JavaScript will be more complicated than JSON to parse, but Gazelle will also undergo a lot of optimization that hasn’t been done yet, so I’ll go with the 18.8Mb/s figure. That’s 134x faster than the 140kb/s Aurochs is quoting.
Like I said, the Aurochs authors seem to be very smart — I’m not saying they’ve done a bad job. I’m just saying that if Aurochs is any indication of what we can expect from packrat parsing, it is orders of magnitude slower, and takes enormous amounts of memory even for modest amounts of input text. If anyone has better numbers to show for packrat parsing, I’m all ears.
So even though it’s more work, I’m 110% convinced that using this combination of LL(*) with explicit ambiguity resolution is the way to go, and is the best way to achieve the goals that I have set out for this project.
