Posted by josh at July 27th, 2008

On Wednesday when I was sitting at home waiting for the cable guy, I thought very hard about top-down (LL) parsing and lookahead generation. After a few hours of thinking I had an epiphany where I very rapidly discovered several major results that I do not believe have ever been discovered before.

I have implemented several (but not all) of these discoveries in the current Git version of Gazelle, and as far as I know, this makes Gazelle currently the most powerful top-down parser in existence.

(One caveat for the above statement — when I say “powerful” I am referring to the set of grammars it can successfully generate lookahead for. Other parser generators still have features that Gazelle does not have, like syntactic and semantic predicates. And alas, Gazelle still lacks a way to ignore whitespace until the time that I redesign this capability in a way that I am happy with).

As far as practical and powerful top-down parser generators go, I am only aware of two serious contenders. In this context, “serious contenders” means “can handle LL(k) for arbitrary k or better.” They are:

  • SLK, a commercial LL(k) parser generator. I dare not download or try it out, because I don’t want to be in any way entangled with their license agreement (which prevents reverse engineering) or tainted by what I learn about it.
  • ANTLR, which is a popular open-source parser generator written in Java. Terence Parr pioneered the LL(*) technique on which Gazelle builds, and Terence deserves a lot of credit for laying the groundwork in this field. It would have taken me a lot longer to get Gazelle where it is if not for his work.

    As a side note, Despite what the tough talk on the SLK homepage says, ANTLR has been LL(k) for years (LL(*) is a superset of LL(k)).

Here are the key discoveries I made while the cable guy was assisting other customers. Apologies if some of this material is gibberish to people who aren’t deep in the field — I’ll attempt to summarize the high-level impacts of each discovery.

Computing whether a grammar is LL(k) or LL(*) is decidable, without specifying a maximum k!

What does this mean, in practical terms? Today, when you use either SLK or ANTLR, you have to specify how long they should keep searching for correct lookahead before they give up. For SLK (which supports only LL(k)) you specify a -k command-line argument which is the maximum number of lookahead tokens to consider before giving up. For ANTLR which supports both LL(k) and LL(*), you specify both a -k and a maximum recursion depth constant. (Correction from Terence Parr).

Again, what does this mean? As a user of SLK or ANTLR, you have to choose an arbitrary number for these parameters. If you choose too low, you’ll make them give up before they find the answer. If you choose too high, you’ll have to wait longer before they fail in cases where the grammar is simply not LL(k) or LL(*). This is annoying and unintuitive for a user.

What I discovered is that this problem is decidable: you can write an algorithm that will always terminate that either finds the lookahead successfully or tell the user that the grammar is not LL(*) (and why). This is much better for users, because you don’t have to worry about those constants and have to worry that you set them too low. Big win for users.

Tail recursion is a-ok

ANTLR’s algorithm currently cannot handle rules of this sort:

s -> ("X" s)?

This is a simple tail-recursive grammar that matches any number of X’s. ANTLR balks at this, but if you apply a tail-recursive optimization very much like the one implemented in imperative languages (don’t bother making a new stack frame when you re-enter the tail-recursive rule), a top-down parser can handle this just fine. Which is what Gazelle does.

This isn’t a huge deal, since the grammar is better written:

s -> "X"+;

But it’s nice to be able to handle more cases.

LL(*) can be extended to support full-LL(*) parsing

This probably won’t make much sense to anyone who’s not deep in this field. At a high level, full-LL supports more grammars than strong-LL (which is what everyone means when they say “LL”), but full-LL parsing has not ever been considered practical. I haven’t tried it out yet, but I’m pretty sure I have a way of extending LL(*) to support full-LL(*) while still being practical and efficient. Bottom line: again, we could support more grammars. For example, here is a grammar that is full-LL(2) but not LL(2):

s -> "A" a "A" "A" | "B" a "B" "A";
a -> "B"?;

Will full-LL be that important in practice? I’m guessing not. But from a theoretical perspective this result is important because I believe it will be the first practical, efficient method anyone has ever proposed for parsing full-LL.

LL(*) could handle even more grammars by using regular supersets of nonregular lookahead

LL(*) currently can only handle situations where the part of the grammar directly following a decision is regular. In some cases, that part of the grammar is not regular because it counts (for example, nested parentheses), but it doesn’t matter because all you’re really trying to do is skip past that part to a distinguishing token. For example, this grammar is from “The Definitive ANTLR Reference”:

se -> e "%" | e "!";
e -> "(" e ")" | "ID";

In this grammar, “e” is nonregular because it matches parentheses. Regular expressions cannot count and therefore cannot match parentheses. However, if we’re trying to make a prediction at the beginning of “se”, all we’re really trying to do is skip past all that junk to get to either a “%” or a “!”. A regular expression would suffice — something like /\(*ID\)*/. The regex doesn’t count, but it skips past all those nested parentheses just fine. If the parentheses aren’t properly nested, we’ll figure that out later.

What this means for Gazelle

I made all these discoveries on Wednesday. On Saturday I had a day of mind-blowing productivity and implemented the first two of them, in addition to fixing all the previously-existing shortcomings that 0.2 had. So like I mentioned, Gazelle is currently (as far as I know) the most powerful top-down parser generator there is. Specifically, no other existing top-down parser generator can, at the moment:

  • handle tail-recursive grammars
  • tell you definitively when a grammar is not LL(k) or LL(*) (others make you tweak search depth constants)

A few days ago I emailed Terence Parr (the author of ANTLR) telling him about these findings, and I expect that he will probably add these enhancements to ANTLR until he has feature parity with Gazelle. But I just wanted to take a moment to enjoy that I made a significant discovery in an important field. The second of the above points especially is just huge.

As one final closing remark, I wanted to mention a grammar that the SLK documentation presents to boast its power. It is an LL(27) grammar with 26 terminal types. Here is what they have to say about it:

The following grammar is LL(27), and has an alphabet of 26 terminals. So the possible number of lookahead strings is 2627, or about 160000000000000000000000000000000000000. SLK can analyze this grammar and construct a parser for it in about one second.

[snip grammar]

The -y option needs to be used because the grammar is in yacc syntax. And -k=27 or larger also needs to be specified. The output of SLK for this grammar is large, so it should be redirected to a file.

Just to demonstrate that Gazelle can handle this, I checked this grammar in Gazelle format into sketches/ll27.gzl in the source tree. If you run gzlc on it with the --no-minimize-rtns option (minimization is a nice feature that is convenient for real grammars but too inefficient for this diabolical grammar), Gazelle generates correct lookahead in half a second (and this is totally unoptimized code written in Lua, a byte-code interpreted language), doesn’t require you to specify a -k=27 option, and the entire output is about 5k.

I normally wouldn’t seek to put other parser writers down, but the SLK web page rubs me the wrong way when it falsely claims that:

SLK is the only LL(k) parser generator available. All others that claim to be LL(k) are not really LL(k) by any of the classical definitions of LL(k) found in the literature. SLK does a full LL(k) analysis of the grammar. The proprietary SLK algorithm is the only known near-solution to this NP-complete problem. Other deterministic algorithms must either limit k to a very small value, usually two, or use approximate lookahead methods that are very weak imitations of LL(k).

As I mentioned before, this has not been true for a while because of ANTLR, and it is even less true now with Gazelle, which is strictly more powerful than SLK.

Update: I made a small correction that I received from Terence Parr. He also says he’s pretty sure that determining the k for an LL(k) language is undecideable, so we’ll see, maybe there’s something I’ve missed. I feel pretty confident about my solution, but I certainly could be wrong.