Gazelle is (at least briefly) the most powerful top-down parser generator there is

On Wednesday when I was sitting at home waiting for the cable guy, I thought very hard about top-down (LL) parsing and lookahead generation. After a few hours of thinking I had an epiphany where I very rapidly discovered several major results that I do not believe have ever been discovered before.

I have implemented several (but not all) of these discoveries in the current Git version of Gazelle, and as far as I know, this makes Gazelle currently the most powerful top-down parser in existence.

(One caveat for the above statement — when I say “powerful” I am referring to the set of grammars it can successfully generate lookahead for. Other parser generators still have features that Gazelle does not have, like syntactic and semantic predicates. And alas, Gazelle still lacks a way to ignore whitespace until the time that I redesign this capability in a way that I am happy with).

As far as practical and powerful top-down parser generators go, I am only aware of two serious contenders. In this context, “serious contenders” means “can handle LL(k) for arbitrary k or better.” They are:

  • SLK, a commercial LL(k) parser generator. I dare not download or try it out, because I don’t want to be in any way entangled with their license agreement (which prevents reverse engineering) or tainted by what I learn about it.
  • ANTLR, which is a popular open-source parser generator written in Java. Terence Parr pioneered the LL(*) technique on which Gazelle builds, and Terence deserves a lot of credit for laying the groundwork in this field. It would have taken me a lot longer to get Gazelle where it is if not for his work.

    As a side note, Despite what the tough talk on the SLK homepage says, ANTLR has been LL(k) for years (LL(*) is a superset of LL(k)).

Here are the key discoveries I made while the cable guy was assisting other customers. Apologies if some of this material is gibberish to people who aren’t deep in the field — I’ll attempt to summarize the high-level impacts of each discovery.

Computing whether a grammar is LL(k) or LL(*) is decidable, without specifying a maximum k!

What does this mean, in practical terms? Today, when you use either SLK or ANTLR, you have to specify how long they should keep searching for correct lookahead before they give up. For SLK (which supports only LL(k)) you specify a -k command-line argument which is the maximum number of lookahead tokens to consider before giving up. For ANTLR which supports both LL(k) and LL(*), you specify both a -k and a maximum recursion depth constant. (Correction from Terence Parr).

Again, what does this mean? As a user of SLK or ANTLR, you have to choose an arbitrary number for these parameters. If you choose too low, you’ll make them give up before they find the answer. If you choose too high, you’ll have to wait longer before they fail in cases where the grammar is simply not LL(k) or LL(*). This is annoying and unintuitive for a user.

What I discovered is that this problem is decidable: you can write an algorithm that will always terminate that either finds the lookahead successfully or tell the user that the grammar is not LL(*) (and why). This is much better for users, because you don’t have to worry about those constants and have to worry that you set them too low. Big win for users.

Tail recursion is a-ok

ANTLR’s algorithm currently cannot handle rules of this sort:

s -> ("X" s)?

This is a simple tail-recursive grammar that matches any number of X’s. ANTLR balks at this, but if you apply a tail-recursive optimization very much like the one implemented in imperative languages (don’t bother making a new stack frame when you re-enter the tail-recursive rule), a top-down parser can handle this just fine. Which is what Gazelle does.

This isn’t a huge deal, since the grammar is better written:

s -> "X"+;

But it’s nice to be able to handle more cases.

LL(*) can be extended to support full-LL(*) parsing

This probably won’t make much sense to anyone who’s not deep in this field. At a high level, full-LL supports more grammars than strong-LL (which is what everyone means when they say “LL”), but full-LL parsing has not ever been considered practical. I haven’t tried it out yet, but I’m pretty sure I have a way of extending LL(*) to support full-LL(*) while still being practical and efficient. Bottom line: again, we could support more grammars. For example, here is a grammar that is full-LL(2) but not LL(2):

s -> "A" a "A" "A" | "B" a "B" "A";
a -> "B"?;

Will full-LL be that important in practice? I’m guessing not. But from a theoretical perspective this result is important because I believe it will be the first practical, efficient method anyone has ever proposed for parsing full-LL.

LL(*) could handle even more grammars by using regular supersets of nonregular lookahead

LL(*) currently can only handle situations where the part of the grammar directly following a decision is regular. In some cases, that part of the grammar is not regular because it counts (for example, nested parentheses), but it doesn’t matter because all you’re really trying to do is skip past that part to a distinguishing token. For example, this grammar is from “The Definitive ANTLR Reference”:

se -> e "%" | e "!";
e -> "(" e ")" | "ID";

In this grammar, “e” is nonregular because it matches parentheses. Regular expressions cannot count and therefore cannot match parentheses. However, if we’re trying to make a prediction at the beginning of “se”, all we’re really trying to do is skip past all that junk to get to either a “%” or a “!”. A regular expression would suffice — something like /\(*ID\)*/. The regex doesn’t count, but it skips past all those nested parentheses just fine. If the parentheses aren’t properly nested, we’ll figure that out later.

What this means for Gazelle

I made all these discoveries on Wednesday. On Saturday I had a day of mind-blowing productivity and implemented the first two of them, in addition to fixing all the previously-existing shortcomings that 0.2 had. So like I mentioned, Gazelle is currently (as far as I know) the most powerful top-down parser generator there is. Specifically, no other existing top-down parser generator can, at the moment:

  • handle tail-recursive grammars
  • tell you definitively when a grammar is not LL(k) or LL(*) (others make you tweak search depth constants)

A few days ago I emailed Terence Parr (the author of ANTLR) telling him about these findings, and I expect that he will probably add these enhancements to ANTLR until he has feature parity with Gazelle. But I just wanted to take a moment to enjoy that I made a significant discovery in an important field. The second of the above points especially is just huge.

As one final closing remark, I wanted to mention a grammar that the SLK documentation presents to boast its power. It is an LL(27) grammar with 26 terminal types. Here is what they have to say about it:

The following grammar is LL(27), and has an alphabet of 26 terminals. So the possible number of lookahead strings is 2627, or about 160000000000000000000000000000000000000. SLK can analyze this grammar and construct a parser for it in about one second.

[snip grammar]

The -y option needs to be used because the grammar is in yacc syntax. And -k=27 or larger also needs to be specified. The output of SLK for this grammar is large, so it should be redirected to a file.

Just to demonstrate that Gazelle can handle this, I checked this grammar in Gazelle format into sketches/ll27.gzl in the source tree. If you run gzlc on it with the --no-minimize-rtns option (minimization is a nice feature that is convenient for real grammars but too inefficient for this diabolical grammar), Gazelle generates correct lookahead in half a second (and this is totally unoptimized code written in Lua, a byte-code interpreted language), doesn’t require you to specify a -k=27 option, and the entire output is about 5k.

I normally wouldn’t seek to put other parser writers down, but the SLK web page rubs me the wrong way when it falsely claims that:

SLK is the only LL(k) parser generator available. All others that claim to be LL(k) are not really LL(k) by any of the classical definitions of LL(k) found in the literature. SLK does a full LL(k) analysis of the grammar. The proprietary SLK algorithm is the only known near-solution to this NP-complete problem. Other deterministic algorithms must either limit k to a very small value, usually two, or use approximate lookahead methods that are very weak imitations of LL(k).

As I mentioned before, this has not been true for a while because of ANTLR, and it is even less true now with Gazelle, which is strictly more powerful than SLK.

Update: I made a small correction that I received from Terence Parr. He also says he’s pretty sure that determining the k for an LL(k) language is undecideable, so we’ll see, maybe there’s something I’ve missed. I feel pretty confident about my solution, but I certainly could be wrong.

This entry was posted in Gazelle. Bookmark the permalink.

6 Responses to Gazelle is (at least briefly) the most powerful top-down parser generator there is

  1. maetl says:

    Well done Josh…

    Great to see that you’re not content to let the standard dogma that “parsing and grammars are amongst the most studied and well understood areas of computer science” get in the way of making new implementations in this field…

  2. Kay Roepke says:

    That’s great news!

    I just love “tell you definitively when a grammar is not LL(k) or LL(*) (others make you tweak search depth constants)” which can be a hassle currently, especially if the cause is not immediately obvious.

    All power to innovators! :)

  3. josh says:

    @maetl: thanks! It’s funny, when I started writing Gazelle, I believed pretty strongly in that mantra (“parsing and grammars are amongst the most studied and well understood areas of computer science”). I didn’t expect to invent any new algorithms, I mainly just wanted to take existing algorithms and give them different packaging.

    But before long I was trying to figure out if I could improve on the properties of those algorithms that made them less that ideal from a user perspective. I studied the documentation from existing tools and looked at the cases they said they couldn’t handle, and tried to figure out if I could handle them. And before I knew it, I had thought of a new algorithm!

    @Kay: thanks! I agree that having the tool be able to tell you definitively that it is not LL(k) or LL(*) will make things much more user-friendly. I could still use a bit of work on error messages that give you every bit of information about *why* it was not LL(k) or LL(*). In the future I should be able to draw pretty graphs that make the problem obvious. That will take a bit of work, but it’s totally doable, and I think will make for a compelling user experience.

  4. Adrian Thurston says:

    Hi Josh, you should have a look at the generalized top-down parsing algorithms. These algorithms handle any grammar except those with left recursion. Some of them also include additional hacks for dealing with left recursion. Algorithms/technologies to look at: GRDP, Unger’s Method, Combinatorial Parsing, Packrat Parsing, DCGs and TXL, Not all of these are practical, but since they are general (barring left recursion) and LL(*) is not, any tool that implements these parsing methods must be more powerful than Gazelle.

  5. Sylvain says:

    It is unsolvable whether or not a context-free grammar is LL(k) for some k>=0: see Rosenkrantz and Stearns, Properties of deterministic top-down grammars, Information and Control 17:226–256, 1970 http://www.dx.doi.org/10.1016/S0019-9958(70)90446-8

  6. Pingback: The importance of being earnest » Josh the Outspoken

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>