A syntactic dilemma (and an intro to Gazelle’s ambiguity resolution)

Posted by josh at August 10th, 2008

As I’m adding more capabilities to the Gazelle grammar description language, I’ve come up against a problem for which I can’t find any solution I like. So I’m putting it to you, my esteemed readers, to suggest a solution I will like more than the ones I’ve already brainstormed.

Here’s the problem. I’m at a point where I want to support prioritized choice in Gazelle grammars. Prioritized choice is a way of letting grammar writers explicitly resolve ambiguities that have crept into their grammar. Take the following grammar rule:

s -> "X" | "Y";

The two alternatives in this rule (X and Y) have equal priority. That doesn’t actually matter in this case, since no text can match them both, but now consider changing the rule a bit:

s -> "X" | "X";

Now there is a problem, because an X matches both alternatives. There are two legal parse trees that are both correct according to the grammar. As a result this grammar is ambiguous. A parser doesn’t have any reason to choose one over the other, because they are of equal priority.

The solution to the problem is to let the user specify which alternative should be taken when both match. If we allow users to do this by introducing prioritized choice, our solution looks very much like how Parsing Expression Grammars (also known as PEG’s) work. And the syntax that has become very standard in the PEG literature is to use “/” as the operator for prioritized choice. So our ambiguous grammar would be made unambiguous by using prioritized choice:

s -> "X" / "X";

Now the parser will always choose the first alternative, and the grammar is no longer ambiguous. You might wonder “what’s the point?”, since the second alternative will never be taken, but in many other cases of ambiguity, the choice of lesser priority will be taken. The classic example of ambiguity (which can be resolved in this way) is if-then-else:

stmt -> "if" expr "then" stmt |
        "if" expr "then" stmt "else" stmt |
        expr;
expr -> "true" | "false" | /[0-9]+/;

This grammar is ambiguous, because the fragment:

if true then if true then 1 else 2

…can be parsed in two different ways. The “else” could be assigned to either “if” — both are valid according to the grammar.

So I definitely know I will be introducing prioritized choice. The question is what syntax to use. Like I mentioned before, I would really really like to use “/” since there is a strong convention from the PEG world for it. But unfortunately this introduces a major ambiguity in Gazelle’s own grammar, because it conflicts with the syntax for regex literals. What does this mean?

s -> a / b;
a -> c / d;

Is that two rules that both use prioritized choice, or is it one rule that has a giant regular expression in it?

s -> a / b;
a -> c /
d;

I really don’t want to give up using slashes to introduce regular expression literals, because that is a really strong convention that a lot of programmers are familiar with. But I also really don’t want to give up using the really strongly established convention of using slash for prioritized choice.

Anyone have an idea for a compromise I might like? The only one I can think of is adding a prefix to the regex literals sort of like Perl’s m{} form for regular expressions. It could be something like m/regex/. But I hate to add that uglification to the grammar, especially since regular expressions are going to be SO much more common than prioritized choice.

Any ideas, anyone?

Bonus: the other form of ambiguity resolution

Choice/alternation isn’t the only possible source of grammar ambiguity. Gazelle’s constructs for optional and/or repeating grammar fragments can also be ambiguous. The if-the-else example is perhaps more idiomatically written in Gazelle like so:

s -> "if" expr "then" stmt ("else" stmt)? | expr;

It’s written differently, but it’s just as ambiguous as before. So we need the same sort of tool to resolve this ambiguity — a way to say “prefer to match the optional part” or “prefer to NOT match the optional part.” This is equivalent to defining a “greedyness” to these operators like Perl lets you do in regexes. The syntax I am planning for these is:

s -> a b?; // no preference to match the b or not
s -> a b+; // no preference to keep matching b or not
s -> a b*; // no preference to keep matching b or not

non-greedy variants
s -> a b?-; // prefer to NOT match b (non-greedy)
s -> a b+-; // prefer to STOP matching b (non-greedy)
s -> a b*-; // prefer to STOP matching b (non-greedy)

greedy variants
s -> a b?+; // prefer to KEEP matching b (greedy)
s -> a b++; // prefer to KEEP matching b (greedy)
s -> a b*+; // prefer to KEEP matching b (greedy)

I’m not following the Perl convention here (Perl uses “?” instead of “-” to make the match non-greedy). But on the other hand:

  1. Perl doesn’t have a way to explicitly make the match greedy — matches are greedy by default. So it has no convention for a “make greedier” operator.
  2. not very many people know/use Perl’s greediness operator anyway

So to write the if-then-else example with explicit ambiguity resolution, you would write:

"if" expr "then" stmt ("else" stmt)?+ | expr;

…because when there is an ambiguity, we want to bind the else to the most recent if.

An aside: “great, so I can just start slapping greedy and non-greedy modifiers on everything, just to be safe, right?”

Not so fast there. Ambiguity resolution isn’t something you should take lightly, because a grammar with an ambiguity in it presents a user-facing irregularity in your language. For example, the if-then-else example is a user-facing issue: either “else” pairing makes logical sense, and it’s totally arbitrary how you decide to resolve the ambiguity. You have to tell your users about it. You should strive to absolutely minimize the numbers of such ambiguities in your language whenever you can!

The danger of using ambiguity resolution operators willy-nilly is that you don’t necessarily know if they’re actually resolving ambiguities. For example, if you wrote:

s -> "X" / "Y";

…the prioritized choice here is completely extraneous, since this grammar is unambiguous already. Because ambiguities are so important to keep track of and educate your users about, you don’t want to have to wonder whether a prioritized choice is resolving an ambiguity or not. You want to know that it is.

To this end, Gazelle will keep track of all ambiguity-resolution operators and error if any of them are never actually used to resolve an ambiguity. So you can’t just sprinkle them around “just to be safe,” you must only use them where they are effective. And this will make Gazelle grammars much more useful, because you will know exactly where all your ambiguities are.

By the way, Gazelle can’t detect any ambiguity in your grammar; calculating whether an arbitrary context-free grammar is ambiguous or not is undecidable. Basically, all grammars fall into one of several categories:

  1. Gazelle can handle the grammar (ie. the grammar is in the green region of the diagram in my previous entry).
  2. Gazelle can handle the grammar except for the fact that it’s ambiguous.
  3. Gazelle cannot handle the grammar, and would still not be able to even if you resolved the ambiguities.

If your grammar falls into category 3, Gazelle doesn’t know whether the grammar is ambiguous or not. But for grammars in categories 1 or 2, Gazelle knows exactly where the ambiguities are.

Bonus #2: a word about Parsing Expression Grammars

Something you might say (especially if you’re a PEG fan) is “gosh Josh, if you’re going to all this work to introduce ambiguity resolution, why not just use PEG’s flat-out and abandon all this LL(*) nonsense?” Because essentially what I’m doing is supporting both PEG-based constructs (like ordered choice) and context-free grammar constructs like non-ordered choice. And PEG’s support a larger set of grammars than Gazelle ever will (I believe they support any non-left-recursive PEG, though I don’t have a reference for this).

There are two really major reasons why I don’t think PEG’s are the way to go. The first one is related to the point I was making before. With PEG’s all choice is prioritized choice. However, with PEG-based tools you have no idea if there are real ambiguities in your grammar or not. PEG’s are defined such that ambiguity does not exist, but this does not make if-then-else style issues go away, it just sweeps them under the rug. Even if you don’t call it “ambiguity,” it is still a user-facing issue that you as a language designer definitely want to know about! If you have the following in a PEG:

s -> a / b;

…you have no way of knowing if this represents an ambiguity, which means that you don’t know if changing the grammar to:

s -> b / a;

…will make a substantive difference or not! You’re flying blind.

The second reason is that Gazelle’s parsing algorithm will be far, far faster than packrat parsing (which is the algorithm used to parse PEG’s). Both algorithms will have linear asymptotic complexity (though as I briefly mentioned before, I believe that LL(*) grammars can degrade to n^2 in degenerate cases; though I will argue that this will never happen in any sane language). But though both are linear, packrat parsing’s constant factor is far higher, and packrat parsing uses O(n) space where n is the size of input.

This might all sound like hot air, but check out these real, hard numbers. On the homepage of the Aurochs parser generator, which uses PEG’s and packrat parsing to parse JavaScript, they say:

How much memory does it use?

An awful lot.

Grammar Input size Memory consumption CPU time Memory overhead
Javascript 140kb 380Mb 1s 2700
Javascript 71kb 180Mb 0.45s 2500
Javascript 1717 5.6Mb 0.01s 3200

I can’t vouch for whether this is a great implementation of packrat parsing, but the authors sure seem to know what they’re doing. I hope you’ll agree that 380Mb of memory just to parse 140kb of text is horrendous, and 1s is quite slow too. Gazelle can’t parse JavaScript yet, but here is a rough ballpark estimate of what the performance will be like to parse 140kb of text once it can:

  • memory: Gazelle uses 24 bytes per entry on its stack, plus 12 bytes for every token of lookahead. These are the only non-fixed costs. If we assume a parse tree that goes 100 deep, and an input that requires 100 tokens of lookahead (both ridiculously large assumptions), Gazelle’s memory footprint is 3.6kb. This is 0.0009% of the packrat parser’s memory requirement.
  • CPU: Gazelle currently parses JSON at 18.8Mb/s. JavaScript will be more complicated than JSON to parse, but Gazelle will also undergo a lot of optimization that hasn’t been done yet, so I’ll go with the 18.8Mb/s figure. That’s 134x faster than the 140kb/s Aurochs is quoting.

Like I said, the Aurochs authors seem to be very smart — I’m not saying they’ve done a bad job. I’m just saying that if Aurochs is any indication of what we can expect from packrat parsing, it is orders of magnitude slower, and takes enormous amounts of memory even for modest amounts of input text. If anyone has better numbers to show for packrat parsing, I’m all ears.

So even though it’s more work, I’m 110% convinced that using this combination of LL(*) with explicit ambiguity resolution is the way to go, and is the best way to achieve the goals that I have set out for this project.

Posted in Gazelle| 2 Comments | 

The grammars Gazelle can handle

Posted by josh at August 6th, 2008

I wanted to follow-up a bit more on my recent posts about the grammars Gazelle and competing parsing frameworks can handle. Here is an Euler diagram showing my current understanding of what grammars Gazelle can handle compared with the well-known Strong-LL language classes as well as competing parsing frameworks.

GrammarClasses.png

Gazelle can currently generate lookahead for any LL(k) grammar, all LL(*) grammars that ANTLR can handle, and a few LL(*) grammars (namely tail-recursive ones) that ANTLR cannot. ANTLR can handle all LL(k) grammars and many LL(*) ones. SLK can handle only LL(k).

Since my last entry, I gave Gazelle the capability to correctly generate lookahead for all LL(k) languages, as long as the user specifies a maximum k. This is in-line with what SLK and ANTLR do, since the problem is undecidable in general. By default I still use my heuristic that I think will detect in most cases whether a grammar will be LL(*) or LL(k) without making the user specify a maximum k, but this heuristic will be foiled in some cases. So if the user really thinks their grammar is LL(k) even though it failed the heuristic, they can specify a maximum “k” and Gazelle will keep searching until it hits that “k”.

Just in case I wasn’t clear enough in my last entry, I strongly dispute the claims on the SLK website that SLK is the only framework that does true Strong-LL(k) analysis. As you can see from the diagram, both ANTLR and Gazelle can handle not only LL(k) but a superset thereof. Current and potential SLK customers: don’t be fooled by SLK’s claims! They haven’t been true for years — ANTLR has been able to handle LL(k) for a long time!

Here are some examples of grammars that fall within the different regions of that Euler diagram:

Strong-LL(*) but can’t be handled by Gazelle or ANTLR:
s: a | b;
a -> "(" a ")" | "X";
b -> "(" b ")" | "Y";

Can be handled by Gazelle but not ANTLR:
s -> a "X" | a "Y";
a -> ("Z" a)?;

Can be handled by ANTLR but not SLK (and is not Strong-LL(k)):
s -> a "X" | a "Y";
a -> "Z"*;

A note about generalized parsing

When I have talked about what grammars ANTLR can “handle,” I’m specifically talking about the grammars it can generate LL(*) lookahead for, without the help of syntactic or semantic predicates. Put more plainly, these are the grammars that ANTLR can parse efficiently and without any hints from the user.

Why the caveats? Well you can put ANTLR and comparable tools in a mode where it parses by depth-first search, using backtracking to abandon search attempts that did not work out. This is equivalent to what most regexp implementations (Perl, Python, Ruby, etc) do, and as Russ Cox has demonstrated this technique has severe (exponential) worst case performance. If you want to trade space for time, you can memoize this depth-first search and get linear time at the cost of some space. While I think these techniques are totally legitimate as a way of expanding the grammars the tool can handle, I personally am singularly focused on parsing in linear time with very low space overhead. I am doing everything I possibly can to parse real languages (everything from JSON to C++) without going outside these bounds. That’s what’s going to make it practical to use Gazelle grammars for things like syntax highlighting in text editors.

Secondly, ANTLR has features that let the user help the parser out with its decisions, either by writing Java code (semantic predicates) or by specifying syntax fragments that the user promises will properly disambiguate alternatives (syntactic predicates). It’s always useful to have more tools of this sort, and Gazelle will certainly include some with time, but again this is an apples to oranges comparison.

Finally, Adrian Thurston (author of the popular Ragel state machine compiler) mentioned on one of my previous entries that there exist general (eg. can handle any grammar) algorithms for top-down parsing, which are automatically more powerful than Gazelle. While this is true, none of them are what I consider practical. Gazelle’s time/space complexity are:

Time: linear
(actually LL(*) could probably become quadratic in degenerate cases, but LL(k) is definitely linear and LL(*) is too, for all intents and practical purposes).

Space: max stack depth + max lookahead depth
Both can be capped. In a bit more detail:

  • The maximum stack depth represents the nesting level (for example, number of nested parentheses).
  • The maximum lookahead depth represents the number of tokens we have to look ahead in the worst case to disambiguate alternatives.
  • Input text that doesn’t respect the maximum depth limits will be rejected (even if it is valid according to the grammar), but that’s the only way to prevent malicious input text from consuming arbitrary amounts of memory. Clients that don’t mind getting DoS’d can just set these limits to unlimited.

No general method can approach this space/time complexity. And I think it is for this reason that no tool (AFAIK) which is aimed at practical use implements a generalized top-down method.

I’d like to post a slightly more detailed run-down of the generalized methods Adrian mentioned when I have the time to take a closer look at them.

So to make a long story short, I stand by my previous claim that Gazelle is (at least briefly) the most powerful practical top-down parsing framework there is.

Posted in Gazelle| No Comments | 

A partial retraction to my last entry

Posted by josh at July 29th, 2008

From a theoretical standpoint, my last entry was incorrect when I said that my new algorithm in Gazelle can decide whether a grammar is strong-LL(k) or strong-LL(*). I had proofs against me — these problems have actually been proven undecidable.

Once I heard this result, I spent hours trying to think of a grammar that is strong-LL(k), but that Gazelle will say is not. I am fairly certain that Gazelle’s algorithm will always terminate, so if the proof is to be believed, there must be a grammar that Gazelle falsely claims is not strong-LL(k) or strong-LL(*).

After many hours, I came up with one. Here it is:

s -> "X"* "Y" "Y" "Z"| "X" c;
c -> "Y" c "Y" | "Q";

This grammar is strong-LL(5), but Gazelle falsely claims it is not strong-LL(k) for any k. ANTLR can handle it (as long as the recursion depth is set high enough).

I’m thinking, though, that this isn’t really a problem. My intuition says that it will be rare that anyone actually tries to write a grammar that fails in this way — after all, it took an evening of specifically trying to exercise this case before I could find a grammar that did. I don’t think real grammars will run into this problem.

So even though I can’t handle all of strong-LL(*), I think I can handle the grammars that will actually come up in practical use, and I still have the extremely desirable property of being able to decide whether a grammar fits or not without forcing the user to tweak recursion depth constants. So although my result isn’t as theoretically significant as I thought, I think that from a practical standpoint I’m still in good shape: I have a terminating algorithm that can handle a very useful set of grammars.

Posted in Gazelle| 1 Comment | 

Gazelle is (at least briefly) the most powerful top-down parser generator there is

Posted by josh at July 27th, 2008

On Wednesday when I was sitting at home waiting for the cable guy, I thought very hard about top-down (LL) parsing and lookahead generation. After a few hours of thinking I had an epiphany where I very rapidly discovered several major results that I do not believe have ever been discovered before.

I have implemented several (but not all) of these discoveries in the current Git version of Gazelle, and as far as I know, this makes Gazelle currently the most powerful top-down parser in existence.

(One caveat for the above statement — when I say “powerful” I am referring to the set of grammars it can successfully generate lookahead for. Other parser generators still have features that Gazelle does not have, like syntactic and semantic predicates. And alas, Gazelle still lacks a way to ignore whitespace until the time that I redesign this capability in a way that I am happy with).

As far as practical and powerful top-down parser generators go, I am only aware of two serious contenders. In this context, “serious contenders” means “can handle LL(k) for arbitrary k or better.” They are:

  • SLK, a commercial LL(k) parser generator. I dare not download or try it out, because I don’t want to be in any way entangled with their license agreement (which prevents reverse engineering) or tainted by what I learn about it.
  • ANTLR, which is a popular open-source parser generator written in Java. Terence Parr pioneered the LL(*) technique on which Gazelle builds, and Terence deserves a lot of credit for laying the groundwork in this field. It would have taken me a lot longer to get Gazelle where it is if not for his work.

    As a side note, Despite what the tough talk on the SLK homepage says, ANTLR has been LL(k) for years (LL(*) is a superset of LL(k)).

Here are the key discoveries I made while the cable guy was assisting other customers. Apologies if some of this material is gibberish to people who aren’t deep in the field — I’ll attempt to summarize the high-level impacts of each discovery.

Computing whether a grammar is LL(k) or LL(*) is decidable, without specifying a maximum k!

What does this mean, in practical terms? Today, when you use either SLK or ANTLR, you have to specify how long they should keep searching for correct lookahead before they give up. For SLK (which supports only LL(k)) you specify a -k command-line argument which is the maximum number of lookahead tokens to consider before giving up. For ANTLR which supports both LL(k) and LL(*), you specify both a -k and a maximum recursion depth constant. (Correction from Terence Parr).

Again, what does this mean? As a user of SLK or ANTLR, you have to choose an arbitrary number for these parameters. If you choose too low, you’ll make them give up before they find the answer. If you choose too high, you’ll have to wait longer before they fail in cases where the grammar is simply not LL(k) or LL(*). This is annoying and unintuitive for a user.

What I discovered is that this problem is decidable: you can write an algorithm that will always terminate that either finds the lookahead successfully or tell the user that the grammar is not LL(*) (and why). This is much better for users, because you don’t have to worry about those constants and have to worry that you set them too low. Big win for users.

Tail recursion is a-ok

ANTLR’s algorithm currently cannot handle rules of this sort:

s -> ("X" s)?

This is a simple tail-recursive grammar that matches any number of X’s. ANTLR balks at this, but if you apply a tail-recursive optimization very much like the one implemented in imperative languages (don’t bother making a new stack frame when you re-enter the tail-recursive rule), a top-down parser can handle this just fine. Which is what Gazelle does.

This isn’t a huge deal, since the grammar is better written:

s -> "X"+;

But it’s nice to be able to handle more cases.

LL(*) can be extended to support full-LL(*) parsing

This probably won’t make much sense to anyone who’s not deep in this field. At a high level, full-LL supports more grammars than strong-LL (which is what everyone means when they say “LL”), but full-LL parsing has not ever been considered practical. I haven’t tried it out yet, but I’m pretty sure I have a way of extending LL(*) to support full-LL(*) while still being practical and efficient. Bottom line: again, we could support more grammars. For example, here is a grammar that is full-LL(2) but not LL(2):

s -> "A" a "A" "A" | "B" a "B" "A";
a -> "B"?;

Will full-LL be that important in practice? I’m guessing not. But from a theoretical perspective this result is important because I believe it will be the first practical, efficient method anyone has ever proposed for parsing full-LL.

LL(*) could handle even more grammars by using regular supersets of nonregular lookahead

LL(*) currently can only handle situations where the part of the grammar directly following a decision is regular. In some cases, that part of the grammar is not regular because it counts (for example, nested parentheses), but it doesn’t matter because all you’re really trying to do is skip past that part to a distinguishing token. For example, this grammar is from “The Definitive ANTLR Reference”:

se -> e "%" | e "!";
e -> "(" e ")" | "ID";

In this grammar, “e” is nonregular because it matches parentheses. Regular expressions cannot count and therefore cannot match parentheses. However, if we’re trying to make a prediction at the beginning of “se”, all we’re really trying to do is skip past all that junk to get to either a “%” or a “!”. A regular expression would suffice — something like /\(*ID\)*/. The regex doesn’t count, but it skips past all those nested parentheses just fine. If the parentheses aren’t properly nested, we’ll figure that out later.

What this means for Gazelle

I made all these discoveries on Wednesday. On Saturday I had a day of mind-blowing productivity and implemented the first two of them, in addition to fixing all the previously-existing shortcomings that 0.2 had. So like I mentioned, Gazelle is currently (as far as I know) the most powerful top-down parser generator there is. Specifically, no other existing top-down parser generator can, at the moment:

  • handle tail-recursive grammars
  • tell you definitively when a grammar is not LL(k) or LL(*) (others make you tweak search depth constants)

A few days ago I emailed Terence Parr (the author of ANTLR) telling him about these findings, and I expect that he will probably add these enhancements to ANTLR until he has feature parity with Gazelle. But I just wanted to take a moment to enjoy that I made a significant discovery in an important field. The second of the above points especially is just huge.

As one final closing remark, I wanted to mention a grammar that the SLK documentation presents to boast its power. It is an LL(27) grammar with 26 terminal types. Here is what they have to say about it:

The following grammar is LL(27), and has an alphabet of 26 terminals. So the possible number of lookahead strings is 2627, or about 160000000000000000000000000000000000000. SLK can analyze this grammar and construct a parser for it in about one second.

[snip grammar]

The -y option needs to be used because the grammar is in yacc syntax. And -k=27 or larger also needs to be specified. The output of SLK for this grammar is large, so it should be redirected to a file.

Just to demonstrate that Gazelle can handle this, I checked this grammar in Gazelle format into sketches/ll27.gzl in the source tree. If you run gzlc on it with the --no-minimize-rtns option (minimization is a nice feature that is convenient for real grammars but too inefficient for this diabolical grammar), Gazelle generates correct lookahead in half a second (and this is totally unoptimized code written in Lua, a byte-code interpreted language), doesn’t require you to specify a -k=27 option, and the entire output is about 5k.

I normally wouldn’t seek to put other parser writers down, but the SLK web page rubs me the wrong way when it falsely claims that:

SLK is the only LL(k) parser generator available. All others that claim to be LL(k) are not really LL(k) by any of the classical definitions of LL(k) found in the literature. SLK does a full LL(k) analysis of the grammar. The proprietary SLK algorithm is the only known near-solution to this NP-complete problem. Other deterministic algorithms must either limit k to a very small value, usually two, or use approximate lookahead methods that are very weak imitations of LL(k).

As I mentioned before, this has not been true for a while because of ANTLR, and it is even less true now with Gazelle, which is strictly more powerful than SLK.

Update: I made a small correction that I received from Terence Parr. He also says he’s pretty sure that determining the k for an LL(k) language is undecideable, so we’ll see, maybe there’s something I’ve missed. I feel pretty confident about my solution, but I certainly could be wrong.

Posted in Gazelle| 4 Comments | 

Why Gazelle Matters, part 2

Posted by josh at July 9th, 2008

Every day when I read the programming reddit, I see things that reaffirm to me why Gazelle matters.

Yesterday it was an “ask reddit”: Need a C library for parsing C files, suggestions?. Responses include:

  • “How about gcc?”, clearly not realizing that gcc is (1) not a library and (2) ridiculously complicated.
  • “Here are an ANSI C grammar for lex and yacc. […] (NB: These were last updated in 1995, so I don’t know if you’ll need to tweak them any. But, at least they’ll get you close […])”. I am constantly surprised that people do not realize how utterly useless it is to have software of unknown quality. Can you imagine asking for an implementation of SHA1, and having someone hand you some code and saying “I’m not sure if it’s quite right, but it should at least get you part of the way there.” You don’t know where the possible problems are, you don’t know anything about its design process — you might as well start from scratch, that’s how useless it is to have half-written code.
  • “Language.C: manipulating and generating C abstract syntax from Haskell”. So here you have someone who’s doing the hard work of parsing C and making sure it’s correct, but he’s writing his library in Haskell so it only works for Haskell. Sadly useless to 99.9% of the programming world.

Someone also did mention Elsa, which is probably the best solution to the original guy’s problem, but then someone else replied:

elsa looks really good. I need code detail that the preprocessor doesn’t know about, but elsa can probably get me there. Any interest in a python wrapper?

There you go again — even when someone has done a good job (like Elsa has), it’s still useless to people parsing from other languages unless you write specific “bindings” for each language you want to parse from. Madness, pure madness! The idea that it should take N^2 work to parse N languages from N languages is madness.

The two most important design goals of Gazelle are:

  • reusable grammars: grammars that can be used by anyone without modification. Grammars that can have test suites, to ensure quality and give people confidence that they are correct.
  • grammars that you can use from any language, without bindings. Sure, you have to write bindings for Gazelle itself, but that only has to happen once, not once per language you want to parse. So parsing N languages from N languages takes N+N effort, not N^2.

Posted in Gazelle| 2 Comments | 

Next Postings »