Archive

Archive for July, 2008

A partial retraction to my last entry

July 29th, 2008 Josh 1 comment

From a theoretical standpoint, my last entry was incorrect when I said that my new algorithm in Gazelle can decide whether a grammar is strong-LL(k) or strong-LL(*). I had proofs against me — these problems have actually been proven undecidable.

Once I heard this result, I spent hours trying to think of a grammar that is strong-LL(k), but that Gazelle will say is not. I am fairly certain that Gazelle’s algorithm will always terminate, so if the proof is to be believed, there must be a grammar that Gazelle falsely claims is not strong-LL(k) or strong-LL(*).

After many hours, I came up with one. Here it is:

s -> "X"* "Y" "Y" "Z"| "X" c;
c -> "Y" c "Y" | "Q";

This grammar is strong-LL(5), but Gazelle falsely claims it is not strong-LL(k) for any k. ANTLR can handle it (as long as the recursion depth is set high enough).

I’m thinking, though, that this isn’t really a problem. My intuition says that it will be rare that anyone actually tries to write a grammar that fails in this way — after all, it took an evening of specifically trying to exercise this case before I could find a grammar that did. I don’t think real grammars will run into this problem.

So even though I can’t handle all of strong-LL(*), I think I can handle the grammars that will actually come up in practical use, and I still have the extremely desirable property of being able to decide whether a grammar fits or not without forcing the user to tweak recursion depth constants. So although my result isn’t as theoretically significant as I thought, I think that from a practical standpoint I’m still in good shape: I have a terminating algorithm that can handle a very useful set of grammars.

Categories: Gazelle Tags:

Gazelle is (at least briefly) the most powerful top-down parser generator there is

July 27th, 2008 Josh 5 comments

On Wednesday when I was sitting at home waiting for the cable guy, I thought very hard about top-down (LL) parsing and lookahead generation. After a few hours of thinking I had an epiphany where I very rapidly discovered several major results that I do not believe have ever been discovered before.

I have implemented several (but not all) of these discoveries in the current Git version of Gazelle, and as far as I know, this makes Gazelle currently the most powerful top-down parser in existence.

(One caveat for the above statement — when I say “powerful” I am referring to the set of grammars it can successfully generate lookahead for. Other parser generators still have features that Gazelle does not have, like syntactic and semantic predicates. And alas, Gazelle still lacks a way to ignore whitespace until the time that I redesign this capability in a way that I am happy with).

As far as practical and powerful top-down parser generators go, I am only aware of two serious contenders. In this context, “serious contenders” means “can handle LL(k) for arbitrary k or better.” They are:

  • SLK, a commercial LL(k) parser generator. I dare not download or try it out, because I don’t want to be in any way entangled with their license agreement (which prevents reverse engineering) or tainted by what I learn about it.
  • ANTLR, which is a popular open-source parser generator written in Java. Terence Parr pioneered the LL(*) technique on which Gazelle builds, and Terence deserves a lot of credit for laying the groundwork in this field. It would have taken me a lot longer to get Gazelle where it is if not for his work.

    As a side note, Despite what the tough talk on the SLK homepage says, ANTLR has been LL(k) for years (LL(*) is a superset of LL(k)).

Here are the key discoveries I made while the cable guy was assisting other customers. Apologies if some of this material is gibberish to people who aren’t deep in the field — I’ll attempt to summarize the high-level impacts of each discovery.

Computing whether a grammar is LL(k) or LL(*) is decidable, without specifying a maximum k!

What does this mean, in practical terms? Today, when you use either SLK or ANTLR, you have to specify how long they should keep searching for correct lookahead before they give up. For SLK (which supports only LL(k)) you specify a -k command-line argument which is the maximum number of lookahead tokens to consider before giving up. For ANTLR which supports both LL(k) and LL(*), you specify both a -k and a maximum recursion depth constant. (Correction from Terence Parr).

Again, what does this mean? As a user of SLK or ANTLR, you have to choose an arbitrary number for these parameters. If you choose too low, you’ll make them give up before they find the answer. If you choose too high, you’ll have to wait longer before they fail in cases where the grammar is simply not LL(k) or LL(*). This is annoying and unintuitive for a user.

What I discovered is that this problem is decidable: you can write an algorithm that will always terminate that either finds the lookahead successfully or tell the user that the grammar is not LL(*) (and why). This is much better for users, because you don’t have to worry about those constants and have to worry that you set them too low. Big win for users.

Tail recursion is a-ok

ANTLR’s algorithm currently cannot handle rules of this sort:

s -> ("X" s)?

This is a simple tail-recursive grammar that matches any number of X’s. ANTLR balks at this, but if you apply a tail-recursive optimization very much like the one implemented in imperative languages (don’t bother making a new stack frame when you re-enter the tail-recursive rule), a top-down parser can handle this just fine. Which is what Gazelle does.

This isn’t a huge deal, since the grammar is better written:

s -> "X"+;

But it’s nice to be able to handle more cases.

LL(*) can be extended to support full-LL(*) parsing

This probably won’t make much sense to anyone who’s not deep in this field. At a high level, full-LL supports more grammars than strong-LL (which is what everyone means when they say “LL”), but full-LL parsing has not ever been considered practical. I haven’t tried it out yet, but I’m pretty sure I have a way of extending LL(*) to support full-LL(*) while still being practical and efficient. Bottom line: again, we could support more grammars. For example, here is a grammar that is full-LL(2) but not LL(2):

s -> "A" a "A" "A" | "B" a "B" "A";
a -> "B"?;

Will full-LL be that important in practice? I’m guessing not. But from a theoretical perspective this result is important because I believe it will be the first practical, efficient method anyone has ever proposed for parsing full-LL.

LL(*) could handle even more grammars by using regular supersets of nonregular lookahead

LL(*) currently can only handle situations where the part of the grammar directly following a decision is regular. In some cases, that part of the grammar is not regular because it counts (for example, nested parentheses), but it doesn’t matter because all you’re really trying to do is skip past that part to a distinguishing token. For example, this grammar is from “The Definitive ANTLR Reference”:

se -> e "%" | e "!";
e -> "(" e ")" | "ID";

In this grammar, “e” is nonregular because it matches parentheses. Regular expressions cannot count and therefore cannot match parentheses. However, if we’re trying to make a prediction at the beginning of “se”, all we’re really trying to do is skip past all that junk to get to either a “%” or a “!”. A regular expression would suffice — something like /\(*ID\)*/. The regex doesn’t count, but it skips past all those nested parentheses just fine. If the parentheses aren’t properly nested, we’ll figure that out later.

What this means for Gazelle

I made all these discoveries on Wednesday. On Saturday I had a day of mind-blowing productivity and implemented the first two of them, in addition to fixing all the previously-existing shortcomings that 0.2 had. So like I mentioned, Gazelle is currently (as far as I know) the most powerful top-down parser generator there is. Specifically, no other existing top-down parser generator can, at the moment:

  • handle tail-recursive grammars
  • tell you definitively when a grammar is not LL(k) or LL(*) (others make you tweak search depth constants)

A few days ago I emailed Terence Parr (the author of ANTLR) telling him about these findings, and I expect that he will probably add these enhancements to ANTLR until he has feature parity with Gazelle. But I just wanted to take a moment to enjoy that I made a significant discovery in an important field. The second of the above points especially is just huge.

As one final closing remark, I wanted to mention a grammar that the SLK documentation presents to boast its power. It is an LL(27) grammar with 26 terminal types. Here is what they have to say about it:

The following grammar is LL(27), and has an alphabet of 26 terminals. So the possible number of lookahead strings is 2627, or about 160000000000000000000000000000000000000. SLK can analyze this grammar and construct a parser for it in about one second.

[snip grammar]

The -y option needs to be used because the grammar is in yacc syntax. And -k=27 or larger also needs to be specified. The output of SLK for this grammar is large, so it should be redirected to a file.

Just to demonstrate that Gazelle can handle this, I checked this grammar in Gazelle format into sketches/ll27.gzl in the source tree. If you run gzlc on it with the --no-minimize-rtns option (minimization is a nice feature that is convenient for real grammars but too inefficient for this diabolical grammar), Gazelle generates correct lookahead in half a second (and this is totally unoptimized code written in Lua, a byte-code interpreted language), doesn’t require you to specify a -k=27 option, and the entire output is about 5k.

I normally wouldn’t seek to put other parser writers down, but the SLK web page rubs me the wrong way when it falsely claims that:

SLK is the only LL(k) parser generator available. All others that claim to be LL(k) are not really LL(k) by any of the classical definitions of LL(k) found in the literature. SLK does a full LL(k) analysis of the grammar. The proprietary SLK algorithm is the only known near-solution to this NP-complete problem. Other deterministic algorithms must either limit k to a very small value, usually two, or use approximate lookahead methods that are very weak imitations of LL(k).

As I mentioned before, this has not been true for a while because of ANTLR, and it is even less true now with Gazelle, which is strictly more powerful than SLK.

Update: I made a small correction that I received from Terence Parr. He also says he’s pretty sure that determining the k for an LL(k) language is undecideable, so we’ll see, maybe there’s something I’ve missed. I feel pretty confident about my solution, but I certainly could be wrong.

Categories: Gazelle Tags:

Good experience with Comcast

July 26th, 2008 Josh No comments

There’s a story on Slashdot right now about how Comcast is reading blogs that mention them as a way to try and improve their image among customers. I guess Comcast has a generally bad reputation when it comes to customer service.

I just wanted to mention that I personally had a good experience with Comcast this week (not perfect, but quite good). I think that if a company is really trying to turn itself around and provide good service, they should be recognized for it, so this is my part in doing so.

This week I needed to move my internet service from my old apartment into my new condo. I called Comcast customer service on Sunday at 6PM (+1 for having 24/7 customer service) to schedule an appointment. The 800 number made me choose 2 or 3 menu options (+/-0) after which I talked to a human being almost immediately (+1). The person I talked to was friendly and helpful (+1), though the first available appointment wasn’t until Wednesday 8-12 (+/-0). The cs rep gave me a discount on the visit which is apparently $100 and cut it down to $10 for me (+1!).

On Wednesday I stayed home from work from 8-12 waiting for the rep. I didn’t wake up until about 8:30am and then I took a shower, so by nine I was slightly concerned by the possibility that I could have missed the field tech. I just didn’t want to find at noon that I had waited four hours in vain and would have to stay home from work for another four hours on a different day. So I called the 800 number again, again was able to talk to a rep almost immediately, who again was friendly and helpful and able to answer my question right away — they could confirm that the tech hadn’t visited me yet (+1). So I waited at home until noon.

By noon the field tech still hadn’t arrived (-1). I called the 800 number, again almost immediately talked to someone friendly and helpful, and the rep said they’d contact the field agent and have him call me. 10 minutes later the field agent calls me and says he’ll be there in 5 minutes. 5 minutes later he’s at my doorstep.

The field agent was also friendly and helpful (+1) — he was happy to explain a lot about what he was doing, he took the time to do the job right, and he even ran speakeasy’s bandwidth test on my computer once he was done. When he left he gave me his card and said I could contact him with any problems.

So it looks like Comcast is really doing the right things to turn their image around. And I think that should be recognized (at least for as long as the level of service is like what I experienced this week).

Categories: Uncategorized Tags:

416 Random People with RoR on their resume + Reply All = Reverse Flash Mob

July 17th, 2008 Josh 25 comments

The awesomest thing ever happened to me today. I was apparently one of 416 lucky people who looked promising to a probably-clueless recruiter who’s looking to hire for a Ruby on Rails job.

Date: Thu, 17 Jul 2008 21:57:59 -0500
From: “Pradipta [lastname]”
Subject: Ruby on Rails position…
To: [416 people]

I have a couple of Ruby on Rails position, wanted to know if you are interested?

Max [lastname]
Technical Recruiter

Of course, as the number of people on a thread increases, the chances that one of them will hit “reply all” approaches 1 asymptotically:

Date: Thu, 17 Jul 2008 20:10:37 -0700
From: Zack [lastname]
To: [416 people]
Subject: Re: Ruby on Rails position…

What are the positions?

Which triggers the absolute inevitability that there will be follow-ups about how people should stop hitting “reply all.”

Date: Thu, 17 Jul 2008 22:15:39 -0500
From: “Mark [lastname]”
To: “Zack [lastname]”
Cc: [416 people]
Subject: Re: Ruby on Rails position…

this is fun. it’s like a message thread i did not subscribe to. please no more ‘reply all’s. thanks.

Best Regards,

Mark [lastname]

This was amusing already. But then people started taking advantage of the fact that they had a list of 416 people who were ostensibly interested in Ruby on Rails, and started posting other job offers, conversing about who’s going to OSCON, and volunteering their efforts to remove bad email addresses from the list of 416 that the thread started with. It wasn’t long until people started suggesting that this be made into an actual email list.

It is such a fascinating thing to watch. It’s like a flash mob, except the surprise is on us, the mob participants. It’s like we were all beamed into the same virtual room by one single person who chose the group of us, and left us to figure out what to make of the situation. I can’t wait to see where this ends up.

Update (10:30pm, a scant 2.5 hours after the initial mail went out): there’s a Google Group now — they’re calling it Pradipta’s Rolodex, after the recruiter who emailed us all to begin with.

Update (1am): Pradipta’s Rolodex now has a logo — pretty hilarious.

Update (July 18, 10:40am): First off, sorry the blog was down for a while. I enabled caching (which isn’t enabled by default in Wordpress) so hopefully the site can handle the load now. Also, the real Pradipta has spoken up and apologized for the whole incident. Though I’d like to think that only the exceptionally grumpy among us are really all that upset.

Categories: Uncategorized Tags:

100 lines of C that can parse any Protocol Buffer

July 12th, 2008 Josh 5 comments

There’s lots of misinformation flying around the blogosphere about Google’s Protocol Buffers. One common claim is that you can’t parse a protobuf without having the .proto file. This is false, as demonstrated by This 100-line C program that does just that. It can parse an arbitrary protobuf into its field numbers and wire types. This is pretty closely equivalent to what you get from a generic XML parser, except that with XML you get names for the keys (elements) instead of numbers and strings instead of the four or so wire types that are defined by protocol buffers.

In both the XML and the Protocol Buffer case, you want to have more information if you’re going to actually write programs that consume application-domain data. You want documentation that specifies exactly what all the fields mean, and enough information to turn the on-the-wire values into actual numbers where appropriate. It just so happens that Protocol Buffers specify this information in a structured format called a .proto file.

Update, 12:50PM July 12: apparently I wasn’t clear enough: my 100 lines of C does not use any Protocol Buffer library. It implements the decoding itself. My point is that the format is so simple that you can parse it generically in 100 lines of C. (If you’re wondering: you’d be lucky to get within a factor of 10 for a bare-bones XML parser).

Categories: Uncategorized Tags: