Why Gazelle Matters
Posted by josh at January 7th, 2008
I’m not sure if I’ve managed to convince very many people that my parsing framework Gazelle is a big deal. Most people don’t think of parsing as something they do a lot of. A few quick and dirty regular expressions here and there are what most programmers get by on. “Oh, I need to validate an email address? Hrm, that’s something like /[\w\.\-]*@[\w\.]*/ right? Oh wait, the username can have escaped @-signs in it? Ah hell, my regex is close enough.”
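To make “close enough” concrete, here is a minimal Python sketch that runs the pattern quoted above against a few strings. The helper name `naive_is_email` and the sample addresses are made up for illustration, and the pattern is anchored with `fullmatch()`, which is an assumption about how such a check would typically be applied.

```python
import re

# The quick-and-dirty pattern from the paragraph above.
NAIVE_EMAIL = re.compile(r"[\w.\-]*@[\w.]*")


def naive_is_email(text):
    """Return True if the whole string matches the naive pattern."""
    return NAIVE_EMAIL.fullmatch(text) is not None


samples = [
    "user@example.com",    # ordinary address
    "@",                   # no username and no domain at all
    '"a@b"@example.com',   # quoted username containing an @-sign
]

for sample in samples:
    print(repr(sample), naive_is_email(sample))
```

The pattern accepts the bare string "@" because both halves are optional, and it rejects the quoted username even though quoted usernames may legally contain an @-sign.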
But forget about all that for a moment. Let’s forget about the crazy productivity gains you could get from just being able to pull a standard email address parsing module that works from any language. Let’s forget about all that and just talk about performance.
I want to talk about Mongrel for a second. Mongrel has been the star of the Ruby Webserver world for a long time because it vastly outperforms the painfully slow pure-Ruby server “Webrick” that ships with Ruby. Mongrel has become something of a soap opera; I won’t go into it, but while that escapade spiraled into a wild frenzy of name calling, dramatic exits, and infighting, another little webserver called Thin came up out of nowhere and poised itself to do Mongrel better than Mongrel.
I haven’t tried Thin myself or verified any of their claims, but they claim to be even faster than Mongrel. How? Well according to them, it is by using “the Mongrel parser, the root of Mongrel speed and security.”
Did you catch that?
Amidst all this drama, time, and effort, the core of Mongrel technology is a fast parser. Mongrel’s technical edge boils down to this, which is a description of the HTTP language written with the regular language parsing tool Ragel.
Imagine if parsers this fast and this powerful (more powerful actually, since Ragel can’t handle context-free grammars) were available directly from Ruby. You would be able to get the performance of Mongrel or Thin without writing any C code at all. Using the Ragel parser, on the other hand, requires writing a custom extension for Ruby, because Ragel generates plain C code.
What’s better, you wouldn’t have to write the grammar file yourself, because chances are you won’t be the first person who wants to parse HTTP.
That is why Gazelle matters. Am I making any sense?
I like this pitch. Though I think “the crazy productivity gains you could get from just being able to pull a standard email address parsing module” are perhaps overemphasized.
One other really good thing about Gazelle you didn’t mention - language notation is almost always *way* easier to read and the resultant parsed forms are more flexible.
buffalo
Did you follow the link about how strange email addresses actually are to parse? How almost nobody who’s written email address validation has actually gotten it right?
That’s exactly the problem. Parsing things correctly is almost always a lot harder than it seems.
josh
Hi Josh,
Ben told me about the work you’re doing on record-streams and Gazelle. He was talking about the ~10x speed improvement of recs-collate since you rewrote it to be in C. One promising JSON parser that I’ve been playing around with is YAJL:
http://www.lloydforge.org/projects/yajl/
Anyways, I think YAJL is a pretty good JSON parser, if for no other reason than that it can be used in non-blocking code (like if you are doing event based programming).
That being said, I totally “get” Gazelle… Gazelle transcends “just a fast JSON parser”. In fact, I’ve always imagined in my head something similar to this when trying to write fast parsers in pure-Perl. Originally, I was thinking of trying to write a janky wrapper around yacc/bison so I can “steal” the action and goto table, then see about writing a pure-Perl driver that uses the action and goto table that YACC generated.
However, if a driver for the action/goto table was written in C and could be used for *all* parsers, that would be awesome! Personally, I try to minimize the amount of “outside” C code embedded in my VM (no matter if it is Perl, Java, or whatever), and by doing things the way you propose with Gazelle, this “outside” C code is kept to a minimum as well as being truly reusable.
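As a rough illustration of the “one driver, many tables” idea, here is a minimal Python sketch of a table-driven shift/reduce driver. The tables are built by hand for a toy grammar (S -> ( S ) | x); they are not yacc/bison output, and nothing here reflects how Gazelle is actually implemented. All names are hypothetical.

```python
# Hand-built LR tables for the toy grammar:
#   rule 1: S -> ( S )
#   rule 2: S -> x
# "$" marks end of input.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}
RULES = {1: ("S", 3), 2: ("S", 1)}  # rule number -> (left side, right side length)


def parse(tokens, action=ACTION, goto=GOTO, rules=RULES):
    """Generic driver: knows nothing about the grammar beyond the tables."""
    stack = [0]                      # stack of parser states
    stream = list(tokens) + ["$"]
    pos = 0
    while True:
        entry = action.get((stack[-1], stream[pos]))
        if entry is None:
            return False             # no table entry: syntax error
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = rules[arg]
            del stack[-length:]      # pop one state per right-side symbol
            stack.append(goto[(stack[-1], lhs)])
        else:
            return True              # accept


print(parse("((x))"))
print(parse("(x"))
```

Swapping in a different set of tables changes the language being parsed without touching `parse`, which is the part that would be worth writing once in C.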
I’m really excited about Gazelle, I’ve downloaded it and started looking into the code… my only suggestion so far is to make it so your event driven parser can “chomp on” buffers rather than reading from a FILE*. This way, if you are in a select() based event loop, you can read() as much as possible until you get an EAGAIN without being blocked on I/O. Your events can then get fired off until the incomplete input causes problems.
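Here is a minimal Python sketch of what “chomping on” buffers could look like. The class `PushLineParser` is hypothetical and only splits CRLF-terminated lines; it is not Gazelle’s API. The point is that the caller owns the I/O and hands over whatever bytes it has, while the parser keeps any incomplete tail until more data arrives.

```python
class PushLineParser:
    """Hypothetical push-style parser: it is handed buffers and never reads."""

    def __init__(self, on_line):
        self._pending = b""       # bytes of a line that is not complete yet
        self._on_line = on_line   # callback fired once per complete line

    def feed(self, chunk):
        """Consume one buffer, fire events for complete lines, keep the rest."""
        self._pending += chunk
        *lines, self._pending = self._pending.split(b"\r\n")
        for line in lines:
            self._on_line(line)


events = []
parser = PushLineParser(events.append)

# Chunks cut at arbitrary points, the way read() in an event loop delivers them.
for chunk in (b"GET / HT", b"TP/1.1\r\nHost: exa", b"mple.com\r\n"):
    parser.feed(chunk)

print(events)
```

In a select() based loop, each successful read() would be followed by one `feed()` call, and an EAGAIN would simply mean returning to the loop with the parser’s state intact.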
Also, I totally agree that parsers are *everywhere*. If you do any sort of RPC, you need a parser to unserialize the RPC results; if you are validating input, you need a parser, etc. Parsers are also really useful to allow people to enter information in a more natural way. For example, you might want to be able to accept arbitrary dates (“one week ago”, “Feb 27th 2007”, etc.). Another example is an interface language… who wants to write a WSDL file when you can write a Java interface with special annotations and turn that into the Ruby client RPC code, etc.?
Good luck on your efforts with Gazelle!
-Brian
Brian Maher
Hey Josh,
thx for talking about my project (Thin)
Ragel also supports compiling the state machine to Ruby.
Do you think Gazelle produces faster code than Ragel (or will)? If so, I’d like to talk w/ you, maybe we can work together on making a better and faster HTTP parser.
Send me an email if you’re interested
macournoyer
Hey Brian, sorry for taking so long to respond to you. Thanks so much for your vote of confidence, and for “getting” Gazelle. And I totally agree about needing an API that takes just memory pointers instead of a FILE* — what you see right now is just the beginning!
josh