Every day when I read the programming reddit, I see things that reaffirm to me why Gazelle matters.
Yesterday it was an “ask reddit”: “Need a C library for parsing C files, suggestions?” Responses include:
- “How about gcc?”, clearly not realizing that gcc is (1) not a library and (2) ridiculously complicated.
- “Here are an ANSI C grammar for lex and yacc. [...] (NB: These were last updated in 1995, so I don’t know if you’ll need to tweak them any. But, at least they’ll get you close [...])”. I am constantly surprised that people do not realize how utterly useless it is to have software of unknown quality. Can you imagine asking for an implementation of SHA1, and having someone hand you some code and say “I’m not sure if it’s quite right, but it should at least get you part of the way there”? You don’t know where the possible problems are, and you don’t know anything about its design process. Half-written code is so useless that you might as well start from scratch.
- “Language.C: manipulating and generating C abstract syntax from Haskell”. So here you have someone who’s doing the hard work of parsing C and making sure it’s correct, but he’s writing his library in Haskell so it only works for Haskell. Sadly useless to 99.9% of the programming world.
Someone also did mention Elsa, which is probably the best solution to the original guy’s problem, but then someone else replied:
“elsa looks really good. I need code detail that the preprocessor doesn’t know about, but elsa can probably get me there. Any interest in a python wrapper?”
There you go again — even when someone has done a good job (like Elsa has), it’s still useless to people parsing from other languages unless you write specific “bindings” for each language you want to parse from. Madness, pure madness! The idea that it should take N^2 work to parse N languages from N languages is madness.
The two most important design goals of Gazelle are:
- reusable grammars: grammars that can be used by anyone without modification. Grammars that can have test suites, to ensure quality and give people confidence that they are correct.
- grammars that you can use from any language, without bindings. Sure, you have to write bindings for Gazelle itself, but that only has to happen once, not once per language you want to parse. So parsing N languages from N languages takes N+N effort, not N^2.
Of course, Language.C is done in Haskell because strongly typed,
functional languages with algebraic data types and pattern matching are
particularly well suited to the task of language processing. They’re a
sweet spot for compiler writing.
That said, Haskell’s perfectly interoperable with C, or Python, or
anything that understands the C calling convention, so you could just
bind to Language.C from other languages (much as others bind to CIL, yet
another C parser/analyser, this time in OCaml).
Hi Don,
I’m not sure that I agree that languages like Haskell are the best for compiler writing. But then again, my project (which I intend to be very, very good at this) is still partially vapor, so I can’t make any grandiose claims about being better than Haskell in this problem domain. Yet.
But for your second point, you seem to be missing the N^2 vs. N+N argument I was trying to make. Sure, of course you can write bindings. I’ll ignore for a moment the practical matters of what it means to embed Haskell into your process and just assume that bindings are a practical and compelling solution.
Even if I grant you that, writing bindings for each language pair that you want to parse from/to is madness! It takes N^2 work. If you do things my way, you have to write:
- a parser for C
- a parser for Haskell
- a parser for JavaScript
- Gazelle bindings for C
- Gazelle bindings for Haskell
- Gazelle bindings for JavaScript
At that point, I can parse Haskell from JavaScript, or C from Haskell, or any of the N^2 combinations, but it only took me N+N work. If I do things your way, to get the same functionality, I have to write:
- a parser for C
- a parser for Haskell
- a parser for JavaScript
- C bindings for the C parser
- C bindings for the Haskell parser
- C bindings for the JavaScript parser
- Haskell bindings for the C parser
- Haskell bindings for the Haskell parser
- Haskell bindings for the JavaScript parser
- JavaScript bindings for the C parser
- JavaScript bindings for the Haskell parser
- JavaScript bindings for the JavaScript parser
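To make the counting concrete, here is a rough sketch in Python that generates both lists for any set of languages. The function names are made up for illustration and have nothing to do with Gazelle's actual API. It assumes every parser and every set of bindings counts as one unit of work. Counted that way, the per-pair scheme comes to N + N^2 items once the parsers themselves are included.

```python
from itertools import product


def shared_runtime_work(languages):
    """Work items when every grammar runs on one shared runtime (N + N)."""
    parsers = [f"a parser for {lang}" for lang in languages]
    # One set of runtime bindings per host language, reused by every grammar.
    bindings = [f"runtime bindings for {host}" for host in languages]
    return parsers + bindings


def per_pair_work(languages):
    """Work items when each parser needs bindings per host language (N + N^2)."""
    parsers = [f"a parser for {lang}" for lang in languages]
    # One set of bindings for every (host language, parsed language) pair.
    bindings = [
        f"{host} bindings for the {target} parser"
        for host, target in product(languages, repeat=2)
    ]
    return parsers + bindings


langs = ["C", "Haskell", "JavaScript"]
print(len(shared_runtime_work(langs)))  # 6 items: 3 parsers + 3 bindings
print(len(per_pair_work(langs)))        # 12 items: 3 parsers + 9 bindings
```

With three languages the gap is only 6 versus 12, but with ten languages the same functions give 20 items versus 110.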
Bindings aren’t trivial. So at the end of the day, since no one is actually going to do the N^2 work, the Language.C person will say “sure, you can use my library, all you have to do is write bindings!” Then either the person will decide that bindings are too much work for the return you get (because they are), or the bindings will be partially written and sporadically maintained and of no use to anyone else, and we’ll keep having the fragmented parser landscape we have today.
Or I’ll finish Gazelle and that pain will be a thing of the past.