Draft Syntax
To make my goals and my approach a little more tangible, here is a draft syntax file for parsing JSON using my engine.
number -> /(-)? ( 0 | ([1-9][0-9]*) ) (\. [0-9]+ )? ([eE] [+-]? [0-9]+ )?/; str_frag -> /[^\\"]+/ | /\\u ([0-9A-F]{4})/ | /\\[ntbr"\/\\]/; string -> '"' str_frag* '"'; value -> string | number | "true" | "false" | "null" | object | array; object -> "{" (string ":" value) *(,) "}"; array -> "[" value *(,) "]"; whitespace -> /[\r\n\s\t]+/; allow whitespace in object, array
I think this will be mostly self-explanatory, but I want to point out a few things:
- There are no actions in this file! To use this from a language like ruby, you would write a program something like this:
# json.grm is the file that has the grammar above parser = Parser.new("json.grm") parser.on("string") { |string| puts "Parsed a string: #{string}" } parser.parse("foo.json")Keeping the actions out of the grammar file is what will make the grammar reusable.
- There is no separation between parser and lexer. To describe terminals, you use either strings or regular expressions. I first got this idea from these guys. I initially resisted, because they seemed to hand-wave around the practical problems of this approach. However, as I thought about it more, I became convinced that it should be possible for the parsing runtime to be smart enough to parse efficiently without needing the programmer to tell it what productions are “tokens” and which are not.
- The construct *(some-str) is like a kleene star, but lets you specify a string that splits the occurrences. For example, /\d+/ *(,) matches 1,2,3,4,5 sort of like (/\d+/ ( ',' /\d+/ ) *)? would, but in a lot fewer characters.
- “allow whitespace” lets us be tolerant to whitespace without having to explicitly put it between every symbol in the grammar. In traditional lexer/parser designs, the lexer would do this automatically, but then the whitespace would be gone forever, which is bad news if you want to do whitespace-preserving transformations on programs.
I can’t actually parse/interpret this grammar yet, but I’m getting there! I really like where this is going; the syntax specification is really short, easy to understand, and I think it contains enough information to build a parser that is as fast as theoretically possible. However, I have only my eyes and my opinion: I’m interested to hear what other people think about it.
Categories: Gazelle
Recent Comments