Archive

Archive for June, 2007

Wishlist for JSON

June 28th, 2007 Josh 4 comments

JSON is great. It’s a nice implementation of the “heterogeneous structure of hashes and arrays” pattern that I have come across over and over. Just off the top of my head:

  • Perl, Python, Ruby, Javascript, and pretty much any language like them uses them as their basic data structure
  • The Tibco Rendezvous message format.
  • The NeXTSTEP property list file format.
  • The XML-RPC data format (even though it’s encoded in XML).
  • PDF files use them to specify data
  • Other times I can’t mention because I encountered them inside companies

This pattern is seriously everywhere. This is why JSON is nice: now there’s a sort of quasi-standard encoding people can reach for when they encounter this problem.

However, there are two things I reeeaallly wish it had, that many of the formats I mentioned above do have.

  1. A raw binary type. All it’s got right now are unicode strings, which are great if you’ve got known-good unicode data. But if you want to dump data of unknown encoding, or that might be malformed unicode, or that is just a big binary blob, JSON can’t help you. You’d have to convert it to 7-bit first using an encoding like quoted-printable or Base64. This is a bummer, and as an extra bummer, the data would no longer be self-describing — you’d have to store elsewhere (or know out-of-band) that the data is encoded this way.
  2. A date type. Right now your only option is to encode a date as a string or a number. I kind of like this guy’s proposal for doing that. He proposes encoding dates like so:
    ["FOO": @2007-06-28@]

With these two features added, JSON would be a lot more useful. Yeah, it wouldn’t be JavaScript compatible any more, but these features should make it into JavaScript also! Apparently there is a chance that the date literal notation will make it into JavaScript v2…

Categories: Uncategorized Tags:

Draft Syntax

June 25th, 2007 Josh No comments

To make my goals and my approach a little more tangible, here is a draft syntax file for parsing JSON using my engine.

number   -> /(-)? ( 0 | ([1-9][0-9]*) ) (\. [0-9]+ )? ([eE] [+-]? [0-9]+ )?/;
str_frag -> /[^\\"]+/ | /\\u ([0-9A-F]{4})/ | /\\[ntbr"\/\\]/;
string   -> '"' str_frag* '"';
value    -> string | number | "true" | "false" | "null" | object | array;
object   -> "{" (string ":" value) *(,) "}";
array    -> "[" value *(,)              "]";

whitespace -> /[\r\n\s\t]+/;
allow whitespace in object, array

I think this will be mostly self-explanatory, but I want to point out a few things:

  • There are no actions in this file! To use this from a language like ruby, you would write a program something like this:
    # json.grm is the file that has the grammar above
    parser = Parser.new("json.grm")
    parser.on("string") { |string| puts "Parsed a string: #{string}" }
    parser.parse("foo.json")
    

    Keeping the actions out of the grammar file is what will make the grammar reusable.

  • There is no separation between parser and lexer. To describe terminals, you use either strings or regular expressions. I first got this idea from these guys. I initially resisted, because they seemed to hand-wave around the practical problems of this approach. However, as I thought about it more, I became convinced that it should be possible for the parsing runtime to be smart enough to parse efficiently without needing the programmer to tell it what productions are “tokens” and which are not.
  • The construct *(some-str) is like a kleene star, but lets you specify a string that splits the occurrences. For example, /\d+/ *(,) matches 1,2,3,4,5 sort of like (/\d+/ ( ',' /\d+/ ) *)? would, but in a lot fewer characters.
  • “allow whitespace” lets us be tolerant to whitespace without having to explicitly put it between every symbol in the grammar. In traditional lexer/parser designs, the lexer would do this automatically, but then the whitespace would be gone forever, which is bad news if you want to do whitespace-preserving transformations on programs.

I can’t actually parse/interpret this grammar yet, but I’m getting there! I really like where this is going; the syntax specification is really short, easy to understand, and I think it contains enough information to build a parser that is as fast as theoretically possible. However, I have only my eyes and my opinion: I’m interested to hear what other people think about it.

Categories: Gazelle Tags:

Parsing framework status

June 4th, 2007 Josh 4 comments

It’s time for me to post an update about how my parsing engine is going. In a word: very well!

I’ve decided to use Lua to implement the parts that aren’t on the critical path. All the up-front work of creating NFAs, DFAs, and parsing tables isn’t on the critical path; it’s one-time overhead. Since Lua is pretty fast, garbage-collected, and quite small (~150k compiled), I think implementing these parts in it is a smart choice. Also, I can compile the Lua to byte-code and stick that in a static character array in my C library, so my library will still be entirely self-contained. It won’t be obvious to the engine’s users that Lua is involved at all.

So what have I implemented? So far, I have a large part of the lexer complete. I have written:

  • NFA construction. This lets you build up NFAs by composing primitives like concatenation, alternation, and kleene star. So for example, to build an NFA for the regex ab* you could call concat(char("a"), kleene(char("b")). (Check out the code with beautiful inline comments with ASCII-art diagrams).
  • NFA-to-DFA conversion. This takes advantage of the well-known result that NFAs can be simulated with DFAs. There is a bit of complication in the fact that for a lexer, you want to match any of several NFAs simultaneously (because any token could come next), and you want to know what token matched (so the final states must carry extra information).
  • DFA minimization: Given the DFA produced in the NFA-to-DFA conversion, create an equivalent DFA with the minimum number of states. It turns out that the best-known algorithm for doing this (Hopcroft’s algorithm, published in 1971, which is O(n lg n)) is not very well known, and not presented in any of the textbooks I’ve been using. Even the illustrious “Dragon Book” gives the more straightforward O(n^2) (at least — I haven’t analyzed it) algorithm. The textbook Engineering a Compiler claims to give Hopcroft’s algorithm, but also gives the more straightforward and expensive variant. To figure out how to implement this algorithm, I had to look in Hopcroft’s original 1971 paper. I also went to the UW library and photocopied a 1972 article by David Gries that explains the algorithm in a bit more digestible way.

    This minimization algorithm was a bit of an adventure — I don’t know if I’ve ever had to reach back more than 30 years into the literature to find the state of the art.

  • a recursive-descent parser for regular expressions: it handles most of the basic things you expect from regular expressions. For now I’m going with mostly Perl-compatible syntax. The one thing I’m intentionally diverging on (for now) is that spaces are insignificant; you use \s when you want to match a space. Regular expressions are already hard enough to read — authors should at least have whitespace as a tool to make them more readable.

The code is in my public Mercurial repository. Here’s my favorite part. Total code size so far: 1156 lines (not including unit tests or code that dumps the DFA into graphviz format).

Here is an example of it in action. Here are minimal DFAs for parsing JSON. There are two: one for when you’re inside a string, and one for the rest of the time.

json1

json2

I created these by feeding files with regexes into this program, which dumps a graph in dot format.

require "regex_parser"
require "nfa_to_dfa"
require "sketches/regex_debug"
require "minimize"
 
nfas = {}
while true do
  local line = io.read()
  local label = io.read()
  if line == nil then break end
  nfa = parse_regex(TokenStream:new(line))
  table.insert(nfas, {nfa, label})
end
 
dfa = nfas_to_dfa(nfas)
minimal_dfa = hopcroft_minimize(dfa)
print(minimal_dfa)
Categories: Gazelle Tags:

The Balkanization of Distributed Version Control

June 3rd, 2007 Josh 11 comments

I think it’s a shame that distributed version control systems are so fragmented. Today you’ve got darcs, mercurial (hg), git, monotone, codeville, bzr, and more to choose from. Fragmentation in this space is really bad because it’s really hard to interoperate between VCS’s. Since each tool has its own learning curve, interacting with a new project could mean having to learn yet another VCS tool.

What really bothers me about this situation is that this field (dVCS) is not very well understood yet. I think there are very few people out there who can really weigh the fundamental differences between these systems. So most people either choose to go with the first system they try, or they prefer one to the other based on strengths or weaknesses of the implementation — eg. git is too hard to understand, I can’t use git on windows, etc.

Even very visible and respected people who write about this only compare the fundamental differences in very vague terms; Ted Tso says Git has “more legs” (more potential) and Keith Packard is leery of Mercurial because Linux’s implementation of file truncation (which Mercurial uses for crash recovery) is racy.

Do you see the problem? We’re making big decisions about how we’re going to encode the history of our precious code based on shallow characteristics of the current implementations. But Mercurial and Git differ in a really fundamental way: Git stores the nodes (file revisions), Mercurial stores the edges (diffs between revisions). And I don’t think anyone fully understands all the implications of either choice.

I’d love to see some academic research that really exhaustively compares all of the dVCS approaches that are flying around, and spells out their fundamental advantages and disadvantages. But until that happens, I’m afraid we’re going to see more of the status quo, where people choose sides based on shallow reasoning and stick with their team until the end.

Categories: Uncategorized Tags: