Archive

Archive for January 11th, 2009

LLVM Bitcode vs. Protocol Buffers

January 11th, 2009 Josh 6 comments

In a comment to a previous entry, Emmanuel Castro asks:

You chose the LLVM bitcode format for Gazelle instead of Google Protocol Buffer.

Is there any reason, or was it just because you made the choice before Protocol Buffer was available.

Protocol Buffer seems to have a better language support (C , Java, Python).

This is a good question. The easy answer is that I chose LLVM Bitcode before Protocol Buffers were available. But the harder questions are “would I make the same choice today?” and “should I port Gazelle at some point to use Protocol Buffers (or something else) instead?”

Being quite familiar with both formats (I’ve implemented both), I thought I should take a moment to compare/contrast the two. These two formats target a really similar use case (compact, binary encoding of arbitrary data) but have some significant differences.

Data Definition/Schema:

With protocol buffers, you lay out your schema in a .proto file that looks something like this:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

Having a formal schema means that generic protobuf tools can generate code with nice classes/structures that mirror the schema. Generic tools can also pretty-print Protocol Buffers given the .proto file.

With Bitcode, on the other hand, there is no formal schema. Schemas are informally specified in some mixture of documentation (here’s LLVM’s and here is Gazelle’s) and code. This means that generic Bitcode tools have much less information to go on for things like code generation or pretty-printing.

The lack of a schema system isn’t inherent to Bitcode, and one could easily be added. But given the status quo of it not having one, Protocol Buffers are the clear winner here.

Space efficiency

Both Protocol Buffers and Bitcode are quite concerned with space efficiency, but Bitcode is a bit more extreme about it. Specifically:

  1. Bitcode files are considered a stream of bits rather than a stream of bytes. Bitcode is the absolute only format I have ever encountered that does this. This leads to greater space efficiency (since you can encode a boolean value using only 1 bit, for example), but is more expensive to encode/decode since you have to be shifting and masking a lot. I don’t have any data about how significant this is.
  2. Bitcode allows you to define “abbreviations”, which allow you to define bit-by-bit what your records will look like. You can define an abbreviation that says: “a 1-bit integer (a boolean) followed by a five-bit integer followed by a variable-bit-rate integer where each chunk is four bits.” Using this facility you can optimize your records to use exactly as many bits as they need, with none wasted! But it takes a lot of work and foresight on the application developer’s part to get the most of this.

So the general idea is: Bitcode is smaller, but probably takes more CPU and more work on the application developer’s part. But unfortunately I have no measurements about how much smaller or how much more CPU or how much more work.

Data Model

The Protocol Buffer data model is: each protocol buffer is a series of key/value pairs. Any of the values can itself be a protocol buffer, which provides hierarchy. The values can be any of 10 integer types, float, double, boolean, UTF-8 string, or raw bytes. Any of the keys can repeat if the .proto file says they can.

person {
  name: "Harry"
  age: 42
  likes_color: "blue"
  likes_color: "green"
  vehicle: {
    make: "Toyota"
    model: "Prius"
  }
}

The Bitcode data model is: files are a series of data records, and data records can be nested inside blocks to provide hierarchy. Each data record has a code and a series of integers:

[DataRecord code=5, 3, 9]
(this next record might be ASCII values for "Hello",
 but there's no string type so we don't know)
[DataRecord code=8, 72, 101, 108, 108, 111]
[block id=9]
  [DataRecord code=4, 6, 5]
  [block id=10]
    [DataRecord code=8, 8, 1]
  [/block]
[/block]

Clearly there is less semantic information at the Bitcode level, which means that generic Bitcode tools can’t show you as much. Most notably, Bitcode has no notion of a string. And since there are no keys to the records in Bitcode, everything is dictated by the order of the records in the file and the order of the integers within the records. This means that Bitcode will be more space-efficient if all of your records have the same set of fields. On the other hand, if your records have a large number of possibly optional fields but each actual record uses only a few, then Protocol Buffers will be more space efficient.

Both formats are capable of seeking over sub-records/blocks that you aren’t interested in parsing.

Adoption/support

I haven’t been following the spread of Protocol Buffers too closely (which is a shame since I keep meaning to finish my C implementation), but it’s almost certainly more widely adopted than Bitcode, which has never really been marketed at anything other than LLVM. This is definitely a concern, because I ended up having to implement Bitcode myself. This ended up being a significant diversion for me — it took me many weeks and a lot of frustration, and despite being only about 1000 lines of code, it was subtle and difficult to get right. I would have much rather used an existing library, but LLVM’s Bitcode library was C++ and I wanted to stay 100% C.

Conclusion

So on every count except space efficiency Protocol Buffers seem to be a win. If I was doing it again I might choose Protocol Buffers instead. I’d still like to see the degree to which Bitcode is more space-efficient, because space-efficiency is quite important to me.

I still have a half-implemented C protocol buffer library sitting around, and hope that someday I’ll get around to finishing it. Once I have that, I may consider porting Gazelle’s bytecode to use protocol buffers instead.

Categories: Uncategorized Tags:

The importance of being earnest

January 11th, 2009 Josh 1 comment

While I’m at it, I wanted to take a moment to recognize that I’ve too frequently made claims that are unsupported or premature. The worst example of this was when I claimed to have found an algorithm that computes LL(k) for all k (this is undecidable — my algorithm turned out to be a heuristic). In other cases, what I’ve written is technically true but misleading: for example, in claiming that Gazelle is “more powerful” than ANTLR, this was not taking into account many features that ANTLR has that Gazelle currently does not, like semantic and syntactic predicates.

I’m not proud of this. I’d like to take the opportunity to make a commitment to putting those days behind me. I am extremely proud of Gazelle and what I think it brings to the table, and I’ve tried to consistently acknowledge my debts to prior work, especially ANTLR. Indeed, in the Gazelle manual I say:

Gazelle’s algorithm takes a great deal of inspiration from ANTLR, the LL(*) parser generator written by Terence Parr. While Gazelle seeks to augment the theory that Terence developed, the foundations of Gazelle are strongly based in ANTLR’s achievements. Like ANTLR, Gazelle is a top-down parser that models its lookahead as possibly-cyclic state machines. Gazelle’s algorithm for generating this lookahead is based on ANTLR’s algorithm.

Still, I clearly need to be much more careful not to jump to conclusions, and be less susceptible to bravado. If Gazelle is as good as I think it is, its capabilities should speak for themselves.

Categories: Gazelle Tags:

Looking to 0.4

January 11th, 2009 Josh No comments

Given that whitespace has been relatively painless to implement, I think I’ll tackle a number of smaller things I’ve been meaning to do before releasing 0.4. Like whitespace-handling, these are things that prevent Gazelle from being more than a toy until they’re addressed:

  1. when Gazelle encounters a parse error, it should report it to the API in a sane way instead of doing assert(false) or just printing a message to stderr and aborting
  2. Gazelle should provide line and column information to the API (this is more complicated than it might sound — read the Wikipedia article on “newline” and the Unicode recommendations on Newline for some of the complications)
  3. Gazelle’s C API should be namespaced (with gzl_ or somesuch on all the identifiers) instead of polluting the global namespace with functions called things like “parse”
  4. a “make install” target, so that you can just type “gzlc” for the compiler instead of “. lua_path; ./compiler/gzlc”

These sound like a good set of goals for a 0.4 release. I may drop a few or think of a few others.

Categories: Gazelle Tags:

Gazelle handling whitespace/comments

January 11th, 2009 Josh No comments

I’m digging back into Gazelle after a several-month dormancy. If you happen to be someone who syndicates my commits, either directly or via your GitHub news feed (don’t laugh — I have 18 “watchers” on GitHub!) you’ll notice a flurry of commits from the last few days. I’ve been working on whitespace handling, and progress is refreshingly brisk: I have things pretty well working after only a few days of work.

Whitespace handling was for a long time the biggest thing preventing Gazelle from being actually useful. I had a whitespace-handling mechanism back in 0.1, but had to rip it out in 0.2 because it was naive and premature. It’s back now, and while it looks mostly the same from the user’s perspective, it’s totally different under the hood.

Here’s the current picture of a JSON grammar, using the new whitespace facility:

// Grammar for JSON, as defined at json.org.

@start object;

object   -> "{" (string ":" value) *(,) "}";
array    -> "[" value *(,)              "]";

str_frag -> .chars=/[^\\"]+/ |
            .unicode_char=/\\u ([0-9A-Fa-f]{4})/ |
            .backslash_char=/\\[ntbr"\/\\]/;
string   -> '"' str_frag* '"';

number   -> .sign="-"?
            .integer=/ 0 | ([1-9][0-9]*) /
            .decimal=/ \. [0-9]+ /?
            .exponent=/ [eE] [+-]? [0-9]+ /? ;

value    -> string | number | "true" | "false" | "null" | object | array;

whitespace -> .whitespace=/[\r\n\s\t]+/;

@allow whitespace object ... number, string;

It’s the @allow statement at the end that triggers the whitespace facility (which are more generally called “subparsers”). That statement means that any time the parser is inside an object (eg. all the time), but not inside a number or a string, whitespace can appear anywhere. More specifically, for all states in such rules, the @allow statement defines a “whitespace” transition back to the same state.

As a result, the rule graphs start looking like this:

array

All the whitespace transitions are a little bit annoying to look at, and I’ll definitely have to tweak the visualization to show them in a nicer way. But this gives you an idea for what’s going on.

What’s nice about this scheme is that you get to specify the whitespace rules in a simple way (”whitespace is allowed everywhere except in these certain rules”), but then the whitespace becomes a part of the same grammar data structure and is analyzed by the same algorithms as everything else.

It’s worth mentioning that this design is significantly different than the traditional design, which is to have a separate lexer simply discard whitespace and comments before they ever hits the parser. The lovely thing about doing it my way is that the whitespace becomes a part of the parse tree (if you want it to — you’ll be able to turn this off). This means:

  1. round-trips from text -> parse tree -> text are lossless
  2. therefore programs can operate on the parse tree to do whitespace-preserving transformations
  3. programs can also analyze the parse tree with regard to how it follows style rules
  4. programs like doxygen that consider comments and whitespace significant could operate on a Gazelle parse tree
Categories: Gazelle Tags: