In a comment on a previous entry, Emmanuel Castro asks:

You chose the LLVM bitcode format for Gazelle instead of Google Protocol Buffers. Is there any reason, or was it just because you made the choice before Protocol Buffers were available? Protocol Buffers seem to have better language support (C++, Java, Python).

This is a good question. The easy answer is that I chose LLVM Bitcode before Protocol Buffers were available. But the harder questions are “would I make the same choice today?” and “should I port Gazelle at some point to use Protocol Buffers (or something else) instead?”

Being quite familiar with both formats (I’ve implemented both), I thought I should take a moment to compare/contrast the two. These two formats target a really similar use case (compact, binary encoding of arbitrary data) but have some significant differences.

Data Definition/Schema

With protocol buffers, you lay out your schema in a .proto file that looks something like this:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

Having a formal schema means that generic protobuf tools can generate code with nice classes/structures that mirror the schema. Generic tools can also pretty-print Protocol Buffers given the .proto file.
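For example, here is roughly what using the C++ classes that protoc generates from the schema above looks like (the header name assumes the schema lives in a file called person.proto):

#include <string>
#include "person.pb.h"  // generated by protoc from person.proto

int main() {
  Person person;
  person.set_name("Harry");
  person.set_id(42);
  person.set_email("harry@example.com");

  // Repeated fields get add_*() accessors; nested messages and enums
  // become nested C++ types.
  Person::PhoneNumber* phone = person.add_phone();
  phone->set_number("555-1212");
  phone->set_type(Person::WORK);

  // Serialize to the compact binary wire format, then parse it back.
  std::string bytes;
  person.SerializeToString(&bytes);

  Person parsed;
  parsed.ParseFromString(bytes);
  return 0;
}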

With Bitcode, on the other hand, there is no formal schema. Schemas are informally specified in some mixture of documentation (here’s LLVM’s and here is Gazelle’s) and code. This means that generic Bitcode tools have much less information to go on for things like code generation or pretty-printing.

The lack of a schema system isn’t inherent to Bitcode; one could easily be added. But as things stand, Bitcode doesn’t have one, and Protocol Buffers are the clear winner here.

Space Efficiency

Both Protocol Buffers and Bitcode are quite concerned with space efficiency, but Bitcode is a bit more extreme about it. Specifically:

  1. Bitcode files are treated as a stream of bits rather than a stream of bytes. Bitcode is the only format I have ever encountered that does this. It buys extra space efficiency (a boolean can be encoded in a single bit, for example), but it makes encoding and decoding more expensive, since you have to do a lot of shifting and masking (the code sketch below gives a feel for this). I don't have any data on how significant that cost is.
  2. Bitcode lets you define "abbreviations", which specify bit-by-bit what your records will look like. You can define an abbreviation that says: "a 1-bit integer (a boolean), followed by a five-bit integer, followed by a variable-bit-rate integer whose chunks are four bits each." Using this facility you can optimize your records to use exactly as many bits as they need, with none wasted (see the worked example just below). But it takes a lot of work and foresight on the application developer's part to get the most out of this.
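To make the arithmetic concrete, here is how that exact abbreviation would pack a record whose three values are 1, 17, and 300 (the values are mine, and I'm ignoring each format's per-record overhead):

value 1    ->  1 bit:   1
value 17   ->  5 bits:  10001
value 300  -> 12 bits:  three VBR-4 chunks (3 data bits + 1 "more follows" bit each)
total      -> 18 bits

The same three values stored as byte-aligned varints, the way Protocol Buffers stores integers, would take 1 + 1 + 2 = 4 bytes (32 bits) before counting any keys.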

So the general idea is: Bitcode is smaller, but it probably costs more CPU and more work on the application developer’s part. Unfortunately, I have no measurements of how much smaller, how much more CPU, or how much more work.
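To give a feel for where the extra CPU goes, here is a minimal sketch, in C++, of what pulling an n-bit integer out of a bit stream involves. This is not LLVM's actual reader (a real one reads a word at a time rather than a bit at a time), but the shifting and masking is the same idea:

#include <cstddef>
#include <cstdint>

// Illustrative bit-stream reader, reading least-significant bit first.
struct BitReader {
  const uint8_t* data;
  size_t bit_pos;  // current read position, in bits
};

// Read the next num_bits (<= 32) as an unsigned integer.
static uint32_t ReadBits(BitReader* r, unsigned num_bits) {
  uint32_t result = 0;
  for (unsigned i = 0; i < num_bits; i++) {
    uint8_t byte = r->data[r->bit_pos / 8];
    uint32_t bit = (byte >> (r->bit_pos % 8)) & 1;
    result |= bit << i;
    r->bit_pos++;
  }
  return result;
}

A byte-oriented format like Protocol Buffers never has to do this: it reads whole bytes and only shifts when assembling varints.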

Data Model

The Protocol Buffer data model is: each protocol buffer is a series of key/value pairs. Any of the values can itself be a protocol buffer, which provides hierarchy. The values can be any of 10 integer types, float, double, boolean, UTF-8 string, or raw bytes. Any of the keys can repeat if the .proto file says they can. Conceptually, a message looks something like this:

person {
  name: "Harry"
  age: 42
  likes_color: "blue"
  likes_color: "green"
  vehicle: {
    make: "Toyota"
    model: "Prius"
  }
}
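On the wire, each of those key/value pairs is encoded with varints: every byte carries 7 bits of payload plus a "more bytes follow" flag, and a field's key is simply the varint (field_number << 3) | wire_type. Here is a rough sketch of the encoding side in C++ (the helper names are mine, not the protobuf library's):

#include <cstdint>
#include <string>

// Append a varint: 7 payload bits per byte, high bit set on every
// byte except the last.
static void AppendVarint(std::string* out, uint64_t value) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7f) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Wire type 0 = varint (integers, booleans, enums).
static void AppendVarintField(std::string* out, uint32_t field_number,
                              uint64_t value) {
  AppendVarint(out, (field_number << 3) | 0);
  AppendVarint(out, value);
}

// Wire type 2 = length-delimited (strings, bytes, nested messages).
static void AppendStringField(std::string* out, uint32_t field_number,
                              const std::string& value) {
  AppendVarint(out, (field_number << 3) | 2);
  AppendVarint(out, value.size());
  out->append(value);
}

Because every field announces itself with a key, fields can appear in any order and optional fields that are absent cost nothing; the flip side is that every field that is present pays at least one byte for its key.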

The Bitcode data model is: files are a series of data records, and data records can be nested inside blocks to provide hierarchy. Each data record has a code and a series of integers:

[DataRecord code=5, 3, 9]
(this next record might be ASCII values for "Hello",
 but there's no string type so we don't know)
[DataRecord code=8, 72, 101, 108, 108, 111]
[block id=9]
  [DataRecord code=4, 6, 5]
  [block id=10]
    [DataRecord code=8, 8, 1]
  [/block]
[/block]

Clearly there is less semantic information at the Bitcode level, which means that generic Bitcode tools can’t show you as much. Most notably, Bitcode has no notion of a string. And since Bitcode records have no keys, everything is dictated by the order of the records in the file and the order of the integers within each record. This means that Bitcode will be more space-efficient if all of your records have the same set of fields. On the other hand, if your records have a large number of optional fields but each actual record uses only a few of them, then Protocol Buffers will be more space-efficient.

Both formats are capable of seeking over sub-records/blocks that you aren’t interested in parsing.

Adoption/Support

I haven’t been following the spread of Protocol Buffers too closely (which is a shame, since I keep meaning to finish my C implementation), but it’s almost certainly more widely adopted than Bitcode, which has never really been promoted for use outside of LLVM. This is definitely a concern, because I ended up having to implement Bitcode myself. That was a significant diversion: it took me many weeks and a lot of frustration, and despite being only about 1000 lines of code, it was subtle and difficult to get right. I would much rather have used an existing library, but LLVM’s Bitcode library was C++ and I wanted to stay 100% C.

Conclusion

So on every count except space efficiency, Protocol Buffers seem to be a win. If I were doing it again, I might well choose Protocol Buffers instead. I’d still like to know just how much more space-efficient Bitcode is, because space efficiency is quite important to me.

I still have a half-implemented C Protocol Buffers library sitting around, and I hope that someday I’ll get around to finishing it. Once I have that, I may consider porting Gazelle’s bytecode to use Protocol Buffers instead.