Gazelle/upb status and plans (aka: On Releasing)
This summer my friends Ben and Mike gave me grief about never releasing anything. Their criticism is definitely valid to some degree. I’ve been working on Gazelle for about two years now, and upb for almost one. Gazelle has had four releases in that time, but they have mostly focused on moving Gazelle to where I think it ought to be, as opposed to releasing something hacky that people can actually use now. There is a class of problems that Gazelle is useful for now, but it is pretty small in comparison to the amount of work I’ve put in.
I haven’t released upb at all yet, and my last message indicating I’m thinking of porting it to C++ will probably make skeptical readers think I’m moving farther away from a release rather than closer to one.
Since I agree that my progress doesn’t look too promising to someone observing from the outside, let me say where I think these projects currently are, where they’re going, and when they’re likely to release.
First of all, Gazelle is currently pushed on the stack until I have upb released. The reason is that I realized that Protocol Buffers are the answer to two big problems I was facing with Gazelle:
- byte-code format: right now the Gazelle byte-code format is LLVM’s BitCode, which is the format LLVM uses for storing its byte-code internally. I invested a lot into BitCode (you’ll notice my name is on the linked document), including writing a standalone encoder and decoder (230 lines of Lua and 856 lines of C, respectively). But this was before I worked at Google or knew about Protocol Buffers. Protocol Buffers are much easier to use because they have a formal schema (the .proto file) that can generate nice APIs and help you out with backward compatibility. Without a formal schema, BitCode makes you resort to things like an ad hoc text file that describes the schema. This approach was showing its limits.
- parse tree format: I always knew I wanted Gazelle to be capable of generating parse trees in some kind of standard format. Protocol Buffers end up being a match made in heaven, since they are isomorphic to parse trees in a very deep way. Indeed, the ast system for the Elkhound Parser is very much like Protocol Buffers in that you define your parse tree format and it generates classes for representing your AST.
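To make the parse tree point concrete, here is a rough sketch of the correspondence: each grammar rule becomes a message type, each component of the rule becomes a field, and repetition in the grammar becomes a repeated field. The grammar (`expr -> term ('+' term)*`) and the class names `Expr` and `Term` are made up for illustration, and plain Python dataclasses stand in for the classes a .proto file would generate; none of this is Gazelle or upb code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    # Hypothetical rule: term -> NUMBER. A single scalar field.
    number: int


@dataclass
class Expr:
    # Hypothetical rule: expr -> term ('+' term)*.
    # Repetition in the grammar maps to a repeated field.
    terms: List[Term] = field(default_factory=list)


def evaluate(expr: Expr) -> int:
    """Walk the tree the same way you would walk a nested message."""
    return sum(term.number for term in expr.terms)


# The parse tree for "1 + 2 + 3" expressed as nested "messages".
tree = Expr(terms=[Term(1), Term(2), Term(3)])
```

A parser that emitted trees in this shape could hand them to any language with protobuf support, which is the appeal of using a standard format.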
Since Gazelle is gated on upb, the question then is: when will upb release? Why hasn’t it released already?
A few months ago I was working on upb for 100% of my time at work. I had banked 20% time for a while, and I was also a bit burned out on my 80% project, so my manager very graciously gave me the liberty to work on upb for all of my working hours.
During that time upb made progress in several areas. It got some better benchmarks and tests, and I fleshed out the upb compiler so that it wasn’t dependent on the official Protocol Buffers compiler for bootstrapping. Maybe most importantly, I worked a lot on the in-memory message format to figure out how to make it work well with dynamic languages.
My goal during that time was to write a Python extension that a few initial internal-to-Google customers could use. The value proposition is that it would be API-compatible with what they were already using, but many times faster. I wrote said extension, which was incomplete (supported decoding only, not encoding), but looked complete enough to use for this case.
By this time I was approaching the amount of time I could reasonably ask from my manager at work, so I had to tie up the loose ends and get it into my initial customer’s hands. I put all the pieces together and tried it out, but then ran into a problem: I hadn’t realized that this initial customer was using an old deprecated feature of Protocol Buffers called MessageSet. There was no way I could support MessageSet without significant changes. I was defeated for the moment. I had to take a break for a few months and re-devote my time to my 80% project.
I mention this all just to illustrate that I do have actual customers that I am targeting, and I have had aggressive pushes to deliver something to those customers, but unfortunately my work wasn’t complete enough for them yet.
This brings us up to now. In the last week or two, I have made several strides, including executing on part of a design that will get me MessageSet support. I have also developed an interface for a “pick parser”, which lets you pull only a small subset of fields out of a protobuf. This will be a big win for use cases that only need a few fields from a very large proto, and I have a customer internal-to-Google who is very interested in this interface.
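To give a feel for what a pick parser does, here is a minimal sketch in Python that walks the protobuf wire format and keeps only the field numbers you ask for, skipping over everything else without copying it. This is not upb’s interface (which isn’t released); the names `read_varint` and `pick_fields` are hypothetical, values are returned raw (varints as ints, everything else as bytes), and the long-deprecated group wire types are not handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def pick_fields(buf, wanted):
    """Return {field_number: [raw values]} for the wanted field numbers only."""
    picked = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:        # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            start, pos = pos, pos + 8
            value = None
        elif wire_type == 2:      # length-delimited (string, bytes, submessage)
            length, pos = read_varint(buf, pos)
            start, pos = pos, pos + length
            value = None
        elif wire_type == 5:      # fixed 32-bit
            start, pos = pos, pos + 4
            value = None
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if field_number in wanted:
            if value is None:
                # Only copy the payload for fields the caller asked for.
                value = bytes(buf[start:pos])
            picked.setdefault(field_number, []).append(value)
    return picked


# Field 1 = varint 150, field 2 = string "testing", field 3 = varint 1.
message = bytes.fromhex("089601" "120774657374696e67" "1801")
print(pick_fields(message, {1, 3}))  # {1: [150], 3: [1]}
```

The win comes from the length-delimited case: a large submessage or string you don’t care about costs one length read and a pointer bump, rather than a full decode into objects.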
Meanwhile I’m very interested in trying to get the upb Python extension into AppEngine. I think it could be a huge win there, because AppEngine users aren’t allowed to load custom Python extensions. This means that currently, people trying to use protocol buffers on AppEngine are limited to pure-Python implementations that are much slower than a C extension can be. But to get into AppEngine I will need to get a security audit, which is part of the reason I am leaning towards C++ at this point. I think C++ will make the code shorter and less gnarly (fewer casts), which should make it easier to verify. I have converted one header file so far, and it got 38% smaller and much easier to read.
I hesitate to make schedule estimates, so instead my main purpose is to impress on my possibly-impatient audience that:
- I do have motivation to release.
- I do have initial customers and initial use cases.
- I am making progress.
- I am currently focused on delivering (1) a pick parser, (2) a Python extension, (3) an easily-auditable code-base.
- I look forward to being able to announce my first release!
Yours,
Josh