GCC: the impressive and the disappointing

In my work on upb I’ve looked at a lot of compiler-generated assembly code. I frequently want to know how GCC will compile a certain block of code, so I’ll write a little test program in C and use objdump to look at the object file.

Over two years of doing this, I’ve had many moments where I am pleasantly surprised at how smart the compiler is being. C compilers can figure out a lot. I demonstrated a few examples of this on stackoverflow.com — see here and here.

But once in a while I am disappointed, because GCC doesn’t figure something out that I really wish it could. For example, I have a situation where I have a callback interface, but one of the parameters to my callback is something that a lot of clients don’t need. For convenience, I want to let them register a callback that doesn’t take the final parameter. But to be ANSI C, I can’t cast between the two callback types, I need to store the two possible callback types in a union. My quick test looked like this:

// My two callback types -- the second takes an extra parameter.
typedef union {
  void (*f1)(int);
  void (*f2)(int, int);
} funcs;

// I store "which" separately to track which type of callback was registered.
void foo(funcs f, int which) {
  int a = 5, b = 10;
  if (which) {
    f.f1(a);
  } else {
    f.f2(a, b);
  }
}

Now on x86-64, parameters are passed in registers (thank god!), so there’s no actual reason to branch here. You might as well just always put both values in registers, because if we were calling “f1″ it will just ignore the register that’s holding the second parameter, or overwrite it which is fine too. The only reason we put the “if” in the C code was to be ANSI C compilant — according to the standard you can’t cast between function pointer types.

But alas, GCC didn’t figure this out:

0000000000000000 :
   0:	85 f6                	test   esi,esi
   2:	48 89 f8             	mov    rax,rdi
   5:	75 11                	jne    18 
   7:	be 0a 00 00 00       	mov    esi,0xa
   c:	bf 05 00 00 00       	mov    edi,0x5
  11:	ff e0                	jmp    rax
  13:	0f 1f 44 00 00       	nop    DWORD PTR [rax+rax*1+0x0]
  18:	bf 05 00 00 00       	mov    edi,0x5
  1d:	ff e0                	jmp    rax

Notice it put a branch in there (“jne”) just to avoid putting the value 10 (0xa) in register esi for the one-argument path.

It’s even worse if I give both functions one parameter, but of different types:

typedef union {
  void (*f1)(int);
  void (*f2)(long);
} funcs;

void foo(funcs f, int which) {
  int a = 5;
  if (which) {
    f.f1(a);
  } else {
    f.f2(a);
  }
}

In this case, GCC again generates a branch, but both paths have identical code in them!

0000000000000000 :
   0:	85 f6                	test   esi,esi
   2:	48 89 f8             	mov    rax,rdi
   5:	75 09                	jne    10 
   7:	bf 05 00 00 00       	mov    edi,0x5
   c:	ff e0                	jmp    rax
   e:	66 90                	xchg   ax,ax
  10:	bf 05 00 00 00       	mov    edi,0x5
  15:	ff e0                	jmp    rax

Both branches simply move 5 into edi and jump to rax. There is absolutely no reason to branch here. Sigh.

C compilers are smart, but have their limits. Another thing this demonstrates is how programming in C can be a bit constraining compared to assembly language. The only reason I’m jumping through these hoops to begin with is that C has very strict rules about pointer conversion: you can’t just go around casting one function pointer type to another, because you’ll get undefined behavior. But if you’re programming in assembly there is no undefined behavior, no worrying about aliasing, etc. The code I’m trying to get C to generate in a standards-compliant way would be trivial to write in assembly language directly.

Of course in that case I’d have to implement the assembly language on every platform I wanted to target. In the end I’ll probably use the branchy version that the compiler will generate; the branch will probably predict pretty well, and more importantly for the real fast path for protobuf decoding I have a JIT on the way…

Posted in Uncategorized | 19 Comments

Gazelle being integrated into a syntax aware editor

I’ve been meaning to post this for several days now: I was very excited to hear on the gazelle-users mailing list that Gazelle is being integrated into a syntax-aware editor called Kod. See the screencast for a preliminary demo.

This is very exciting for me, because the premise that a text editor could use a real parser as you type was one of my major motivations for creating Gazelle. Most text editors clever pattern matching for the syntax highlighting, but since they aren’t real parsers they can get confused. More importantly, they can’t give you true syntax-aware completion. The notable exception is Java editors like Eclipse, which do this quite well, and I hear that Microsoft’s IDEs are good at it too. But I always wanted an editor where you could add new languages easily and have more ability to tinker with how this syntax information is used. I think it could open up so many doors.

Given this, I was very excited when Rasmus Andersson from the Kod project wrote to the gazelle-users mailing list asking if Gazelle was a dead project, or if it was something he and his team should invest effort into integrating with Kod. I replied that it was definitely not a dead project, but that it is still subject to major change. That didn’t seem to dissuade him though, and I look forward to learning from their needs and folding their improvements back into Gazelle.

Posted in Gazelle | Leave a comment

Gazelle/upb status

It has been just over a year since I last posted, leading some people to rightfully wonder whether my projects Gazelle and upb are abandoned. The answer to that question is a resounding “no.” I am more motivated to complete Gazelle and upb than I have ever been, and I have been working on upb actively lately (here are my recent commits).

However, if I’ve learned one thing over the last several years of working on Gazelle and upb, it’s that I am extraordinarily bad at knowing how close I am to being ready to release. I don’t want to make any predictions or promises. I honestly thought I was almost ready to release upb a year and a half ago, but I’ve almost completely rewritten it twice since then. Each time it gets significantly better and closer to being what I want it to be. Once I have a release I’ll describe more about the stages of evolution it went through and how each iteration was objectively better than the one before it.

The core interfaces have gotten to a point where I’m really happy with them and feel no more need to rework them. I think they will continue to evolve incrementally, but not in a way that requires redoing them completely. Here are the most core interfaces — if you’re interested in upb, I recommend reading these headers. I’ve just added substantial comments to explain them, and more than anything these will give you a taste of what upb is all about:

  • upb_stream.h, the streaming interfaces for doing SAX-like tree traversal of protobuf data, and abstractions of fread()/fwrite(). These are probably the most important interfaces in all of upb, since all of the encoders and decoders are based on them.
  • upb_string.h, an immutable, length-delimited (instead of NULL-terminated), reference-counted string type.
  • upb_def.h, the data structures for a protobuf schema (.proto file) and routines for loading them.

By the way, I also mean to write zero-overhead C++ wrappers around the above to give you C++ programmers a nicer interface at no cost.

With those set, I have been rapidly getting everything building/working again. It’s a bit annoying to rewrite upb_def.c for the third time (literally) but it feels good knowing the interfaces are right.

So I have renewed optimism that I’ll be releasing soon. And once I’m happy with upb, it’s back to Gazelle.

Posted in Gazelle, upb | 4 Comments

Torn over the C++ question

I am having a very difficult time deciding whether to go through with the C++ port of upb or to stay in C.

I’ve ported about one third of upb to C++, on a branch, to see how it would turn out. It was a ton of work. Here are my current observations:

  • The C++ is cleaner, more readable, less error-prone code. It’s just a fact. Compare for yourself (C: upb_def.h, upb_def.c; C++: upb_def.h, upb_def.cc). This is due to numerous factors:
    • type-safe containers means fewer casts.
    • “public” and “private” keywords make it easy to separate the private parts of your interface, without having to specify in comments which is which.
    • namespaces and class scope mean that I don’t have to write out my identifiers like upb_fielddef_dothis(), I can just write DoThis().
    • real inheritance and member classes mean I don’t have to explicitly call all the right constructors/destructors, or write explicit casts for upcasts
    • destructors that are guaranteed to run on scope exit mean I can use RAII patterns like mutexes that automatically unlock when the scope is exited
  • The source got shorter; the portion I ported went from 1483 lines to 1133, or a ~30% reduction.
  • The binary got a LOT bigger. I had one function get literally 5x as big. I haven’t figured out why this happened yet. I used templates to make the table generic, but I was extremely careful to make sure that the template only generated a small amount of code — basically just the hash lookup routine, which is small (note: the hash function for strings was not templated or inlined). But another issue is that the C++ compiler appears to emit multiple copies of the same function in the same object file! For example, I found some virtual destructors emitted literally three times in the same file. Why is this?
  • I just heard back from a security guru from the Google security team, who said that C is often easier to audit than C++ because it’s easier to figure out what is actually going on, without having to dig through layers of abstraction. This surprised me (maybe it shouldn’t have, since Sam Quigley said the same thing in a comment on my last entry), but I was also a little bit relieved.

I’m leaning towards sticking with C, for the following reasons:

  • C++ compilers aren’t very good at keeping things small, even when you are juducious with your use of templates.
  • C++ compilers are much more complicated that C compilers, and therefore not as ubiquitous or as easy to trust generally.
  • C isn’t harder to audit for security than C++, and may actually be easier.

I’ll try to take some of the lessons I learned from my partial C++ port to make the C more readable.

Posted in Uncategorized | 8 Comments

Gazelle/upb status and plans (aka: On Releasing)

This summer my friends Ben and Mike gave me grief about never releasing anything. Their criticism is definitely valid to some degree. I’ve been working on Gazelle for about two years now, and upb for almost one. Gazelle has had four releases in that time, but they have mostly focused on moving Gazelle to where I think it ought to be, as opposed to releasing something hacky that people can actually use now. There is a class of problems that Gazelle is useful for now, but it is pretty small in comparison to the amount of work I’ve put in.

I haven’t released upb at all yet, and my last message indicating I’m thinking of porting it to C++ will probably make skeptical readers think I’m moving farther away from a release rather than closer to one.

Since I agree that my progress doesn’t look too promising to someone observing from the outside, let me say where I think these projects currently are, where they’re going, and when they’re likely to release.

First of all, Gazelle is currently pushed on the stack until I have upb released. The reason is that I realized that Protocol Buffers are the answer to two big problems I was facing with Gazelle:

  1. byte-code format: right now the Gazelle byte-code format is LLVM’s BitCode, which is the format LLVM uses for storing its byte-code internally. I invested a lot into BitCode (you’ll notice my name is on the linked document), including writing a standalone encoder and decoder (230 lines of Lua and 856 lines of C, respectively). But this was before I worked at Google or knew about Protocol Buffers. Protocol Buffers are much easier to use because they have a formal schema (the .proto file) that can generate nice APIs and help you out with backward compatibility. Without a format schema, BitCode makes you resort to things like an ad hoc text file that describes the schema. This approach was showing its limits.
  2. parse tree format: I always knew I wanted Gazelle to be capable of generating parse trees in some kind of standard format. Protocol Buffers end up being a match made in heaven, since they are isomorphic to parse trees in a very deep way. Indeed, the ast system for the Elkhound Parser is very much like Protocol Buffers in that you define your parse tree format and it generates classes for representing your AST.

Since Gazelle is gated on upb, the question then is: when will upb release? Why hasn’t it released already?

A few months ago I was working on upb for 100% of my time at work. I had banked 20% time for a while, and I was also a bit burned out on my 80% project, so my manager very graciously gave me the liberty to work on upb for all of my working hours.

During that time upb made progress in several areas. It got some better benchmarks and tests, and I fleshed out the upb compiler so that it wasn’t dependent on the official Protocol Buffers compiler for bootstrapping. Maybe most importantly, I worked a lot on the in-memory message format to figure out how to make it work well with dynamic languages.

My goal during that time was to write a Python extension that a few initial internal-to-Google customers could use. The value proposition is that it would be API-compatible with what they were already using, but many times faster. I wrote said extension, which was incomplete (supported decoding only, not encoding), but looked complete enough to use for this case.

By this time I was approaching the amount of time I could reasonably ask from my manager at work, so I had to tie up the loose ends and get it into my initial customer’s hands. I put all the pieces together and tried it out, but then ran into a problem; I hadn’t realized that this initial customer was using an old deprecated feature of Protocol Buffers called MessageSet. There was no way I could support MessageSet without significant changes. I was defeated for the moment. I had to take a break for a few months and re-devote my time to my 80% project.

I mention this all just to illustrate that I do have actual customers that I am targeting, and I have had aggressive pushes to deliver something to those customers, but unfortunately my work wasn’t complete enough for them yet.

This brings us up to now. In the last week or two, I have made several strides, including executing on part of a design that will get me MessageSet support. I have also developed an interface for a “pick parser”, which lets you pull only a small subset of fields out of a protobuf. This will be a big win for use cases that only need a few fields from a very large proto, and I have a customer internal-to-Google who is very interested in this interface.

Meanwhile I’m very interested in trying to get the upb Python extension into AppEngine, because I think it could be a huge win there since users aren’t allowed to load custom Python extensions. This means that currently, people trying to use protocol buffers on AppEngine are limited to pure-Python extensions that are much slower than a C extension can be. But to get into AppEngine I will need to get a security audit, which is part of the reason I am leaning towards C++ at this point. I think C++ will make the code shorter and less gnarly (fewer casts), which should lead to easier verifiability. I converted one header file so far, and it got 38% smaller and much easier to read.

I hesitate to make schedule estimates, but my main purpose is to impress on my possibly-impatient audience that:

  • I do have motivation to release.
  • I do have initial customers and initial use cases.
  • I am making progress.
  • I am currently focused on delivering (1) a pick parser, (2) a Python extension, (3) an easily-auditable code-base.
  • I look forward to being able to announce my first release!

Yours,
Josh

Posted in Uncategorized | 4 Comments