Books About Parsing

Posted by josh at June 26th, 2009

I’m in Portland this weekend and picked up a few more parsing-related books. I wanted to list them here, since I wasn’t previously aware they existed despite my intense interest in this field.

The first is Selected Papers on Computer Languages, by Donald Knuth. I’m embarrassed to say that I wasn’t previously aware how much of the early theories and models about parsing come from Knuth. This book contains several of the original papers about the theories of LL(k), LR(k), etc:

The second is The Theory of Parsing, Translation, and Compiling. Volume I: Parsing, by Alfred Aho and Jeffrey Ullman. If those names sound familiar, it’s because they’re the guys who wrote the Dragon Book.

The last is Beautiful Code: Leading Programmers Explain How They Think. I had seen this book at various bookstores before but never felt compelled to look at it. But today I flipped through it and realized that many of the essays are about parsing:

  • A Regular Expression Matcher, by Brian Kernighan. “I suggested to Rob [Pike] that we find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and nontrivial class of patterns. Ideally the code would fit on a single page. Rob disappeared into his office. As I remember it now, he emerged in no more than an hour or two with [these] 30 lines of C code.
  • Finding Things, by Tim Bray. Deals with searching logfiles with regular expressions.
  • Correct, Beautiful, Fast (in That Order): Lessons from Designing XML Verifiers, by Elliotte Rusty Harold. Deals with parsing/verifying XML, what more needs to be said?
  • Python’s Dictionary Implementation: Being All Things to All People, by Andrew Kuchling. Not about parsing, but about hashing, which I have quite recently been very interested in.

And though I mentioned it once before, I just want to reiterate how awesome Parsing Techniques: A Practical Guide is. Your guidebook to the entire field of parsing.

pt2ed

Posted in Uncategorized| 3 Comments | 

μpb Status Update

Posted by josh at June 26th, 2009

I haven’t posted many status updates for upb lately. Sometimes that means I’m busy with other things, but right now it means I am working on it feverishly and can hardly stand to take a break from it.

I’m extremely happy with how it’s shaping up. It’s getting more and more complete, and yet still staying quite “micro.” Notably, it just recently crossed 1500 lines of C (I don’t count auto-generated code in this), and it compiles to not quite 10kb of object code. Throw in the auto-generated code (reflection data that describes the types in descriptor.proto — this is what allows loading other proto definitions at runtime) and this jumps to about 16kb. Keep in mind that this effectively has all the major features of the main protobuf implementation (albeit packaged in different ways), but the main protobuf implementation is just over 1MB of object code!

I still think I can reach the main protobuf implementation’s performance, or maybe even exceed it. I will consider the project a failure if I can’t get within 15% of its performance.

Anyway, just wanted to be sure everyone knows I’m still alive and very much working on this. Now back to work.

Posted in upb| No Comments | 

Bit-fields in C99

Posted by josh at June 26th, 2009

Recently I came upon some spirited discussion on reddit concerning a blog post that discussed the use of bit-fields in C. As a quick refresher to anyone unfamiliar or rusty on bit-fields, they are a construct in C that lets you specify how many bits different members of a structure should be allocated:

struct foo {
  unsigned int a:1;
  unsigned int b:2;
}

Without the bit-field specifiers (the “:1″ and “:2″ above) this structure would be at least four bytes long, and both “a” and “b” would be capable of representing integers from 0 to 65535. With the bit-field specifiers the structure is likely only one byte long, and “a” and “b” can only represent 0-1 and 0-3, respectively. Bit fields are a way to store many distinct values compactly inside a struct, to save memory.

I am using bitfields in upb to store flags for whether each field is set or not. When I generate a structure definition for a specific proto type, code in the header file looks something like this:

struct google_protobuf_DescriptorProto {
  union {
    uint8_t bytes[1];
    struct {
      bool name:1;  /* = 1, optional. */
      bool field:1;  /* = 2, repeated. */
      bool nested_type:1;  /* = 3, repeated. */
      bool enum_type:1;  /* = 4, repeated. */
      bool extension_range:1;  /* = 5, repeated. */
      bool extension:1;  /* = 6, repeated. */
      bool options:1;  /* = 7, optional. */
    } has;
  } set_flags;
  struct upb_string* name;
  UPB_STRUCT_ARRAY(google_protobuf_FieldDescriptorProto)* field;
  UPB_STRUCT_ARRAY(google_protobuf_FieldDescriptorProto)* extension;
  UPB_STRUCT_ARRAY(google_protobuf_DescriptorProto)* nested_type;
  UPB_STRUCT_ARRAY(google_protobuf_EnumDescriptorProto)* enum_type;
  UPB_STRUCT_ARRAY(google_protobuf_DescriptorProto_ExtensionRange)* extension_range;
  google_protobuf_MessageOptions* options;
};

(that’s from descriptor.h, which is automatically generated from descriptor.proto from the official protobuf implementation).

I “union” the bitfield with bytes so I can set the bytes en-masse more efficiently. This seems like a really nice solution, because then client code that wants to test whether a field is set in a protobuf can write code like:

if(fd->set_flags.has.message_type) {
   // ...
}

So unfortunately the commenters on Reddit were pretty down on bit-fields and the aforementioned article. The points they raised filled me with despair that I would have to abandon bit-fields. This is bad because the only real alternative I would have for this kind of code generation is to generate explicit getters and setters for each bit of each structure! The code would then immediately become quite uglified:

if(google_protobuf_DescriptorProto_has_message_type(fd)) {
  // ...
}

Besides being longer and forcing you to re-type the whole name of the enclosed type, this approach forces you to generate lots of nearly-identical code and pollute the symbol namespace with tons of these useless functions that add no real value. I was bumming about this.

But as I investigated the supposed problems with bit-fields, I became convinced that the problems were tractable and that I could in good conscience press forward with my plans to use them as intended.

Here Be Dragons

“Almost everything about [bit] fields is implementation-dependent.” –K&R, Section 6.9

From my reading of the C99 standard, here are the guarantees you get (or don’t get) with bit-fields.

1. Bit-fields will be packed as tightly as possible, provided they don’t cross storage unit boundaries: YES. From the standard: (6.7.2.1 #10):

An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit.

The first part just means that the compiler can allocate any number of bytes to a bit-field, as long as it has enough bits. So if you write:

struct foo {
  unsigned int a:8;
}

…a will be at least 1 byte long, but could be more. But the second part is where we get the nice guarantee that:

struct foo {
  unsigned int a:1;
  unsigned int b:2;
}

…will always be packed into a single byte. sizeof(struct foo) could be greater than 1, but you are guaranteed that a and b are in the same byte.

2. Bits are allocated low-to-high (or high-to-low): NO. This is not guaranteed by the spec, so an implementation could choose to allocate bits either starting at the high bit or the low bit of each byte. The following program will test an implementation to see which it is:

#include 
int main() {
  union {
    struct {
      unsigned int a:1;
      unsigned int b:2;
      unsigned int c:3;
    } bitfield;
    unsigned char ch;
  } u = {.ch = 0};

  u.bitfield.a = 1;
  u.bitfield.c = 7;
  switch(u.ch) {
    case 0x39: printf("Bits allocated LOW-TO-HIGH.\n"); break;
    case 0x9C: printf("Bits allocated HIGH-TO-LOW.\n"); break;
    default: printf("Oddball allocation: 0x%02hhx.\n", u.ch); break;
  }
  return 0;
}

My Goal

What I am trying to do is create a C struct like the above, but also be able to perform data-driven reflection on that C struct based on generic routines like:

INLINE bool upb_msg_is_set(void *s, int index)
{
  return ((char*)s)[index / 8] & (1 << (index % 8));
}

This means that my reflection routines like the one above need to be able to mimic the implementation-defined bit ordering as described above. The routine above does not — it assumes that the bits are assigned from low to high. But a few #ifdefs should make the above not too difficult.

Performance

In theory, the ideal compiler should love bit-fields because they give it maximum information with which to optimize loads and stores. I see a lot of people claim that some compilers generate bad code for bit-fields, but that is a possibility with any construct you use. From what I can see GCC generates very good code for bit-fields.

So my conclusion is that the hurdles of bit-fields are tractable, and unless some new information comes my way, I plan to move forward with my plan to use them.

Posted in upb| 1 Comment | 

pbstream’s name

Posted by josh at May 3rd, 2009

I hate naming things. I’m starting to realize that pbstream is outgrowing its name. That’s the one thing I always thought was brilliant about the iPod’s name. When the iPod started playing video it was no problem — it’s a pod! Why shouldn’t a pod play video? There’s no way to outgrow the name “iPod.” Brilliant.

I had no such luck with pbstream. The tagline is currently:

pbstream - a stream-oriented implementation of protocol buffers.

It’s true that offering streaming semantics is one distinguishing feature of my implementation, but the way it’s growing it’s no longer really “stream-oriented.” It will offer both stream-oriented and structure-oriented semantics. Both will be equally supported/encouraged — it will be a matter of what best suits your needs.

If anything, the number one distinguishing feature of my implementation is that it is minimal. It gives you a set of tools to use whatever paradigm is the right trade-off for you. It gives you building blocks to assemble as you see fit. There are never “riders” — things like memory management that you have to take along with the parts you really want.

So what’s in a name? Desired characteristics:

  • Unique, reasonably Googleable.
  • Communicates that it is a protobuf implementation, and if possible communicates the philosophy described above.
  • The name (or a reasonable abbreviation thereof) makes for a nice prefix that can be appended to all my identifiers. Right now it’s pbstream_this and pbstream_that.

Originally I just called it “pb” but that was a tad bit too generic. I slightly like “pblab” (communicates that it gives you building blocks for forging your own strategies), but I dislike the connotation that it is unfinished, unpolished, not production quality.

Perhaps I could give it a name, but keep using “pb_” in my identifiers. It could also help me organize the names of the different parts. I could use names like:

  • pb_stream_parser_* for the stream parser.
  • pb_struct_* for the code that defines an in-memory structure.

I kind of like that. But I still need a name for the package itself that is better than “pb”. Ideas?

Posted in upb| 9 Comments | 

“Hey! It’s been months! What happened?”

Posted by josh at May 3rd, 2009

…reads the latest comment on my last entry. It’s true, I’ve gone silent for a few months. Work on pbstream has languished. The answer to “what happened?” is a combination of a personal life that got crazy and a brick wall I hit in the design of pbstream. But I think I have resolved the latter of those at least and work is progressing again.

So what is the brick wall I hit with pbstream? I had conflicting goals that I couldn’t figure out how to reconcile. One goal was to include as little policy related to memory management as humanly possible. In other words, if you’re using pbstream to store fully-parsed protobufs in memory — I’ll use the SAX vs. DOM analogy again, where this is the DOM case — then I want pbstream to not define any semantics for how the memory for that tree is managed. Is it reference-counted, garbage-collected, or is it just allocated/deallocated in one big chunk? Every runtime already has its own answer to this question. I don’t want to include a memory management strategy, which adds to the code size and complexity, just to have it be an annoying thing that you have to integrate into whatever runtime you’re already using.

Or, as I said in a previous entry: Oh fantastic! Because definitely the one thing that my application doesn’t already have is a memory manager.

On the other hand, I realized that I really do want the ability to pass an in-memory protobuf representation between language implementations. For example, if there’s a C library that conjures up some complex protobuf, I want to be able to pass that parsed in-memory protobuf to Python without serializing/deserializing. I want different runtimes to be able to look at this memory and make sense of it.

So far so good. This alone doesn’t require memory management. But I also wanted two things that do pose a memory-management problem:

  • I wanted to pass protobufs between language implementations without copying.
  • I wanted one language implementation to be able to modify the protobuf that had been created by a different language implementation.

The first is a problem because if you take a protobuf you created in C and pass it to Python, once that Python returns C has no idea whether Python has taken references to any of the protobuf’s data. If C frees the protobuf, Python could try to later read from the freed memory. If C doesn’t free the protobuf, the memory will leak. The fundamental problem is that C and Python have no way of cooperating to be sure the memory is freed only when the last reference that either of them holds is dropped.

The second is a problem for a similar reason. Suppose you take a protobuf you created in C and pass it to Python, and Python wants to mutate that protobuf. Suppose the protobuf has a submessage, which is implemented as a reference to that other message. If in python you say:

message.submessage = other_submessage

…Python now has to decide what to do with the submessage that message already had. Should it free it or not? If you free it and C had a reference to it, C references freed memory. If you don’t free it and C doesn’t have a reference to it, it leaks.

I fretted over the question of what to do about this for weeks. I didn’t want to give up the ability to pass protobufs between languages without serializing/deserializing. But I was loath to include the slightest amount of memory management code because of the implied complexity, run-time overhead, and code size effects. So everything ground to a halt while I tried to reconcile these conflicting desires in my head.

At one point I convinced myself that the only way out was reference-counting. The in-memory protobufs themselves would contain one extra integer for the reference count, and the code would contain just these little extra reference count increments/decrements and checks. Then C and Python would both maintain reference counts to these shared data structures. For Python it would be a sort of two-stage reference counting, since the individual Python objects would be reference-counted, and when their reference count dropped to zero they would decrement the reference count of the shared protobuf structure one.

Eww, Eww, EWW!! Now maybe you can understand a bit better why I haven’t gotten anything done for a while. There’s no way I could be very happy implementing this scheme.

So quite recently I found a way to relax the sharing requirements just enough that I can get away with not implementing any memory management. My new strategy is:

  • language A can reference language B’s protobufs, but only for as long as B guarantees that all the references will continue to be valid. In most cases this just means that when B calls A and passes a protobuf, A can only reference B’s protobufs before it returns control back to B.
  • In general, languages cannot mutate each other’s protobufs unless they have some special arrangement with regards to memory management. I thought I really needed this capability in Gazelle, but I’m no longer convinced I do.

So with that resolved, work is resuming, and I hope to have some results soon. I always hate making promises though, because I’m unwilling to press on when I don’t have a satisfactory answer to a question (as the last month demonstrates), and this can delay my progress long beyond my original intentions.

Posted in upb| No Comments | 

Next Postings »