Archive

Archive for June, 2009

Books About Parsing

June 26th, 2009 Josh 3 comments

I’m in Portland this weekend and picked up a few more parsing-related books. I wanted to list them here, since I wasn’t previously aware they existed despite my intense interest in this field.

knuth

The first is Selected Papers on Computer Languages, by Donald Knuth. I’m embarrassed to say that I wasn’t previously aware how much of the early theories and models about parsing come from Knuth. This book contains several of the original papers about the theories of LL(k), LR(k), etc:


The second is The Theory of Parsing, Translation, and Compiling. Volume I: Parsing, by Alfred Aho and Jeffrey Ullman. If those names sound familiar, it’s because they’re the guys who wrote the Dragon Book.


The last is Beautiful Code: Leading Programmers Explain How They Think. I had seen this book at various bookstores before but never felt compelled to look at it. But today I flipped through it and realized that many of the essays are about parsing:

  • A Regular Expression Matcher, by Brian Kernighan. “I suggested to Rob [Pike] that we find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and nontrivial class of patterns. Ideally the code would fit on a single page. Rob disappeared into his office. As I remember it now, he emerged in no more than an hour or two with [these] 30 lines of C code.
  • Finding Things, by Tim Bray. Deals with searching logfiles with regular expressions.
  • Correct, Beautiful, Fast (in That Order): Lessons from Designing XML Verifiers, by Elliotte Rusty Harold. Deals with parsing/verifying XML, what more needs to be said?
  • Python’s Dictionary Implementation: Being All Things to All People, by Andrew Kuchling. Not about parsing, but about hashing, which I have quite recently been very interested in.


pt2ed

And though I mentioned it once before, I just want to reiterate how awesome Parsing Techniques: A Practical Guide is. Your guidebook to the entire field of parsing.

Categories: Uncategorized Tags:

μpb Status Update

June 26th, 2009 Josh No comments

I haven’t posted many status updates for upb lately. Sometimes that means I’m busy with other things, but right now it means I am working on it feverishly and can hardly stand to take a break from it.

I’m extremely happy with how it’s shaping up. It’s getting more and more complete, and yet still staying quite “micro.” Notably, it just recently crossed 1500 lines of C (I don’t count auto-generated code in this), and it compiles to not quite 10kb of object code. Throw in the auto-generated code (reflection data that describes the types in descriptor.proto — this is what allows loading other proto definitions at runtime) and this jumps to about 16kb. Keep in mind that this effectively has all the major features of the main protobuf implementation (albeit packaged in different ways), but the main protobuf implementation is just over 1MB of object code!

I still think I can reach the main protobuf implementation’s performance, or maybe even exceed it. I will consider the project a failure if I can’t get within 15% of its performance.

Anyway, just wanted to be sure everyone knows I’m still alive and very much working on this. Now back to work.

Categories: upb Tags:

Bit-fields in C99

June 26th, 2009 Josh 2 comments

Recently I came upon some spirited discussion on reddit concerning a blog post that discussed the use of bit-fields in C. As a quick refresher to anyone unfamiliar or rusty on bit-fields, they are a construct in C that lets you specify how many bits different members of a structure should be allocated:

struct foo {
  unsigned int a:1;
  unsigned int b:2;
}

Without the bit-field specifiers (the “:1″ and “:2″ above) this structure would be at least four bytes long, and both “a” and “b” would be capable of representing integers from 0 to 65535. With the bit-field specifiers the structure is likely only one byte long, and “a” and “b” can only represent 0-1 and 0-3, respectively. Bit fields are a way to store many distinct values compactly inside a struct, to save memory.

I am using bitfields in upb to store flags for whether each field is set or not. When I generate a structure definition for a specific proto type, code in the header file looks something like this:

struct google_protobuf_DescriptorProto {
  union {
    uint8_t bytes[1];
    struct {
      bool name:1;  /* = 1, optional. */
      bool field:1;  /* = 2, repeated. */
      bool nested_type:1;  /* = 3, repeated. */
      bool enum_type:1;  /* = 4, repeated. */
      bool extension_range:1;  /* = 5, repeated. */
      bool extension:1;  /* = 6, repeated. */
      bool options:1;  /* = 7, optional. */
    } has;
  } set_flags;
  struct upb_string* name;
  UPB_STRUCT_ARRAY(google_protobuf_FieldDescriptorProto)* field;
  UPB_STRUCT_ARRAY(google_protobuf_FieldDescriptorProto)* extension;
  UPB_STRUCT_ARRAY(google_protobuf_DescriptorProto)* nested_type;
  UPB_STRUCT_ARRAY(google_protobuf_EnumDescriptorProto)* enum_type;
  UPB_STRUCT_ARRAY(google_protobuf_DescriptorProto_ExtensionRange)* extension_range;
  google_protobuf_MessageOptions* options;
};

(that’s from descriptor.h, which is automatically generated from descriptor.proto from the official protobuf implementation).

I “union” the bitfield with bytes so I can set the bytes en-masse more efficiently. This seems like a really nice solution, because then client code that wants to test whether a field is set in a protobuf can write code like:

if(fd->set_flags.has.message_type) {
   // ...
}

So unfortunately the commenters on Reddit were pretty down on bit-fields and the aforementioned article. The points they raised filled me with despair that I would have to abandon bit-fields. This is bad because the only real alternative I would have for this kind of code generation is to generate explicit getters and setters for each bit of each structure! The code would then immediately become quite uglified:

if(google_protobuf_DescriptorProto_has_message_type(fd)) {
  // ...
}

Besides being longer and forcing you to re-type the whole name of the enclosed type, this approach forces you to generate lots of nearly-identical code and pollute the symbol namespace with tons of these useless functions that add no real value. I was bumming about this.

But as I investigated the supposed problems with bit-fields, I became convinced that the problems were tractable and that I could in good conscience press forward with my plans to use them as intended.

Here Be Dragons

“Almost everything about [bit] fields is implementation-dependent.” –K&R, Section 6.9

From my reading of the C99 standard, here are the guarantees you get (or don’t get) with bit-fields.

1. Bit-fields will be packed as tightly as possible, provided they don’t cross storage unit boundaries: YES. From the standard: (6.7.2.1 #10):

An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit.

The first part just means that the compiler can allocate any number of bytes to a bit-field, as long as it has enough bits. So if you write:

struct foo {
  unsigned int a:8;
}

…a will be at least 1 byte long, but could be more. But the second part is where we get the nice guarantee that:

struct foo {
  unsigned int a:1;
  unsigned int b:2;
}

…will always be packed into a single byte. sizeof(struct foo) could be greater than 1, but you are guaranteed that a and b are in the same byte.

2. Bits are allocated low-to-high (or high-to-low): NO. This is not guaranteed by the spec, so an implementation could choose to allocate bits either starting at the high bit or the low bit of each byte. The following program will test an implementation to see which it is:

#include <stdio.h>
int main() {
  union {
    struct {
      unsigned int a:1;
      unsigned int b:2;
      unsigned int c:3;
    } bitfield;
    unsigned char ch;
  } u = {.ch = 0};
 
  u.bitfield.a = 1;
  u.bitfield.c = 7;
  switch(u.ch) {
    case 0x39: printf("Bits allocated LOW-TO-HIGH.\n"); break;
    case 0x9C: printf("Bits allocated HIGH-TO-LOW.\n"); break;
    default: printf("Oddball allocation: 0x%02hhx.\n", u.ch); break;
  }
  return 0;
}

My Goal

What I am trying to do is create a C struct like the above, but also be able to perform data-driven reflection on that C struct based on generic routines like:

INLINE bool upb_msg_is_set(void *s, int index)
{
  return ((char*)s)[index / 8] & (1 << (index % 8));
}

This means that my reflection routines like the one above need to be able to mimic the implementation-defined bit ordering as described above. The routine above does not — it assumes that the bits are assigned from low to high. But a few #ifdefs should make the above not too difficult.

Performance

In theory, the ideal compiler should love bit-fields because they give it maximum information with which to optimize loads and stores. I see a lot of people claim that some compilers generate bad code for bit-fields, but that is a possibility with any construct you use. From what I can see GCC generates very good code for bit-fields.

So my conclusion is that the hurdles of bit-fields are tractable, and unless some new information comes my way, I plan to move forward with my plan to use them.

Categories: upb Tags: