Archive

Archive for March, 2009

C Compilers

March 3rd, 2009 Josh 3 comments

So the more I work on pbstream (and I’m working on it quite a lot) the more obsessed I get with making it perfect. I’m no longer just out to just write some good software, my new goal is to utterly demolish this problem space with blindingly utter perfection. It’s going to be one of the best things I’ve ever done.

So my latest OCD goal this quest is to make it truly 100% C99 compliant: compilable on the weirdest and most nontraditional target, and free of any undefined behavior. I like most people have spent several years lulled into the x86/GCC monoculture. It’s time for me to transcend that narrow-minded view.

So I’ve set out to obtain every C compiler and lint tool I can possibly find, hoping that each one can show me even the smallest spot of imperfection that I can proceed to purge from my crystalline statements and declarations. I was ashamed to realize that I couldn’t even think of a time I had used a non-GCC/MSVC compiler in almost five years, and I didn’t know which ones were available to me. I am happy to say that I have eradicated some of that ignorance, and already have found some bugs that GCC didn’t show me!

So what C compilers are out there? Here are some of the ones I came across:

  • Sun Studio, Sun’s C and C++ compiler. To my surprise, it is available for free. Includes a lint tool.
  • Intel C++ Compiler. You can obtain a non-commercial version for free if you absolutely, totally, cross-your-heart promise that you are not compensated in any way for your work. I have a vision of the Intel guys following around open-source developers at conferences and revoking them their non-commercial licenses at the first sign of any swag.
  • Digital Mars C and C++ compiler, which is freely available for Windows and DOS.
  • Visual C++ Express is a free download from Microsoft. Apparently has very poor C99 support though.
  • The Portland Group has a C and C++ compiler, but there doesn’t seem to be any way to use it without paying several hundred dollars.
  • Comeau has a C and C++ compiler that takes great pride in its standards-compliance. You just have to get over the curiously awful web page, which actually says “Bursting With So Much Language Support It Hurts!” No free use (though at $50, it’s not a big hurdle — I paid more for a parking ticket today. If only I hadn’t had six such parking tickets to pay. For parking on my own street. But that’s a story for another day). There is a web form (!) where you can paste your code and see error/warning messages though, which could be all you need to test standards-compliance.
  • SDCC, the Small Device C Compiler is a GPL’d C compiler that targets tiny embedded processors. I was looking quite forward to this one as a target that is wildly different from other C compilers. Unfortunately, though it has partial C99 support, it does not support 64-bit integer types, and I’m not quite willing tilt at that particular windmill of making pbstream portable to compilers without 64-bit integer support.
  • TCC, the Tiny C Compiler. Fabrice Bellard is an absolute bad-ass, but unfortunately TCC doesn’t seem to support much of C99 (it choked on a for loop that defined a variable).
  • Clang, the LLVM compiler. My version of Ubuntu doesn’t seem to have this available, and I always find LLVM a bit of a pain to compile and install from source, but I would like to try this when I get the chance.

So this should be enough variety that I can feel confident making smug assertions about pbstream’s portability. Now the only question is whether I’ll shell out $50 for the Comeau compiler just for that added potential for portability smugness.

Categories: upb Tags:

The art of hashing

March 1st, 2009 Josh 9 comments

It might sound strange for me to call hashing an art. After all, it’s about the most fundamental data structure we’ve got, after arrays, linked lists, and binary trees. It’s been studied extensively for fifty years. What amount of “art” could possibly be left in such a topic?

I’ve learned to anticipate that there are dark and unstudied corners in even the most fundamental computer science topics. And hashing appears to be no exception. I’m investigating hashing at the moment because I need hash tables in a few places for my rapidly-developing Protocol Buffers implementation pbstream. So I wanted to take a quick survey of the state-of-the-art in hash tables, and implement a minimal hash table that suits my needs.

What are my needs? If someone defines a protobuf like so:

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}

…I need to build two hash tables for looking up fields: one keyed by number (1, 2, 3) and one keyed by name (name, id, email).

You might say “why not just use an array for looking up fields by number?” Indeed that’s what I’ll do in most cases, but field numbers can be as large as 2**27 and needn’t be allocated densely. So gracefully degrading to a hash table is important (one practical reason why the field numbers might not be dense is if the client is using extensions). To get the best of both worlds I’ll follow Lua’s lead and have the field number table be a hybrid array/hashtable, where the size of the array part is such that it’s at least half full.

My usage pattern for these hashtables is that I’ll build them once (when I parse a .proto file), then do lookups in the critical path of parsing. So insert time is practically irrelevant, but lookup time is extremely important, because it’s in my critical path. I’ll gladly trade some memory for fast lookups.

So I figured I’d take a few hours to discover the state-of-the-art in hashtables, implement that, and be on my merry way. Unfortunately there doesn’t seem to be much agreement about what the state of the art even is!

Let’s start with even the simple and most basic question: how big should a hashtable be? The literature differs here: some of it claiming that hashtables should have sizes that are prime numbers and some saying they should be powers of two.

But once you’ve decided that, you’ve only just entered the jungle of collision-resolution strategies. There are a lot of them, and it’s not at all clear to me that the trade-offs are well understood. A list of ones I’ve come across, and my current understanding of what they mean:

  • linear probing: if your hash function is h(k), then a collision in T[h(k)] means you should also try T[h(k)+i] for i=1 to some limit.
  • quadratic probing: like linear probing, but also throw in a T[h(k) + i + j^2], to spread it out a little.
  • double hashing: like linear probing, but you scale the jumps by another hash function: T[h(k) + g(k)*i].
  • chaining: all entries that collide are in a linked list, so for lookups you search this list. chaining can be either internal (table entries have both values and links, links point to other locations in the table) or external (table is just pointers to the head of linked lists, inserts always allocate a new node).
  • cuckoo hashing: Wikipedia makes this sound like hashing’s best-kept-secret. You use two hash functions: if you do an insertion and the spot you want is taken, you kick whatever was already there out and put it in its alternate spot.
  • two-way chaining: use two hash functions (like cuckoo and double hashing), but you use the two hash values to find the two possible chains the value could be in.

Few papers seem to offer a comprehensive account of the trade-offs between these. Some authors claim that linear probing is best on modern CPUs because of excellent locality (cache-friendly), but this paper [PDF] takes exception to that claim, saying that the better overall performance of double-hashing negates the cache effects. And indeed, none of the real-world implementations I found use linear probing.

A quick survey of mainstream hashing implementations shows that they can’t agree on what’s best either.

  • Lua uses internal chaining with powers-of-two table sizes and “Brent’s variation” (so-named because of a 1973 paper called Reducing the retrieval time of scatter storage techniques — so much for caching-conscious schemes).
  • Ruby uses internal chaining with prime table sizes.
  • Python uses a custom probing scheme. You have to read the comments in the source file itself — they do more to make hashing sound like black magic than I possible could.
  • Perl appears to use external chaining, but its code is a bit hard to follow.

Even if you decide on a collision strategy, you still have to decide on a hash function, and there are surprises lurking for you there as well. Conventional wisdom about hashing is that you want a hash function that randomly distributes the inputs to the outputs. But Python and Lua both hash integers to themselves (so h(5) = 5)!

There are just a dizzying number of variations on hashing. Given my relatively simple use case (one-time construction, lookup speed trumps all) you’d think there would be a clear and obvious answer. But I don’t see one.

Given my deep respect for Lua’s implementation, I’m tempted to just start with that and worry about trying to optimize it later. I’m slightly disconcerted with its focus on getting good performance even when the table is quite full, because as I mentioned for my use case I’ll gladly trade modest amounts of RAM for faster lookup. At least it’s someplace to start.

But seriously, who knew that hash tables had so many variations, with no clear winner?

Categories: upb Tags: