Posted by josh at June 26th, 2009
Recently I came upon some spirited discussion on reddit concerning a blog post that discussed the use of bit-fields in C. As a quick refresher to anyone unfamiliar or rusty on bit-fields, they are a construct in C that lets you specify how many bits different members of a structure should be allocated:
struct foo {
unsigned int a:1;
unsigned int b:2;
}
Without the bit-field specifiers (the “:1″ and “:2″ above) this structure would be at least four bytes long, and both “a” and “b” would be capable of representing integers from 0 to 65535. With the bit-field specifiers the structure is likely only one byte long, and “a” and “b” can only represent 0-1 and 0-3, respectively. Bit fields are a way to store many distinct values compactly inside a struct, to save memory.
I am using bitfields in upb to store flags for whether each field is set or not. When I generate a structure definition for a specific proto type, code in the header file looks something like this:
struct google_protobuf_DescriptorProto {
union {
uint8_t bytes[1];
struct {
bool name:1; /* = 1, optional. */
bool field:1; /* = 2, repeated. */
bool nested_type:1; /* = 3, repeated. */
bool enum_type:1; /* = 4, repeated. */
bool extension_range:1; /* = 5, repeated. */
bool extension:1; /* = 6, repeated. */
bool options:1; /* = 7, optional. */
} has;
} set_flags;
struct upb_string* name;
UPB_STRUCT_ARRAY(google_protobuf_FieldDescriptorProto)* field;
UPB_STRUCT_ARRAY(google_protobuf_FieldDescriptorProto)* extension;
UPB_STRUCT_ARRAY(google_protobuf_DescriptorProto)* nested_type;
UPB_STRUCT_ARRAY(google_protobuf_EnumDescriptorProto)* enum_type;
UPB_STRUCT_ARRAY(google_protobuf_DescriptorProto_ExtensionRange)* extension_range;
google_protobuf_MessageOptions* options;
};
(that’s from descriptor.h, which is automatically generated from descriptor.proto from the official protobuf implementation).
I “union” the bitfield with bytes so I can set the bytes en-masse more efficiently. This seems like a really nice solution, because then client code that wants to test whether a field is set in a protobuf can write code like:
if(fd->set_flags.has.message_type) {
// ...
}
So unfortunately the commenters on Reddit were pretty down on bit-fields and the aforementioned article. The points they raised filled me with despair that I would have to abandon bit-fields. This is bad because the only real alternative I would have for this kind of code generation is to generate explicit getters and setters for each bit of each structure! The code would then immediately become quite uglified:
if(google_protobuf_DescriptorProto_has_message_type(fd)) {
// ...
}
Besides being longer and forcing you to re-type the whole name of the enclosed type, this approach forces you to generate lots of nearly-identical code and pollute the symbol namespace with tons of these useless functions that add no real value. I was bumming about this.
But as I investigated the supposed problems with bit-fields, I became convinced that the problems were tractable and that I could in good conscience press forward with my plans to use them as intended.
Here Be Dragons
“Almost everything about [bit] fields is implementation-dependent.” –K&R, Section 6.9
From my reading of the C99 standard, here are the guarantees you get (or don’t get) with bit-fields.
1. Bit-fields will be packed as tightly as possible, provided they don’t cross storage unit boundaries: YES. From the standard: (6.7.2.1 #10):
An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit.
The first part just means that the compiler can allocate any number of bytes to a bit-field, as long as it has enough bits. So if you write:
struct foo {
unsigned int a:8;
}
…a will be at least 1 byte long, but could be more. But the second part is where we get the nice guarantee that:
struct foo {
unsigned int a:1;
unsigned int b:2;
}
…will always be packed into a single byte. sizeof(struct foo) could be greater than 1, but you are guaranteed that a and b are in the same byte.
2. Bits are allocated low-to-high (or high-to-low): NO. This is not guaranteed by the spec, so an implementation could choose to allocate bits either starting at the high bit or the low bit of each byte. The following program will test an implementation to see which it is:
#include
int main() {
union {
struct {
unsigned int a:1;
unsigned int b:2;
unsigned int c:3;
} bitfield;
unsigned char ch;
} u = {.ch = 0};
u.bitfield.a = 1;
u.bitfield.c = 7;
switch(u.ch) {
case 0x39: printf("Bits allocated LOW-TO-HIGH.\n"); break;
case 0x9C: printf("Bits allocated HIGH-TO-LOW.\n"); break;
default: printf("Oddball allocation: 0x%02hhx.\n", u.ch); break;
}
return 0;
}
My Goal
What I am trying to do is create a C struct like the above, but also be able to perform data-driven reflection on that C struct based on generic routines like:
INLINE bool upb_msg_is_set(void *s, int index)
{
return ((char*)s)[index / 8] & (1 << (index % 8));
}
This means that my reflection routines like the one above need to be able to mimic the implementation-defined bit ordering as described above. The routine above does not — it assumes that the bits are assigned from low to high. But a few #ifdefs should make the above not too difficult.
Performance
In theory, the ideal compiler should love bit-fields because they give it maximum information with which to optimize loads and stores. I see a lot of people claim that some compilers generate bad code for bit-fields, but that is a possibility with any construct you use. From what I can see GCC generates very good code for bit-fields.
So my conclusion is that the hurdles of bit-fields are tractable, and unless some new information comes my way, I plan to move forward with my plan to use them.