Friday, March 1, 2013

C/C++ Gripe #1: integer types

C is a great language. When someone asks "what would you change about C?" it is not easy for me to think of something the language just plain got wrong. And while C++ is large and complicated, I generally feel that, for what it is trying to be, it is pretty well done too.

I wanted to preface this entry with that, because the word "gripe" could be taken to mean that I am a C or C++ hater. Not the case; they are my favorite languages that I use regularly. But with the benefit of hindsight, I think it's worth mentioning a few of their design choices here and there that make life difficult, where a genuinely better alternative exists. Which brings me to this entry.

C's integer types are well-explained in the Wikipedia entry C data types. The types char, short, int, long, long long and their unsigned equivalents are defined without specifying their exact size, but instead according to a loose set of rules which, in my experience, are rarely useful.

Until C99 there was no standard way of declaring integers of a specified size (e.g. a 16-bit signed integer). You could declare a signed short; this is guaranteed to be at least 16 bits, but it could be larger. The lack of fixed-width types led to every project reinventing the same typedefs over and over: wxInt32 from wxWidgets, qint32 from Qt, gint32 from glib. Almost every C or C++ library would eventually find itself defining these same typedefs. But thankfully in C99 we got stdint.h, which gives us fixed-width types like int32_t in the standard library! Problem solved, no?
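
To make that concrete, here is roughly the kind of typedef every library used to write, next to the stdint.h equivalent (the "my_" names and the exact underlying types are invented for illustration; real libraries pick the underlying types per-platform with #ifdefs):
// Pre-C99: each library rolls its own fixed-width names and hopes the
// underlying types have the expected sizes on this platform.
typedef short my_int16_t;
typedef int my_int32_t;

// C99 and later: stdint.h provides the names for everyone.
#include <stdint.h>
int16_t a;   // exactly 16 bits
int32_t b;   // exactly 32 bits
uint32_t c;  // exactly 32 bits, unsigned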

Well, unfortunately not quite. int32_t is just a typedef, and there can be multiple primitive types that are 32 bits wide. For example, in the ILP32 programming model, both int and long are 32 bits. So it's totally arbitrary which of these typedefs appears in stdint.h:
// Both of these are equally valid to have in stdint.h:
typedef int int32_t;
typedef long int32_t;
And the really unfortunate part is that int and long are still distinct, incompatible types, which means that this program won't compile:
// From library A:
typedef int a_int32_t;

// Passes a pointer to a function that takes a 32-bit integer.
void regfunc(void (*f)(a_int32_t));

// From library B:
typedef long b_int32_t;

void my_callback(b_int32_t x) { /* ... */ }

int main() {
  regfunc(&my_callback);  // error: incompatible function pointer types
}
This code will fail to compile because void my_callback(long) is not compatible with void f(int), even though both long and int are 32 bits! And in this case both were used through fixed-width typedefs, so the code looks like it should work.

What would have been better is if the primitive types had been defined in terms of the fixed-width types (int32_t, uint32_t, etc.). Then, if desired, the more loosely-defined types like int, long, etc. could be the typedefs. If things were defined this way, you would never run into this problem where two integer types are the same size yet incompatible.
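
To see why, notice that the incompatibility above only exists because there are two distinct 32-bit primitives for the typedefs to disagree about. If there were exactly one, the two libraries' typedefs would necessarily name the same type and the earlier program would compile. A sketch (regfunc is given a trivial body here just so the example links):
// With a single 32-bit primitive, the two typedefs cannot disagree.
typedef int a_int32_t;
typedef int b_int32_t;  // same underlying type as a_int32_t

void regfunc(void (*f)(a_int32_t)) { f(42); }

void my_callback(b_int32_t x) { /* ... */ }

int main() {
  regfunc(&my_callback);  // compiles: the function pointer types are identical
}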

A possible idea for improving the status quo would be to make primitive types compatible if they are the same size. That would make it legal to convert between the two function pointer types, and would probably require basically no work in real-world compilers to enable.

However, that doesn't solve a similar problem in C++, which happens when you partially specialize on a_int32_t (for example), only to find that your partial specialization doesn't apply to b_int32_t. Fixing this is not as easy, because some users could have code that depends on these two types having different specializations, even though they are the same size.
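
A minimal sketch of that problem, using a full specialization for brevity (the same mismatch bites partial specializations) and assuming an ILP32-style platform where int and long are both 32 bits:
#include <iostream>

typedef int a_int32_t;   // library A's 32-bit typedef
typedef long b_int32_t;  // library B's 32-bit typedef

// Primary template: "not the 32-bit type I handle specially."
template <typename T> struct Is32Bit { enum { value = 0 }; };

// A specialization written against library A's typedef...
template <> struct Is32Bit<a_int32_t> { enum { value = 1 }; };

int main() {
  // ...silently fails to apply to library B's typedef, even though
  // both typedefs are 32 bits wide.
  std::cout << Is32Bit<a_int32_t>::value << "\n";  // prints 1
  std::cout << Is32Bit<b_int32_t>::value << "\n";  // prints 0
}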

In closing: if you invent a new language, please make the primitive integer types fixed-width. Your users will thank you.

7 comments:

  1. C and C++ can't do this because they have to work on platforms with chars that aren't 8 bits. If you read the C and C++ standards, you'll notice that types like int8_t are completely optional. I've worked on systems with 24-bit bytes: char, short, and int were all 24 bits, and longs were 48 bits. It would break programs if on one system int and short aliased but long didn't, and on another system int and long aliased but short didn't. The best thing you can do is just stick to int, size_t, and int_least#_t types, which are guaranteed to exist.

    1. I would almost call that gripe #0, that C and C++ go to great lengths to support unusual architectures that are either historic (e.g. PDP-11) or are only used for deeply embedded systems. (I use the qualifier "deeply" since increasingly these days the notion of an "embedded system" is something along the lines of a PowerPC, ARM, or MIPS running Linux with 512 megs of RAM.) Such flexibility should be split off into a separate language (such as "C for microcontrollers") without complicating software development for modern, general-purpose computers (which can even be said to include smartphones, tablets, routers, televisions, and many other allegedly "embedded" systems), which essentially only come in two flavors: ILP32 or LP64.

      Java (for example) takes an approach that greatly simplifies the mental burden on the developer: make assumptions such as 8-bit bytes, IEEE floating point, two's complement arithmetic, etc. which are true on all modern, non-deeply embedded computers. The next version of the C standard should make concessions to the fact that this is the norm, rather than merely one of many equally acceptable possibilities. After all, nearly all computers in the future will be networked, and all Internet standards are based around the notion of an "octet." UTF-18 was an April Fool's joke.

    2. My proposal is totally compatible with weird hardware. I am proposing that int8_t would be an optional type, just like it is now, but when it is present it is a built-in type instead of a typedef.

      Just like now, that means that software using int8_t would fail to compile on the weird hardware you mention. But software that depends on 8-bit bytes is unlikely to work on weird hardware anyway, even if it can compile. It's not doing anyone any favors if the software compiles but doesn't work at all. Assumptions that CHAR_BIT == 8 are deeply baked into most software. You don't get portability to weird systems "for free" with C; you have to be vigilant about avoiding these assumptions, and most software isn't vigilant because portability to these systems is not a priority.

      Just like now, software that *does* care about portability to hardware like this could stick to int, size_t, and int_leastn_t like you suggest. I think my proposal could actually have worked if C and C++ had been designed this way.
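
      An aside: one common pre-C11 idiom for making the CHAR_BIT == 8 assumption explicit at compile time (C11's _Static_assert is the nicer modern spelling), next to the types that are guaranteed to exist. A sketch:
      #include <limits.h>   // CHAR_BIT
      #include <stdint.h>   // int_least32_t is mandatory; int32_t is optional

      // Fails to compile (negative array size) on platforms without 8-bit
      // bytes, instead of compiling and then misbehaving at runtime.
      typedef char bytes_are_8_bits[CHAR_BIT == 8 ? 1 : -1];

      // Assumes only "at least 32 bits," so it exists on every platform.
      int_least32_t counter;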

  2. I for one don't get the "long" type. It feels archaic and often misunderstood, since it was misunderstood by design.

    Had to make a patch for locally compiled (msvc) library - I think jpeglib where it had "typedef long int32_t" which was conflicting with

    1. It looks like perhaps your comment got cut off in the middle?

  3. The thing I hate about C/C++ integers is the set of rules for promotions and overflow in arithmetic. Though not impossible to learn and remember, the logic is just a bit too complicated to collapse nicely in all cases. Signed/unsigned types have very different behaviors, and the "integer rank" relative to "int" (and relative to each other in a binary op) makes a huge difference. All of this makes it challenging to write simple and robust overflow checking logic in C, though it's a little bit easier in C++.

    1. Totally agree, Dave. You might have seen I actually wrote a whole article about overflow checking in C++: http://blog.reverberate.org/2012/12/testing-for-integer-overflow-in-c-and-c.html
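
      A small illustration of the kind of case the comment above describes, where the signed/unsigned distinction and integer rank pull in different directions (assumes the typical 16-bit short and 32-bit int; a sketch, not an exhaustive tour of the rules):
      #include <stdio.h>

      int main() {
        unsigned int u = 1;
        int i = -1;
        // int vs. unsigned int (same rank): i is converted to unsigned and
        // becomes a huge value, so this prints 0.
        printf("%d\n", i < u);

        unsigned short us = 1;
        short s = -1;
        // short vs. unsigned short (rank below int): both promote to signed
        // int first, so this prints 1.
        printf("%d\n", s < us);
        return 0;
      }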
