Transcoding adventures with C

Today I spent a lot of time figuring out the state of the art for character set support in C. I wanted to summarize my findings: hopefully distilling what I’ve learned will be useful to someone, as well as giving me the chance to run this by people who are really good at this stuff (I’m looking at you, Matt Brubeck :).

A little background: I am currently working on a parsing framework – something designed to be an improvement on the current generation of parser/lexer generation tools like flex, bison, and ANTLR. Naturally, such a thing designed in 2007 will have to be encoding aware to deserve even a cursory glance from today’s sophisticated and culturally-sensitive programmers, so I spent today figuring out how to do that using my chosen tool: C.

After a lot of investigating, I arrived at this mental model:

Transcoding

“Encoded text” would be bytes of data in UTF-8, ASCII, Shift JIS, or any other character encoding. It’s data you would send across the wire as part of an HTTP request or response (provided you tell your peer what encoding you’re using), store in a file, or process as a character string in a language like C.

C programmers can transcode encoded text between two encodings using the iconv library. Iconv is a transcoding API that first appeared in HP-UX, but which was subsequently standardized as part of the Single Unix Specification. There is a GNU implementation that is used in Linux and probably other free UNIXes also.

The interface to iconv is extremely simple: you give it encoded data in one encoding, and it will transcode that data to another encoding. Here is some sample code:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
    // "\370" is the single byte 0xF8, which is U+00F8 (ø) in ISO-8859-1.
    char latin1_buf[] = "\370";
    char utf8_buf[10];

    // We have to set up a few pointers to this stuff, because iconv()
    // likes to return data by modifying these pointers in place.
    char *latin1_ptr = latin1_buf;
    char *utf8_ptr = utf8_buf;
    size_t inbytesleft = sizeof(latin1_buf) - 1;
    size_t outbytesleft = sizeof(utf8_buf) - 1;

    // Allocate a "conversion descriptor" for converting ISO-8859-1 to UTF-8.
    iconv_t iconv_cd = iconv_open("UTF-8", "ISO-8859-1");
    if(iconv_cd == (iconv_t)-1)
    {
        printf("Unable to create conversion descriptor!\n");
        return 1;
    }

    // Perform the actual conversion. On success, iconv() returns the number
    // of non-reversible conversions it performed (not a character count).
    size_t irreversible = iconv(iconv_cd, &latin1_ptr, &inbytesleft,
                                          &utf8_ptr, &outbytesleft);

    // Print the results
    if(irreversible == (size_t)-1)
        printf("An error occurred!\n");
    else
    {
        printf("%zu non-reversible conversions\n", irreversible);
        printf("UTF-8 bytes: ");
        for(size_t i = 0; i < (sizeof(utf8_buf) - outbytesleft - 1); i++)
            printf("%02hhx ", (unsigned char)utf8_buf[i]);
        printf("\n");
    }

    iconv_close(iconv_cd);
    return 0;
}

There are a few small complications, but that’s basically it. One thing to note is that in many encodings it is possible to construct invalid byte sequences – byte sequences that don’t follow the rules of the encoding – and so Iconv has a way of reporting errors of this sort to the user: iconv() returns (size_t)-1 and sets errno to EILSEQ. Since this indicates invalid or corrupt data, your options are to discard the corrupt data and press on, or halt and complain.
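Here is a rough sketch of the “discard and press on” option. The helper name convert_lossy() and its signature are made up for illustration; the only real API used is iconv_open(), iconv(), and iconv_close(). It assumes that skipping one input byte at a time is an acceptable recovery policy, and it treats a full output buffer (E2BIG) as a hard failure.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: transcode in[0..in_len) into out, dropping any input
 * byte that iconv() rejects. Returns the number of bytes written to out, or
 * (size_t)-1 on failure. *skipped receives the number of dropped bytes. */
size_t convert_lossy(const char *to, const char *from,
                     const char *in, size_t in_len,
                     char *out, size_t out_cap, size_t *skipped)
{
    *skipped = 0;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* unknown encoding name */

    /* iconv() takes a char **, but it only reads from the input buffer. */
    char *in_ptr = (char *)in;
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    while (in_left > 0) {
        if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) != (size_t)-1)
            break;                  /* everything was converted */

        if (errno == EILSEQ || errno == EINVAL) {
            /* Invalid (EILSEQ) or truncated (EINVAL) sequence: iconv()
             * stopped with in_ptr at the offending byte, so skip it. */
            in_ptr++;
            in_left--;
            (*skipped)++;
        } else {
            /* E2BIG (output buffer full) or something unexpected. */
            iconv_close(cd);
            return (size_t)-1;
        }
    }

    iconv_close(cd);
    return out_cap - out_left;
}
```

A caller that would rather halt and complain can simply treat any nonzero *skipped as an error.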

One question is how you specify the source and destination encodings to Iconv. Iconv identifies encodings by text strings like “UTF-8,” and you can dump a list of all the encodings supported by your local Iconv installation by typing “iconv -l” (at least for GNU libiconv). But this doesn’t answer the question of how the names for these encodings are standardized – will different Iconv implementations refer to these encodings by the same names? The SUSv2 documentation for Iconv says that these encoding names are implementation dependent. But that is far from ideal – we want to write programs that will work the same across Iconv implementations.

There is one standardized list of character sets and names for them, which is published and maintained by IANA: IANA Character Sets. It’s not written anywhere, but I get the impression that GNU libiconv at least supports all the IANA-standard charset names. I have no idea what other Iconv implementations do, but I bet it’s inconsistent.

As far as I can see, the only option for C programs that want to portably provide transcoding between arbitrary character sets is to tell their users to use IANA-standard charset names, and consider any Iconv implementation that doesn’t support them broken.
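Since there is no portable way to enumerate the names from C, about the best a program can do is probe at runtime. A minimal sketch, assuming that “can be converted to UTF-8” is a good enough definition of “supported” (encoding_supported() is a name I made up, not part of any library):

```c
#include <iconv.h>

/* Hypothetical helper: returns 1 if the local iconv accepts `name` as a
 * source encoding for conversion to UTF-8, and 0 otherwise. */
int encoding_supported(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;       /* iconv_open() failed: name not recognized */
    iconv_close(cd);
    return 1;
}
```

A program could call this on a user-supplied charset name up front and print a friendly error instead of failing halfway through a conversion.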

Decoding bytes into characters

So far I’ve only addressed the problem of how to convert byte-oriented text between encodings. With this capability alone, we don’t have the tools necessary to:

count the number of characters in a string
iterate over the characters
get integer values for the character, so we can compare the character to other characters or ask questions like isupper()

Basically we’re missing the ability to break the encoded stream into characters. Where do we turn for this?

It turns out (and this surprised me a lot) that a bunch of wide character support was added to C99 (a PDF of the spec is freely available online). This includes functions for decoding bytes into characters. Unfortunately it’s not as useful as it could be – I’ll get into this in a moment. With C99 wide character support, the world looks like this:

C99 defines the “basic character set” in which everything fits into a single byte (think ASCII, because that’s what it will be in most cases), and the “extended character set”, a superset of the basic character set that adds locale-specific members.
There are two new types.
- wchar_t must be large enough to hold members of the extended character set for any supported locale.
- wint_t is the same but also must be able to hold a special value WEOF that is distinct from any member of the extended character set.
There are wchar_t- and wint_t-specific versions of the existing C library functions that operate on characters: is{alpha,alnum,digit}() become isw{alpha,alnum,digit}().

C99 also includes the jackpot in this case: the mbrtowc() and wcrtomb() functions convert back and forth between a char* buffer and a wchar_t. Using these functions, we can break an encoded byte stream into integer character values and vice-versa. We can pass the character values to functions like iswalpha().
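To make that concrete, here is a small sketch that walks a multibyte string with mbrtowc(), counting characters and asking iswalpha() about each one. The helper name count_chars() is invented for this example, and the decoding it performs depends entirely on whatever locale is current when it is called.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters in the multibyte string s, as
 * decoded under the current locale. *alpha receives the number of
 * alphabetic characters. Returns (size_t)-1 if s is not validly encoded. */
size_t count_chars(const char *s, size_t *alpha)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);    /* start in the initial shift state */

    size_t bytes_left = strlen(s);
    size_t count = 0;
    *alpha = 0;

    while (bytes_left > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, bytes_left, &state);

        /* (size_t)-1: invalid sequence; (size_t)-2: truncated sequence. */
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)
            break;                      /* decoded a null character */

        if (iswalpha((wint_t)wc))
            (*alpha)++;

        s += len;
        bytes_left -= len;
        count++;
    }
    return count;
}
```

In the default "C" locale this only handles plain ASCII reliably; to decode the user’s own encoding you would call setlocale(LC_CTYPE, "") (from locale.h) first.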

Unfortunately there is a downside. All these C99 functions are based on the idea of your current locale. This is a process-wide global (seriously! gross, I know) that is set by the setlocale() call. So if you want to decode characters from more than one encoding, you have to call setlocale() each time, which affects all locale-specific functions in all threads of your program. Thankfully, glibc at least has separate versions of many of these functions that take the locale explicitly as a parameter. For example, iswalpha_l(). But these are not standard, and I have no idea how widely supported they are.

Another problem is that locale names are not the same as charset names. Locale names look something like en_US or en_US.UTF-8. As you can see, they are a combination of language, country, and charset, and the charset name may or may not be part of the locale name. There is a SUSv2 function nl_langinfo() that will return a text description of a locale’s charset, but the names it returns are not standardized. That led a guy named Bruno Haible to create libcharset, whose only purpose is to call nl_langinfo() and translate its weird, non-standard names into “canonical” names (by which I can only assume he means IANA-standard). Furthermore, most systems don’t have that many locales installed (you can check your system with locale -a) – your locales probably don’t support as many charsets as your iconv installation does.
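For reference, asking a locale for its charset looks roughly like this. The helper codeset_for_locale() is a made-up name; note that it changes the process-wide LC_CTYPE locale as a side effect, which is exactly the wart described above.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switch LC_CTYPE to locale_name and return the
 * charset name that nl_langinfo() reports for it, or NULL if the locale
 * could not be selected. The returned string belongs to the C library and
 * may be overwritten by later calls. */
const char *codeset_for_locale(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

The string you get back for the same locale can differ between systems, which is the gap libcharset tries to paper over.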

As one final twist of the knife, GCC’s C99 support matrix says that wchar support is “missing” and also is a “library issue.” I don’t know exactly what this means, and I hope to get an answer to this question soon.

An iconv() trick

Because of all the problems I just mentioned with C99’s mbrtowc() and wcrtomb() functions, the glibc manual recommends not using them. Its alternative? Use Iconv with a special “WCHAR_T” charset specifier. I couldn’t find any official documentation about what “WCHAR_T” means to iconv, but this mailing list post from Ulrich Drepper claims that it means “the system dependent and locale dependent wide character encoding.” So you can effectively use Iconv to convert byte-oriented text into wchar_t-size characters. It’s a sort of roundabout way to achieve the same ends as the mbrtowc() and wcrtomb() functions.
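A minimal sketch of the trick, assuming an iconv that accepts the “WCHAR_T” name at all and assuming the input is UTF-8. The helper decode_to_wchar() is hypothetical, and what the resulting wchar_t values actually mean is up to the implementation.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: decode UTF-8 bytes into an array of wchar_t by
 * asking iconv for the special "WCHAR_T" target. Returns the number of
 * wchar_t values written, or (size_t)-1 on any failure. */
size_t decode_to_wchar(const char *in, size_t in_len,
                       wchar_t *out, size_t out_cap)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* this iconv does not know "WCHAR_T" */

    /* iconv() works in bytes, so view the wchar_t array as a byte buffer. */
    char *in_ptr = (char *)in;
    char *out_ptr = (char *)out;
    size_t in_left = in_len;
    size_t out_left = out_cap * sizeof(wchar_t);

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;      /* invalid input or output buffer too small */

    return out_cap - out_left / sizeof(wchar_t);
}
```

Each element of the output array can then be handed to functions like iswalpha(), just as with mbrtowc().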

Using this trick will allow you to break down encoded byte streams into characters. But it’s not clear to me whether Iconv’s interpretation of WCHAR_T would depend on your process’s current locale. That seems to be what Ulrich was saying, but to have iconv() behave in a locale-aware way seems really contrary to its normal behavior. Would it use the current locale when you call iconv_open(), or iconv(), or both?

A bigger problem is that this “pass WCHAR_T to iconv” strategy doesn’t appear to be very portable – my OS X 10.4.9 installation of iconv doesn’t support it. Sounds like it’s back to the drawing board.