Today I spent a lot of time figuring out the state of the art for character set support in C. I wanted to summarize my findings: hopefully distilling what I’ve learned will be useful to someone, as well as giving me the chance to run this by people who are really good at this stuff (I’m looking at you, Matt Brubeck :).
A little background: I am currently working on a parsing framework – something designed to be an improvement on the current generation of parser/lexer generation tools like flex, bison, and ANTLR. Naturally, such a thing designed in 2007 will have to be encoding aware to deserve even a cursory glance from today’s sophisticated and culturally-sensitive programmers, so I spent today figuring out how to do that using my chosen tool: C.
After a lot of investigating, I arrived at this mental model:
“Encoded text” would be bytes of data in UTF-8, ASCII, Shift JIS, or any other character encoding. It’s data you would send across the wire as part of an HTTP request or response (provided you tell your peer what encoding you’re using), store in a file, or process as a character string in a language like C.
C programmers can transcode encoded text between two encodings using the iconv library. Iconv is a transcoding API that first appeared in HP-UX, but which was subsequently standardized as part of the Single Unix Specification. There is a GNU implementation that is used in Linux and probably other free UNIXes also.
The interface to iconv is extremely simple: you give it encoded data in one encoding, and it will transcode that data to another encoding. Here is some sample code:
There are a few small complications, but that’s basically it. One thing to note is that in many encodings, it is possible to construct invalid byte sequences – byte sequences that don’t follow the rules of the encoding – and so Iconv has a way of reporting errors of this sort to the user. Since this indicates invalid or corrupt data, your options are to discard the corrupt data and press on, or halt and complain.
One question is how you specify the source and destination encodings to Iconv. Iconv identifies encodings by text strings like “UTF-8,” and you can dump a list of all the encodings supported by your local Iconv installation by typing “iconv -l” (at least for GNU libiconv). But this doesn’t answer the question of how the names for these encodings are standardized – will different Iconv implementations refer to these encodings by the same names? The SUSv2 documentation for Iconv says that these encoding names are implementation dependent. But that is far from ideal – we want to write programs that will work the same across Iconv implementations.
There is one standardized list of character sets and names for them, which is published and maintained by IANA: IANA Character Sets. It isn’t written down anywhere that I could find, but I get the impression that GNU libiconv at least supports all the IANA-standard charset names. I have no idea what other Iconv implementations do, but I bet it’s inconsistent.
As far as I can see, the only option for C programs that want to portably provide transcoding between arbitrary character sets is to tell their users to use IANA-standard charset names, and consider any Iconv implementation that doesn’t support them broken.
Decoding bytes into characters
So far I’ve only addressed the problem of how to convert byte-oriented text between encodings. With this capability alone, we don’t have the tools necessary to:
- count the number of characters in a string
- iterate over the characters
- get integer values for each character, so we can compare characters to one another or ask questions like isupper()
Basically we’re missing the ability to break the encoded stream into characters. Where do we turn for this?
It turns out (and this surprised me a lot) a bunch of wide character support was added to C99 (a PDF of the spec is freely available online). This includes functions for decoding bytes into characters. Unfortunately it’s not as useful as it could be – I’ll get into this in a moment. With C99 wide character support, the world looks like this.
- C99 defines the “basic character set” in which everything fits into a single byte (think ASCII, because that’s what it will be in most cases), and the “extended character set”, a superset of the basic character set that adds locale-specific members.
- There are two new types. wchar_t must be large enough to hold members of the extended character set for any supported locale. wint_t is the same, but must also be able to hold a special value WEOF that is distinct from any member of the extended character set.
- There are wint_t-specific versions of existing C library functions that operate on characters, like iswupper() for isupper().
- C99 also includes the jackpot in this case: the functions mbrtowc() and wcrtomb(), which convert back and forth between a char* buffer and a wchar_t.

Using these functions, we can break an encoded byte stream into integer character values and vice-versa. We can pass the character values to functions like iswupper().
Unfortunately there is a downside. All these C99 functions are based on the idea of your current locale. This is a process-wide global (seriously! gross, I know) that is set by the setlocale() call. So if you want to decode characters from more than one encoding, you have to do a setlocale() each time, which affects all locale-specific functions in all threads of your program. Thankfully, glibc at least has separate versions of all these functions that take a locale explicitly as a parameter – for example, mbrtowc_l(). But these are not standard, and I have no idea how widely supported they are.
Another problem is that locale names are not the same as charset names. Locale names look something like en_US.UTF-8. As you can see, they are a combination of country, language, and charset, and the charset name may or may not be part of the locale name. There is a SUSv2 function nl_langinfo() that will return a text description of a locale’s charset, but the names it returns are not standardized, which led a guy named Bruno Haible to create libcharset, whose only purpose is to call nl_langinfo() and translate its weird, non-standard names into “canonical” names (by which I can only assume he means IANA-standard). Furthermore, most systems don’t have that many locales installed (you can check your system with locale -a) – your locales probably don’t support as many charsets as your iconv installation does.
As one final twist of the knife, GCC’s C99 support matrix says that wchar support is “missing” and also is a “library issue.” I don’t know exactly what this means, and I hope to get an answer to this question soon.
An iconv() trick
Because of all the problems I just mentioned with C99’s mbrtowc()/wcrtomb() functions, the glibc manual recommends not using them. Its alternative? Use Iconv with a special “WCHAR_T” charset specifier. I couldn’t find any official documentation about what “WCHAR_T” means to iconv, but this mailing list post from Ulrich Drepper claims that it means “the system dependent and locale dependent wide character encoding.” So you can effectively use Iconv to convert byte-oriented text into wchar_t-size characters. It’s a sort of roundabout way to achieve the same ends as the mbrtowc()/wcrtomb() functions.
Using this trick will allow you to break down encoded byte streams into characters. But it’s not clear to me whether Iconv’s interpretation of WCHAR_T would depend on your process’s current locale. That seems to be what Ulrich was saying, but to have iconv() behave in a locale-aware way seems really contrary to its normal behavior. Would it use the current locale when you call iconv_open(), when you call iconv(), or both?
A bigger problem is that this “pass
WCHAR_T to iconv”
strategy doesn’t appear to be very portable – my OS X 10.4.9
installation of iconv doesn’t support it. Sounds like it’s back to
the drawing board.