“Unicode is not an encoding”
A lot of people commented on my last entry with the “Unicode is not an encoding” retort. This is apparently a pedantic point that people love to make to prove that they know more about internationalization of text than you do.
The first such comment:
Unicode is not an encoding – It’s a charset. It’s also the only sane way to represent characters in memory.
I’ll see your pedantry and raise you one. Unicode is not only a character set, it’s actually a character set and three associated encodings. I will refer you to the Unicode Standard 5.0, section 2.5:
The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32 respectively.
These three encodings are part of the Unicode standard, and thus are part of Unicode.
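To make the "three encoding forms" concrete, here is a minimal Python 3 sketch that counts how many code units the same code point occupies in each form. The sample characters and the helper name `code_units` are my own choices for illustration; they are not from the standard.

```python
# Count code units per character in each Unicode encoding form.
# The sample text is arbitrary: A, é, €, and a musical G clef (outside the BMP).
text = "A\u00e9\u20ac\U0001d11e"

def code_units(s, codec, unit_size):
    """Return how many code units of unit_size bytes s occupies in codec."""
    return len(s.encode(codec)) // unit_size

for ch in text:
    print(
        f"U+{ord(ch):04X}",
        "utf-8:", code_units(ch, "utf-8", 1),       # 8-bit units
        "utf-16:", code_units(ch, "utf-16-le", 2),  # 16-bit units
        "utf-32:", code_units(ch, "utf-32-le", 4),  # 32-bit units
    )
```

The explicit little-endian codecs are used only so that Python does not prepend a byte order mark, which would throw off the counts.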
I will furthermore refer you to the first sentence of the standard:
The Unicode Standard is the universal character encoding standard for written characters and text.
I will furthermore refer you to two paragraphs later, which speaks of “The Unicode character encoding.”
I will furthermore point out that the word “encoding” or “encode” appears 10 times in the opening section of the standard, while the word “set” (or “character set”) appears exactly once, and is in fact referring to ASCII (“the ASCII character set”), not Unicode.
If the Unicode Standard itself is allowed to refer to Unicode inclusively as an “encoding,” then I am too. Though I hate to deprive the pedants of a point that they love correcting other people about.
This is completely beside the point. “Unicode” is a noun and an adjective. UTF-8/16/32 are “Unicode encodings,” yes. But when you use Unicode as a noun, it refers to the mapping of numbers to characters, and not how the mapping is expressed in binary. This is not a pedantic point. It’s the basis for the entire ability of people to mix different writing systems together in one webpage.
Your last entry showed that you either don’t know what’s different about Python 3.0 (which is forgivable for a Ruby person) or that you don’t know what the difference between an encoding and a character set is (which is unforgivable for anyone doing web programming). Based on this “defense,” I’m leaning toward the former.
@Carl: Your noun vs. adjective stuff is making a distinction that the Unicode standard itself does not make. The Unicode standard speaks of “The Unicode character encoding,” speaking in reference to the character set, not the byte-encoding:
“The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility.”
also:
“The Unicode Standard specifies a numeric value (code point) and a name for each of its characters. In this respect, it is similar to other character encoding standards from ASCII onward.”
According to you, the above statements are nonsense, because they concern the character set, not the byte-encoding. And yet this is straight out of page 1 of the standard. You’re going to have to admit that there’s some latitude in terminology here, unless this list of people who “don’t know the difference between an encoding and a character set” includes the Unicode Standard 5.0, page 1.
I’m quite aware of the difference between a character set and a character encoding, when this specific distinction is drawn. Python is only capable of representing Unicode characters internally, which means that round-trips from encodings that are not injective onto Unicode will be lossy.
The phrase “The Unicode character encoding” does not mean an encoding in the sense of “something that maps into bytes.” It means “something that maps into numbers.” They’re not agreeing with you about the meaning of Unicode, they’re disagreeing with us about the meaning of encoding. Or they would be disagreeing with us if we were talking about the same thing. But we choose to call a “something that maps into numbers” a character set for the simple reason that to do otherwise is confusing. It doesn’t change the core assertion that “Unicode” itself cannot be written as a byte string. Only UTF-8/16/32/etc. can.
“Python is only capable of representing Unicode characters internally, which means that round-trips from encodings that are not injective onto Unicode will be lossy.”
A) Name one such encoding.
Hell, I’ll spot you one: Emoji.
Ah, but what’s this?
Never mind then. For all practical purposes, every major encoding in the world today is either an official part of Unicode or using a section of the private use characters in Unicode by convention, with proposals being bandied about for eventual inclusion in Unicode. The only exceptions are things like Mojikyo, which have no real support anywhere anyway. (So far as I can tell, Ruby doesn’t support it in any meaningful sense either.) And if you really must work with Mojikyo, well, there’s always bytestrings.
B) The problem for the Japanese is not roundtripping (“The list of Japanese characters in Unicode was ripped straight from JIS, so JIS can be converted into Unicode without many problems.”). The problem is that they didn’t want Chinese characters to map onto Japanese characters. So, there’s no problem with round-tripping. The problem is going from ShiftJIS to Unicode to some Chinese encoding. Or going from ShiftJIS to some Chinese encoding for that matter. They didn’t want it to be possible, but it is. So, we live with it.
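A quick way to check the round-tripping claim for yourself, as a sketch in Python 3. The sample strings are my own, and a handful of common characters obviously proves nothing about the whole of JIS; it only shows the shape of the test. The second half shows the other point: the same code points can be written out in a Chinese encoding, because the Han characters share code points.

```python
# Shift JIS -> Unicode -> Shift JIS round trip on a sample string.
original = "日本語のテキスト"
sjis_bytes = original.encode("shift_jis")
decoded = sjis_bytes.decode("shift_jis")

# The round trip is lossless if both the text and the bytes survive intact.
round_trip_ok = decoded == original and decoded.encode("shift_jis") == sjis_bytes
print(round_trip_ok)

# The same unified Han code points can also be encoded in a Chinese encoding.
kanji = "日本"
as_sjis = kanji.encode("shift_jis")
as_gb2312 = kanji.encode("gb2312")
same_text = as_sjis.decode("shift_jis") == as_gb2312.decode("gb2312")
print(same_text)
```

The byte strings differ between the two encodings, but both decode to the same sequence of code points, which is exactly the conversion path described above.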
Very fascinating Carl, this corroborates with my real world experience and explains a lot of things for me. Also thanks Josh, I didn’t realize that Unicode includes both charset and encodings. It was a bit fuzzy in my mind.
It seems that the Unicode standard uses the word “encoding” to mean something different than Ruby and Python (and XML and Java and …) do:
Unicode: “Encoded Character. An association (or mapping) between an abstract character and a code point.”
XML: “The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode”
I’m not personally aware of how the language drifted apart between the Unicode standard and the others. Seems odd.
To determine whether Ruby _really_ supports non-Unicode “encodings” (according to the Unicode definition) better than Python does, I’d need to know what Ruby’s definition of “character” is. i.e. how are characters named, how is equivalence defined, and what properties are available on character objects?
In Python, Java, Javascript, C#, etc. two character objects are equivalent if they have the same code point. Given the code point, you can infer the character’s name, classification and other properties based on the Unicode database:
http://unicode.org/ucd/
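For the Python side of that question, a small sketch using the standard library's `unicodedata` module, which exposes the Unicode database. The sample character is my own choice.

```python
import unicodedata

# Two ways of writing the same code point compare equal.
a = "\u00e9"
b = chr(0xE9)
print(a == b)

# Properties are looked up from the code point via the Unicode database.
print(unicodedata.name(a))      # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(a))  # Ll (letter, lowercase)

# Equivalence is by code point only: the decomposed spelling
# (e + combining acute) is a different string until normalized.
decomposed = "e\u0301"
print(a == decomposed)
print(a == unicodedata.normalize("NFC", decomposed))
```

Note the last two lines: comparison is strictly by code point, so canonically equivalent sequences are only equal after explicit normalization.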
Can somebody please explain Ruby’s definition of “character”?
UTF-8 is the 8-bit Unicode Transformation Format which is a variable length character encoding.
My favorite character is a ẽ. That is a small e with a tilde over it. Notice that it takes 3 bytes in a row to encode it in UTF-8:
0xE1 0xBA 0xBD
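Those three values can be checked in Python 3. This is just a sketch; it assumes the precomposed form of the character, U+1EBD, which is what the byte sequence above corresponds to.

```python
import unicodedata

ch = "\u1ebd"  # precomposed small e with tilde
encoded = ch.encode("utf-8")

print(len(ch))   # one code point
print(" ".join(f"0x{byte:02X}" for byte in encoded))  # 0xE1 0xBA 0xBD

# The decomposed form is two code points: plain e plus a combining tilde.
decomposed = unicodedata.normalize("NFD", ch)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0303']
```

So it is one character, one code point, and three bytes in UTF-8; the count of three belongs to the encoding, not to the character.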
I really don’t care much for Unicode, but it isn’t going away so I guess I will just have to embrace it.
There’s a bit of equivocation here, unintentional I’m sure.
As other commenters have more or less pointed out, there are two encoding steps in modern character representation systems: the mapping from linguistic symbols to integers (Unicode “code-points”), and the mapping from integers to bitstrings. Let’s call these “logical encoding” and “physical encoding” respectively.
In our increasingly Unicode-standardized world, logical encoding is rarely a question in most new software projects, while physical encoding frequently is. So I don’t blame your commenters for being confused, especially when your two examples included one related only to logical encoding (Han unification) and one related only to physical encoding (space efficiency of Asian scripts).
The difference between Ruby and Python seems to be that, while both support various physical encodings for I/O, Python 3’s string objects are always represented in memory as sequences of Unicode code-points, while Ruby 1.9 can use any supported physical encoding as the in-memory representation. Theoretically this would allow Ruby to support encoding schemes that are not Unicode-compatible, as suggested above. I’m not sure any have been implemented in practice, though. In my Ruby 1.9 installation, all 81 installed encodings are compatible (using Encoding.compatible?).
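The Python 3 half of that comparison can be sketched as follows. The sample word is my own, and this only shows the behavior visible from the language; how CPython lays the string out in memory is an implementation detail that this does not demonstrate.

```python
# The same text arriving in two different physical encodings.
text = "na\u00efve"
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")
print(latin1_bytes != utf8_bytes)  # the byte strings differ

# Once decoded, both are the same sequence of code points,
# and the str object carries no memory of where it came from.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")
print(from_latin1 == from_utf8)
print([f"U+{ord(c):04X}" for c in from_utf8])
```

In Ruby 1.9, by contrast, each string keeps its own encoding, so the equivalent of `from_latin1` and `from_utf8` would still be tagged with different encodings.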
Regardless of the file encoding, “\uXXXX” in a Ruby 1.9 string literal refers to a Unicode code-point. So the Unicode logical encoding scheme does still have a special place in Ruby’s heart.
I took some notes from a talk by Brett Cannon (Python core developer) at a meetup, which I hope I faithfully reproduced onto my blog:
“For performance reasons Unicode strings are stored in UTF-16 internally, but you will likely stick with UTF-8 and let Python take care of the translation.”
http://nathany.com/developer/vanpyz-from-future/
I’m no guru on codepoints and encodings and all, but that sounds like pretty standard fare… doesn’t Windows use UTF-16 for the same reason?