<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: &#8220;Unicode is not an encoding&#8221;</title>
	<atom:link href="http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/</link>
	<description>parsing, performance, minimalism with C99</description>
	<lastBuildDate>Mon, 06 Feb 2012 23:44:51 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Nathan Youngman</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1301</link>
		<dc:creator>Nathan Youngman</dc:creator>
		<pubDate>Mon, 13 Apr 2009 09:21:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1301</guid>
		<description>I took some notes from a talk by Brett Cannon (Python core developer) at a meetup, which I hope I faithfully reproduced onto my blog:

&quot;For performance reasons Unicode strings are stored in UTF-16 internally, but you will likely stick with UTF-8 and let Python take care of the translation.&quot;

http://nathany.com/developer/vanpyz-from-future/

I&#039;m no guru on codepoints and encodings and all, but that sounds like pretty standard fare... doesn&#039;t Windows use UTF-16 for the same reason?</description>
		<content:encoded><![CDATA[<p>I took some notes from a talk by Brett Cannon (Python core developer) at a meetup, which I hope I faithfully reproduced onto my blog:</p>
<p>&#8220;For performance reasons Unicode strings are stored in UTF-16 internally, but you will likely stick with UTF-8 and let Python take care of the translation.&#8221;</p>
<p><a href="http://nathany.com/developer/vanpyz-from-future/" rel="nofollow">http://nathany.com/developer/vanpyz-from-future/</a></p>
<p>I&#8217;m no guru on codepoints and encodings and all, but that sounds like pretty standard fare&#8230; doesn&#8217;t Windows use UTF-16 for the same reason?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Brubeck</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1219</link>
		<dc:creator>Matt Brubeck</dc:creator>
		<pubDate>Mon, 02 Feb 2009 14:17:37 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1219</guid>
		<description>There&#039;s a bit of equivocation here, unintentional I&#039;m sure.

As other commenters have more or less pointed out, there are two encoding steps in modern character representation systems:  the mapping from linguistic symbols to integers (Unicode &quot;code-points&quot;), and the mapping from integers to bitstrings.  Let&#039;s call these &quot;logical encoding&quot; and &quot;physical encoding&quot; respectively.

In our increasingly Unicode-standardized world, logical encoding is rarely a question in most new software projects, while physical encoding frequently is.  So I don&#039;t blame your commenters for being confused, especially when your two examples included one related only to logical encoding (Han unification) and one related only to physical encoding (space efficiency of Asian scripts).

The difference between Ruby and Python seems to be that, while both support various physical encodings for I/O, Python 3&#039;s string objects are always represented in memory as sequences of Unicode code-points, while Ruby 1.9 can use any supported physical encoding as the in-memory representation.  Theoretically this would allow Ruby to support encoding schemes that are not Unicode-compatible, as suggested above.  I&#039;m not sure any have been implemented in practice, though.  In my Ruby 1.9 installation, all 81 installed encodings are compatible (using Encoding.compatible?).

Regardless of the file encoding, &quot;\uXXXX&quot; in a Ruby 1.9 string literal refers to a Unicode code-point.  So the Unicode logical encoding scheme does still have a special place in Ruby&#039;s heart.  :)</description>
		<content:encoded><![CDATA[<p>There&#8217;s a bit of equivocation here, unintentional I&#8217;m sure.</p>
<p>As other commenters have more or less pointed out, there are two encoding steps in modern character representation systems:  the mapping from linguistic symbols to integers (Unicode &#8220;code-points&#8221;), and the mapping from integers to bitstrings.  Let&#8217;s call these &#8220;logical encoding&#8221; and &#8220;physical encoding&#8221; respectively.</p>
<p>In our increasingly Unicode-standardized world, logical encoding is rarely a question in most new software projects, while physical encoding frequently is.  So I don&#8217;t blame your commenters for being confused, especially when your two examples included one related only to logical encoding (Han unification) and one related only to physical encoding (space efficiency of Asian scripts).</p>
<p>The difference between Ruby and Python seems to be that, while both support various physical encodings for I/O, Python 3&#8217;s string objects are always represented in memory as sequences of Unicode code-points, while Ruby 1.9 can use any supported physical encoding as the in-memory representation.  Theoretically this would allow Ruby to support encoding schemes that are not Unicode-compatible, as suggested above.  I&#8217;m not sure any have been implemented in practice, though.  In my Ruby 1.9 installation, all 81 installed encodings are compatible (using Encoding.compatible?).</p>
<p>Regardless of the file encoding, &#8220;\uXXXX&#8221; in a Ruby 1.9 string literal refers to a Unicode code-point.  So the Unicode logical encoding scheme does still have a special place in Ruby&#8217;s heart.  <img src='http://blog.reverberate.org/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jeffld</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1215</link>
		<dc:creator>jeffld</dc:creator>
		<pubDate>Mon, 02 Feb 2009 01:57:08 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1215</guid>
		<description>UTF-8 is the 8-bit Unicode Transformation Format which is a variable length character encoding. 

My favorite character is ẽ, a small e with a tilde over it.  Notice that it takes 3 bytes in a row to encode it in UTF-8:


0xE1 0xBA 0xBD


I really don&#039;t care much for Unicode, but it isn&#039;t going away so I guess I will just have to embrace it.</description>
		<content:encoded><![CDATA[<p>UTF-8 is the 8-bit Unicode Transformation Format which is a variable length character encoding. </p>
<p>My favorite character is ẽ, a small e with a tilde over it.  Notice that it takes 3 bytes in a row to encode it in UTF-8:</p>
<p>0xE1 0xBA 0xBD</p>
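<p>The byte sequence can be checked with a short Python 3 sketch (variable names are illustrative). It also shows, as an assumption about how such text tends to get garbled, what happens when the same bytes are misread as Latin-1: one character turns into three.</p>

```python
ch = "\u1ebd"  # LATIN SMALL LETTER E WITH TILDE
encoded = ch.encode("utf-8")

# One code point, three bytes in UTF-8.
print(len(ch), len(encoded))                        # 1 3
print(" ".join(f"0x{b:02X}" for b in encoded))      # 0xE1 0xBA 0xBD

# Decoding the same bytes with a single-byte encoding yields three characters.
misread = encoded.decode("latin-1")
print(len(misread))                                 # 3
```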
<p>I really don&#8217;t care much for Unicode, but it isn&#8217;t going away so I guess I will just have to embrace it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paul Prescod</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1213</link>
		<dc:creator>Paul Prescod</dc:creator>
		<pubDate>Sun, 01 Feb 2009 21:04:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1213</guid>
		<description>It seems that the Unicode standard uses the word &quot;encoding&quot; to mean something different than Ruby and Python (and XML and Java and ...) do:

Unicode: &quot;Encoded Character. An association (or mapping) between an abstract character and a code point.&quot;

XML: &quot;The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode&quot;

I&#039;m not personally aware of how the language drifted apart between the Unicode standard and the others. Seems odd.

To determine whether Ruby _really_ supports non-Unicode &quot;encodings&quot; (according to the Unicode definition) better than Python does, I&#039;d need to know what Ruby&#039;s definition of &quot;character&quot; is. i.e. how are characters named, how is equivalence defined, and what properties are available on character objects?

In Python, Java, Javascript, C#, etc. two character objects are equivalent if they have the same code point. Given the code point, you can infer the character&#039;s name, classification and other properties based on the Unicode database 

http://unicode.org/ucd/

Can somebody please explain Ruby&#039;s definition of &quot;character&quot;?</description>
		<content:encoded><![CDATA[<p>It seems that the Unicode standard uses the word &#8220;encoding&#8221; to mean something different than Ruby and Python (and XML and Java and &#8230;) do:</p>
<p>Unicode: &#8220;Encoded Character. An association (or mapping) between an abstract character and a code point.&#8221;</p>
<p>XML: &#8220;The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode&#8221;</p>
<p>I&#8217;m not personally aware of how the language drifted apart between the Unicode standard and the others. Seems odd.</p>
<p>To determine whether Ruby _really_ supports non-Unicode &#8220;encodings&#8221; (according to the Unicode definition) better than Python does, I&#8217;d need to know what Ruby&#8217;s definition of &#8220;character&#8221; is. i.e. how are characters named, how is equivalence defined, and what properties are available on character objects?</p>
<p>In Python, Java, Javascript, C#, etc. two character objects are equivalent if they have the same code point. Given the code point, you can infer the character&#8217;s name, classification and other properties based on the Unicode database </p>
<p><a href="http://unicode.org/ucd/" rel="nofollow">http://unicode.org/ucd/</a></p>
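<p>For the Python side, a small sketch using the standard <code>unicodedata</code> module shows what this looks like in practice. The sample character is an arbitrary choice, and which version of the Unicode database is consulted depends on the Python build.</p>

```python
import unicodedata

ch = "\u1ebd"  # arbitrary sample character

# Properties are looked up from the code point via the Unicode database.
print(hex(ord(ch)))              # 0x1ebd
print(unicodedata.name(ch))      # LATIN SMALL LETTER E WITH TILDE
print(unicodedata.category(ch))  # Ll

# Equality is by code point: the precomposed form and the
# "e + combining tilde" sequence are different strings.
same_code_point = ch == chr(0x1EBD)
decomposed_equal = ch == "e\u0303"
print(same_code_point, decomposed_equal)  # True False
```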
<p>Can somebody please explain Ruby&#8217;s definition of &#8220;character&#8221;?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1206</link>
		<dc:creator>Chris</dc:creator>
		<pubDate>Sun, 01 Feb 2009 12:52:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1206</guid>
		<description>Fascinating, Carl; this agrees with my real-world experience and explains a lot of things for me. Also, thanks, Josh: I didn&#039;t realize that Unicode includes both a character set and encodings. It was a bit fuzzy in my mind.</description>
		<content:encoded><![CDATA[<p>Fascinating, Carl; this agrees with my real-world experience and explains a lot of things for me. Also, thanks, Josh: I didn&#8217;t realize that Unicode includes both a character set and encodings. It was a bit fuzzy in my mind.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Carl</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1200</link>
		<dc:creator>Carl</dc:creator>
		<pubDate>Sun, 01 Feb 2009 06:29:05 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1200</guid>
		<description>The phrase &#8220;The Unicode character encoding&#8221; does not mean an encoding in the sense of &#8220;something that maps into bytes.&#8221; It means &#8220;something that maps into numbers.&#8221; They&#8217;re not agreeing with you about the meaning of Unicode; they&#8217;re disagreeing with us about the meaning of encoding. Or they would be disagreeing with us if we were talking about the same thing. But we choose to call a &#8220;something that maps into numbers&#8221; a character set for the simple reason that to do otherwise is confusing. It doesn&#8217;t change the core assertion that &#8220;Unicode&#8221; itself cannot be written as a byte string. Only UTF-8/16/32/etc. can. 

&#8220;Python is only capable of representing Unicode characters internally, which means that round-trips from encodings that are not injective onto Unicode will be lossy.&#8221;

A) Name one such encoding.

Hell, I&#8217;ll spot you one: Emoji. 

Ah, but &lt;a href=&quot;http://google-opensource.blogspot.com/2008/11/emoji-for-unicode-open-source-data-for.html&quot; rel=&quot;nofollow&quot;&gt;what&#8217;s this&lt;/a&gt;? 

Never mind then. For all practical purposes, every major encoding in the world today is either an official part of Unicode or using a section of the private use characters in Unicode by convention with proposals being bandied about for eventual inclusion in Unicode. The only exceptions are things like Mojikyo which have no real support anywhere anyway. (So far as I can tell, Ruby doesn&#8217;t support it in any meaningful sense either.) And if you really must work with Mojikyo, well &lt;a href=&quot;http://mail.python.org/pipermail/python-3000/2006-September/003405.html&quot; rel=&quot;nofollow&quot;&gt;there&#8217;s always bytestrings&lt;/a&gt;.

B) The problem for the Japanese is not roundtripping (&lt;a href=&quot;http://web.archive.org/web/20050312062050/http://www.cs.mcgill.ca/~aelias4/encodings.html&quot; rel=&quot;nofollow&quot;&gt;&#8220;The list of Japanese characters in Unicode was ripped straight from JIS, so JIS can be converted into Unicode without many problems.&#8221;&lt;/a&gt;). The problem is that they didn&#8217;t want Chinese characters to map onto Japanese characters. So, there&#8217;s no problem with round-tripping. The problem is going from Shift_JIS to Unicode to some Chinese encoding. Or going from Shift_JIS to some Chinese encoding for that matter. They didn&#8217;t want it to be possible, but it is. So, we live with it.</description>
		<content:encoded><![CDATA[<p>The phrase &#8220;The Unicode character encoding&#8221; does not mean an encoding in the sense of &#8220;something that maps into bytes.&#8221; It means &#8220;something that maps into numbers.&#8221; They&#8217;re not agreeing with you about the meaning of Unicode; they&#8217;re disagreeing with us about the meaning of encoding. Or they would be disagreeing with us if we were talking about the same thing. But we choose to call a &#8220;something that maps into numbers&#8221; a character set for the simple reason that to do otherwise is confusing. It doesn&#8217;t change the core assertion that &#8220;Unicode&#8221; itself cannot be written as a byte string. Only UTF-8/16/32/etc. can. </p>
<p>&#8220;Python is only capable of representing Unicode characters internally, which means that round-trips from encodings that are not injective onto Unicode will be lossy.&#8221;</p>
<p>A) Name one such encoding.</p>
<p>Hell, I&#8217;ll spot you one: Emoji. </p>
<p>Ah, but <a href="http://google-opensource.blogspot.com/2008/11/emoji-for-unicode-open-source-data-for.html" rel="nofollow">what&#8217;s this</a>? </p>
<p>Never mind then. For all practical purposes, every major encoding in the world today is either an official part of Unicode or using a section of the private use characters in Unicode by convention with proposals being bandied about for eventual inclusion in Unicode. The only exceptions are things like Mojikyo which have no real support anywhere anyway. (So far as I can tell, Ruby doesn&#8217;t support it in any meaningful sense either.) And if you really must work with Mojikyo, well <a href="http://mail.python.org/pipermail/python-3000/2006-September/003405.html" rel="nofollow">there&#8217;s always bytestrings</a>.</p>
<p>B) The problem for the Japanese is not roundtripping (<a href="http://web.archive.org/web/20050312062050/http://www.cs.mcgill.ca/~aelias4/encodings.html" rel="nofollow">&#8220;The list of Japanese characters in Unicode was ripped straight from JIS, so JIS can be converted into Unicode without many problems.&#8221;</a>). The problem is that they didn&#8217;t want Chinese characters to map onto Japanese characters. So, there&#8217;s no problem with round-tripping. The problem is going from Shift_JIS to Unicode to some Chinese encoding. Or going from Shift_JIS to some Chinese encoding for that matter. They didn&#8217;t want it to be possible, but it is. So, we live with it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: josh</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1196</link>
		<dc:creator>josh</dc:creator>
		<pubDate>Sun, 01 Feb 2009 03:39:30 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1196</guid>
		<description>@Carl: Your noun vs. adjective stuff is making a distinction that the Unicode standard itself does not make.  The Unicode standard speaks of &quot;The Unicode character encoding,&quot; speaking in reference to the character set, not the byte-encoding:

&quot;The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility.&quot;

also:

&quot;The Unicode Standard specifies a numeric value (code point) and a name for each of its characters.  In this respect, it is similar to other character encoding standards from ASCII onward.&quot;

According to you, the above statements are nonsense, because they concern the character set, not the byte-encoding.  And yet this is &lt;i&gt;straight out of page 1 of the standard&lt;/i&gt;.  You&#039;re going to have to admit that there&#039;s some latitude in terminology here, unless this list of people who &quot;don&#039;t know the difference between an encoding and a character set&quot; includes the Unicode Standard 5.0, page 1.

I&#039;m quite aware of the difference between a character set and a character encoding, when this specific distinction is drawn.  Python is only capable of representing Unicode characters internally, which means that round-trips from encodings that are not &lt;a href=&quot;http://en.wikipedia.org/wiki/Injective_function&quot; rel=&quot;nofollow&quot;&gt;injective&lt;/a&gt; onto Unicode will be lossy.</description>
		<content:encoded><![CDATA[<p>@Carl: Your noun vs. adjective stuff is making a distinction that the Unicode standard itself does not make.  The Unicode standard speaks of &#8220;The Unicode character encoding,&#8221; speaking in reference to the character set, not the byte-encoding:</p>
<p>&#8220;The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility.&#8221;</p>
<p>also:</p>
<p>&#8220;The Unicode Standard specifies a numeric value (code point) and a name for each of its characters.  In this respect, it is similar to other character encoding standards from ASCII onward.&#8221;</p>
<p>According to you, the above statements are nonsense, because they concern the character set, not the byte-encoding.  And yet this is <i>straight out of page 1 of the standard</i>.  You&#8217;re going to have to admit that there&#8217;s some latitude in terminology here, unless this list of people who &#8220;don&#8217;t know the difference between an encoding and a character set&#8221; includes the Unicode Standard 5.0, page 1.</p>
<p>I&#8217;m quite aware of the difference between a character set and a character encoding, when this specific distinction is drawn.  Python is only capable of representing Unicode characters internally, which means that round-trips from encodings that are not <a href="http://en.wikipedia.org/wiki/Injective_function" rel="nofollow">injective</a> onto Unicode will be lossy.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Carl</title>
		<link>http://blog.reverberate.org/2009/01/31/unicode-not-an-encoding/comment-page-1/#comment-1194</link>
		<dc:creator>Carl</dc:creator>
		<pubDate>Sun, 01 Feb 2009 03:21:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=111#comment-1194</guid>
		<description>This is completely beside the point. &#8220;Unicode&#8221; is a noun and an adjective. UTF-8/16/32 are &#8220;Unicode encodings,&#8221; yes. But when you use Unicode as a noun, it refers to the mapping of numbers to characters, and not how the mapping is expressed in binary. This is not a pedantic point. It&#8217;s the basis for the entire ability of people to mix different writing systems together in one webpage.

Your last entry showed that you either don&#8217;t know what&#8217;s different about Python 3.0 (which is forgivable for a Ruby person) or that you don&#8217;t know what the difference between an encoding and a character set is (which is unforgivable for anyone doing web programming). Based on this &#8220;defense,&#8221; I&#8217;m leaning toward the former. </description>
		<content:encoded><![CDATA[<p>This is completely beside the point. &#8220;Unicode&#8221; is a noun and an adjective. UTF-8/16/32 are &#8220;Unicode encodings,&#8221; yes. But when you use Unicode as a noun, it refers to the mapping of numbers to characters, and not how the mapping is expressed in binary. This is not a pedantic point. It&#8217;s the basis for the entire ability of people to mix different writing systems together in one webpage.</p>
<p>Your last entry showed that you either don&#8217;t know what&#8217;s different about Python 3.0 (which is forgivable for a Ruby person) or that you don&#8217;t know what the difference between an encoding and a character set is (which is unforgivable for anyone doing web programming). Based on this &#8220;defense,&#8221; I&#8217;m leaning toward the former.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

