<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The Ten Commandments of Unicode</title>
	<atom:link href="http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/</link>
	<description>Longer than a blog; shorter than a book</description>
	<lastBuildDate>Wed, 08 Feb 2012 21:45:25 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
	<item>
		<title>By: Michael Doran</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-397182</link>
		<dc:creator>Michael Doran</dc:creator>
		<pubDate>Tue, 12 May 2009 20:08:35 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-397182</guid>
		<description>8. Thou shalt generate all text in Normalization Form C whenever possible.

I&#039;ve tended towards Form D (Canonical Decomposition) as being the more desirable Unicode normalization form.  I am curious as to the rationale for recommending Form C (Canonical Decomposition, followed by Canonical Composition).</description>
		<content:encoded><![CDATA[<p>8. Thou shalt generate all text in Normalization Form C whenever possible.</p>
<p>I&#8217;ve tended towards Form D (Canonical Decomposition) as being the more desirable Unicode normalization form.  I am curious as to the rationale for recommending Form C (Canonical Decomposition, followed by Canonical Composition).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elliotte Rusty Harold</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-379375</link>
		<dc:creator>Elliotte Rusty Harold</dc:creator>
		<pubDate>Thu, 02 Apr 2009 12:14:18 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-379375</guid>
		<description>Not really, I&#039;m afraid. There are a lot of things that could go wrong with that process.</description>
		<content:encoded><![CDATA[<p>Not really, I&#8217;m afraid. There are a lot of things that could go wrong with that process.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bennett</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-219864</link>
		<dc:creator>Bennett</dc:creator>
		<pubDate>Wed, 23 Apr 2008 22:19:14 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-219864</guid>
		<description>I have used the private use area in string processing code. I wanted to process strings, but not to touch certain special substrings that may be present in the string. So I replaced the special substrings with private use codes according to a translation table, then processed the string, then translated the private use codes back to the special substrings. I assumed here that the string did not originally contain any private use codes. My use of the private use area was completely transient. Does that seem reasonable?</description>
		<content:encoded><![CDATA[<p>I have used the private use area in string processing code. I wanted to process strings, but not to touch certain special substrings that may be present in the string. So I replaced the special substrings with private use codes according to a translation table, then processed the string, then translated the private use codes back to the special substrings. I assumed here that the string did not originally contain any private use codes. My use of the private use area was completely transient. Does that seem reasonable?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elliotte Rusty Harold</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-212626</link>
		<dc:creator>Elliotte Rusty Harold</dc:creator>
		<pubDate>Wed, 02 Apr 2008 14:12:19 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-212626</guid>
		<description>All I can say is that in nearly all the cases where I&#039;ve seen developers use the private use area, it&#039;s been a mistake, and caused far more pain than it alleviated. Now that almost all characters in day-to-day use have been encoded, creating your own character codes is rarely the right solution to any problem. What the right solution is, I couldn&#039;t tell you without knowing what your problem is.</description>
		<content:encoded><![CDATA[<p>All I can say is that in nearly all the cases where I&#8217;ve seen developers use the private use area, it&#8217;s been a mistake, and caused far more pain than it alleviated. Now that almost all characters in day-to-day use have been encoded, creating your own character codes is rarely the right solution to any problem. What the right solution is, I couldn&#8217;t tell you without knowing what your problem is.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: SusanJ</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-212621</link>
		<dc:creator>SusanJ</dc:creator>
		<pubDate>Wed, 02 Apr 2008 13:45:48 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-212621</guid>
		<description>I&#039;m not sure I understand the prohibition on the Private Use Area. I have an application where I need to create my own character codes. What should I do?</description>
		<content:encoded><![CDATA[<p>I&#8217;m not sure I understand the prohibition on the Private Use Area. I have an application where I need to create my own character codes. What should I do?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elliotte Rusty Harold</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-210558</link>
		<dc:creator>Elliotte Rusty Harold</dc:creator>
		<pubDate>Thu, 27 Mar 2008 15:34:55 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-210558</guid>
		<description>And that was relevant in 1987. Today, who really cares? Doubling the size of text (and only text) just doesn&#039;t matter any more. It&#039;s not plain text that&#039;s causing network neutrality disputes and filled hard drives.

In fact, in many circumstances, including transmission over HTTP, encoding Russian in UTF-8 does not double its size. Even encoding it in UTF-32 wouldn&#039;t double its size. HTTP and modern HTTP servers and clients are a lot smarter than that. Chances are pure Russian text is going out across the network in less than one byte per character no matter which encoding you use.</description>
		<content:encoded><![CDATA[<p>And that was relevant in 1987. Today, who really cares? Doubling the size of text (and only text) just doesn&#8217;t matter any more. It&#8217;s not plain text that&#8217;s causing network neutrality disputes and filled hard drives.</p>
<p>In fact, in many circumstances, including transmission over HTTP, encoding Russian in UTF-8 does not double its size. Even encoding it in UTF-32 wouldn&#8217;t double its size. HTTP and modern HTTP servers and clients are a lot smarter than that. Chances are pure Russian text is going out across the network in less than one byte per character no matter which encoding you use.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Thornton</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-210553</link>
		<dc:creator>Mark Thornton</dc:creator>
		<pubDate>Thu, 27 Mar 2008 15:26:41 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-210553</guid>
		<description>UTF-8 roughly doubles the size of Russian text and other scripts which usually have a single-byte code page but have all common letters with codes &gt; 127. True, the effect on Chinese is less pronounced.</description>
		<content:encoded><![CDATA[<p>UTF-8 roughly doubles the size of Russian text and other scripts which usually have a single-byte code page but have all common letters with codes &gt; 127. True, the effect on Chinese is less pronounced.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elliotte Rusty Harold</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-207986</link>
		<dc:creator>Elliotte Rusty Harold</dc:creator>
		<pubDate>Thu, 20 Mar 2008 03:55:52 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-207986</guid>
		<description>That&#039;s a common misconception. UTF-8 is perfectly fine for all languages supported in Unicode. In fact, it has a number of very nice properties that make it superior for all scripts. See &lt;a href=&#039;http://www-128.ibm.com/developerworks/xml/library/x-utf8/&#039; rel=&quot;nofollow&quot;&gt;this article&lt;/a&gt; for more details.</description>
		<content:encoded><![CDATA[<p>That&#8217;s a common misconception. UTF-8 is perfectly fine for all languages supported in Unicode. In fact, it has a number of very nice properties that make it superior for all scripts. See <a href='http://www-128.ibm.com/developerworks/xml/library/x-utf8/' rel="nofollow">this article</a> for more details.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Thornton</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-207752</link>
		<dc:creator>Mark Thornton</dc:creator>
		<pubDate>Wed, 19 Mar 2008 09:34:09 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-207752</guid>
		<description>Remember that UTF-8 is not a very nice encoding for non-Western (e.g. Russian, Chinese, etc.) languages.</description>
		<content:encoded><![CDATA[<p>Remember that UTF-8 is not a very nice encoding for non-Western (e.g. Russian, Chinese, etc.) languages.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Oren</title>
		<link>http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/comment-page-1/#comment-206946</link>
		<dc:creator>Oren</dc:creator>
		<pubDate>Mon, 17 Mar 2008 21:13:06 +0000</pubDate>
		<guid isPermaLink="false">http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/#comment-206946</guid>
		<description>Tim, I never said non-BMP characters can be ignored. There are lots of things in Unicode that look like a single &quot;character&quot; (glyph) but are actually a sequence of a base character followed by combining characters. Even using the fully composed form eliminates only some of them. So what difference does it make if the sequence is a surrogate pair or a combining sequence? In both cases it&#039;s something your program should not be messing with if it doesn&#039;t understand the nuances. The most it can do safely is to concatenate such strings or maybe split them on well-defined separator characters so you know you will not be splitting a multicharacter sequence in the middle.

But that is all most application software really ever does with strings, anyway. Software that really needs to process them as individual codepoints or glyphs is not written very often. Writing such software requires a good understanding of Unicode concepts beyond this issue, anyway (e.g. the difference between codepoints and glyphs).</description>
		<content:encoded><![CDATA[<p>Tim, I never said non-BMP characters can be ignored. There are lots of things in Unicode that look like a single &#8220;character&#8221; (glyph) but are actually a sequence of a base character followed by combining characters. Even using the fully composed form eliminates only some of them. So what difference does it make if the sequence is a surrogate pair or a combining sequence? In both cases it&#8217;s something your program should not be messing with if it doesn&#8217;t understand the nuances. The most it can do safely is to concatenate such strings or maybe split them on well-defined separator characters so you know you will not be splitting a multicharacter sequence in the middle.</p>
<p>But that is all most application software really ever does with strings, anyway. Software that really needs to process them as individual codepoints or glyphs is not written very often. Writing such software requires a good understanding of Unicode concepts beyond this issue, anyway (e.g. the difference between codepoints and glyphs).</p>
]]></content:encoded>
	</item>
</channel>
</rss>

