The Ten Commandments of Unicode

1. I am Unicode, thy character set. Thou shalt have no other character sets before me.

2. Thou shalt carefully specify the character encoding and the character set whenever reading a text file.

3. Thou shalt not refer to any 8-bit character set as “ASCII”.

4. Thou shalt ensure that all string handling functions fully support characters from beyond the Basic Multilingual Plane. Thou shalt not refer to Unicode as a two-byte character set.

5. Thou shalt plan for additions of future characters to Unicode.

6. Thou shalt count and index Unicode characters, not UTF-16 code points.

7. Thou shalt use UTF-8 as the preferred encoding wherever possible.

8. Thou shalt generate all text in Normalization Form C whenever possible.

9. Thou shalt avoid deprecated characters.

10. Thou shalt not enter the private use area.
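
As a rough illustration of commandment 6, the sketch below counts the same string two ways. It assumes Python 3, where `str` is indexed by what the Unicode Standard calls code points; the helper names and the sample string are hypothetical, chosen only for this example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def character_count(text: str) -> int:
    """Number of Unicode code points; Python 3 strings index by code point."""
    return len(text)


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"
sample = "a" + clef + "b"

print(character_count(sample))   # 3
print(utf16_code_units(sample))  # 4
print(sample[1] == clef)         # True: index 1 is the whole clef
```

In an environment that indexes by 16-bit units, index 1 of the same text would be only half of the clef.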

23 Responses to “The Ten Commandments of Unicode”

  1. John Cowan Says:

    I think #7 should say “the preferred external encoding”. There are a lot of environments in which UTF-16 is overwhelmingly dominant (the Windows API, Java, C#, Javascript) and trying to use UTF-8 internally is just much more work than it’s worth.

  2. Corey Says:

    Related to #2, I wish to god they would deprecate String(byte[]) and String.getBytes(). Not specifying the character set is a bug waiting to happen when dealing with cross-platform applications.

  3. Jerry "these boots" Amayedfore Says:

    I wish “they” (i.e. Corey) would not assume that Java is the only programming language.

  4. Brian Says:

    Nothing about normalization?

    Regarding #2: How can the consumer specify the encoding if the producer didn’t specify it (either externally, or with a BOM)?

  5. John Cowan Says:

    Brian: #8 is about normalization. As for #2, I take it to mean “Don’t assume, when opening an input stream, that the system default is suitable: provide an appropriate encoding yourself.”
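
A minimal sketch of that reading of #2, assuming Python 3 and a temporary file created only for the demonstration (the sample text and file name are hypothetical). It names the encoding on both write and read, then shows what happens when the same bytes are decoded with a different encoding.

```python
import os
import tempfile

# Hypothetical sample data containing non-ASCII characters.
text = "naïve café, Привет"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Name the encoding when writing...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and again when reading, rather than relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # Decoding the same bytes as Latin-1 raises no error but yields mojibake.
    with open(path, "r", encoding="latin-1") as f:
        misread = f.read()

print(round_tripped == text)  # True
print(misread == text)        # False
```

The misread case is the dangerous one: nothing fails, the text is simply wrong.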

  6. Oren Says:

    3.5. Thou shalt not refer to any 8-bit character set as “ANSI”.

  7. Oren Says:

    IMHO, asking everyone to fully support characters beyond the BMP is a bit of an overkill. Unless you are writing software for certain special niche applications you should be fine as long as your code:

    1. Does not choke on surrogate pairs.
    2. Correctly preserves surrogate pairs when saving files.
    3. NEVER emits a lone surrogate character.

  8. Mark Says:

    11. Thou shalt prefer to use ASCII if Unicode can be avoided

  9. tim Says:

    But in most cases, you don’t know what “niche” your application will be used in. Of all the programs I’ve written in my life, across many industries, the only one which could (today) safely ignore non-BMP was aircraft design software, and that’s because they used upper-case ASCII for everything (in the 21st century … yeah, I don’t get it either).

    Supporting all of Unicode isn’t that hard. If your library (or program) doesn’t support it, it’s going to be a royal pain for a lot of us who want to use it.

    In 1990 supporting just ASCII was fine. Then the world grew up, and we got Unicode. You get no pity from me for not supporting all of Unicode in 2008.

  10. Slava Pestov Says:

    It’s funny that Java violates 2, 3, 4, 6, 7, 8 on that list.

  11. Oren Says:

    Tim, I never said non-BMP characters can be ignored. There are lots of things in Unicode that look like a single “character” (glyph) but are actually a sequence of a base character followed by combining characters. Even using fully composed form eliminates only some of them. So what difference does it make if the sequence is a surrogate pair or a combining sequence? In both cases it’s something your program should not be messing with if it doesn’t understand the nuances. The most it can do safely is to concatenate such strings or maybe split them on well-defined separator characters so you know you will not be splitting a multicharacter sequence in the middle.

    But that is all most application software really ever does with strings, anyway. Software that really needs to process them as individual codepoints or glyphs is not written very often. Writing such software requires good understanding of unicode concepts beyond this issue, anyway (e.g. the difference between codepoints and glyphs).
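
The point about combining sequences can be sketched as follows, assuming Python 3 and its standard `unicodedata` module; the sample words are hypothetical. Cutting at an arbitrary index strands a combining mark, while joining and splitting on a plain separator leaves each sequence whole.

```python
import unicodedata

# "cafe" + U+0301 COMBINING ACUTE ACCENT displays as "café" but is 5 code points.
word = "cafe\u0301"

# Cutting at an arbitrary index strands the accent without its base letter.
head, tail = word[:4], word[4:]
print(head)                                 # cafe
print(unicodedata.combining(tail[0]) != 0)  # True: tail starts with a combining mark

# Concatenation and splitting on a plain separator leave sequences intact.
other = "nai\u0308ve"
joined = ",".join([word, other])
parts = joined.split(",")
print(parts == [word, other])               # True
```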

  12. Mark Thornton Says:

    Remember that UTF-8 is not a very nice encoding for non-Western (e.g. Russian, Chinese, etc.) languages.

  13. Elliotte Rusty Harold Says:

    That’s a common misconception. UTF-8 is perfectly fine for all languages supported in Unicode. In fact, it has a number of very nice properties that make it superior for all scripts. See this article for more details.

  14. Mark Thornton Says:

    UTF-8 roughly doubles the size of Russian text and other scripts which usually have a single byte code page but have all common letters with codes > 127. True, the effect on Chinese is less pronounced.
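
The size claim is easy to measure. The sketch below assumes Python 3; the sample phrases are hypothetical, and the legacy encodings (`cp1251` for Russian, `gbk` for Chinese) are picked here only as common points of comparison.

```python
def encoded_sizes(text, legacy_encoding):
    """Return (code points, legacy bytes, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),
        len(text.encode(legacy_encoding)),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


print(encoded_sizes("Hello, world", "ascii"))  # (12, 12, 12, 24)
print(encoded_sizes("Привет, мир", "cp1251"))  # (11, 11, 20, 22)
print(encoded_sizes("你好，世界", "gbk"))        # (5, 10, 15, 10)
```

For these samples the Russian phrase grows from 11 to 20 bytes in UTF-8 and the Chinese phrase from 10 to 15.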

  15. Elliotte Rusty Harold Says:

    And that was relevant in 1987. Today, who really cares? Doubling the size of text (and only text) just doesn’t matter any more. It’s not plain text that’s causing network neutrality disputes and filled hard drives.

    In fact, in many circumstances, including transmission over HTTP, encoding Russian in UTF-8 does not double its size. Even encoding it in UTF-32 wouldn’t double its size. HTTP and modern HTTP servers and clients are a lot smarter than that. Chances are pure Russian text is going out across the network in less than one byte per character no matter which encoding you use.
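
One way to check the compression argument is to measure it. The sketch below assumes Python 3 and uses `zlib` (the DEFLATE algorithm behind gzip) as a stand-in for HTTP content compression; the Russian sample text is hypothetical. How small the result gets depends heavily on the length and content of the text, so the sketch only prints the figures rather than promising a particular ratio.

```python
import zlib

# Hypothetical sample: a few Russian sentences, used only for measurement.
text = (
    "Юникод задуман как единый набор символов для всех письменностей мира. "
    "Кодировка UTF-8 записывает каждую русскую букву двумя байтами. "
    "Однобайтовые кодировки записывают ту же букву одним байтом. "
    "При передаче по сети текст обычно сжимается, и разница заметно уменьшается."
)


def sizes(encoding):
    """Return (raw byte count, zlib-compressed byte count) for the sample."""
    raw = text.encode(encoding)
    return len(raw), len(zlib.compress(raw, 9))


for encoding in ("cp1251", "utf-8", "utf-32-le"):
    raw_size, packed_size = sizes(encoding)
    print(f"{encoding}: {raw_size} bytes raw, {packed_size} bytes compressed, "
          f"{packed_size / len(text):.2f} compressed bytes per character")
```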

  16. SusanJ Says:

    I’m not sure I understand the prohibition on the Private Use Area. I have an application where I need to create my own character codes. What should I do?

  17. Elliotte Rusty Harold Says:

    All I can say is that in nearly all the cases where I’ve seen developers use the private use area, it’s been a mistake, and caused far more pain than it alleviated. Now that almost all characters in day-to-day use have been encoded, creating your own character codes is rarely the right solution to any problem. What the right solution is, I couldn’t tell you without knowing what your problem is.

  18. Bennett Says:

    I have used the private use area in string processing code. I wanted to process strings, but not to touch certain special substrings that may be present in the string. So I replaced the special substrings with private use codes according to a translation table, then processed the string, then translated the private use codes back to the special substrings. I assumed here that the string did not originally contain any private use codes. My use of the private use area was completely transient. Does that seem reasonable?
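
For readers trying to picture the approach described in this comment, here is a rough sketch assuming Python 3. The function names, the `{name}` marker, and the choice of U+E000 as the first placeholder are all hypothetical. The comment's stated assumption is turned into an explicit check, and the reply that follows advises against the technique.

```python
import unicodedata

PUA_START = 0xE000  # first code point of the BMP Private Use Area


def contains_private_use(text):
    """True if any character has general category Co (private use)."""
    return any(unicodedata.category(ch) == "Co" for ch in text)


def protect(text, specials):
    """Swap each special substring for a private use placeholder."""
    if contains_private_use(text):
        raise ValueError("input already contains private use characters")
    table = {}
    for i, special in enumerate(specials):
        placeholder = chr(PUA_START + i)
        table[placeholder] = special
        text = text.replace(special, placeholder)
    return text, table


def restore(text, table):
    """Translate placeholders back to the original substrings."""
    for placeholder, special in table.items():
        text = text.replace(placeholder, special)
    return text


masked, table = protect("Hello {name}, welcome!", ["{name}"])
processed = masked.upper()        # the processing step; placeholder is untouched
print(restore(processed, table))  # HELLO {name}, WELCOME!
```

The sketch fails loudly when the input already contains private use characters, which is the one assumption the comment names; it does nothing about other hazards, such as processing steps that drop or reorder characters.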

  19. Elliotte Rusty Harold Says:

    Not really, I’m afraid. There are a lot of things that could go wrong with that process.

  20. Michael Doran Says:

    8. Thou shalt generate all text in Normalization Form C whenever possible.

    I’ve tended towards Form D (Canonical Decomposition) as being the more desirable Unicode normalization form. I am curious as to the rationale for recommending Form C (Canonical Decomposition, followed by Canonical Composition).
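
The difference between the two forms can be seen in a short sketch, assuming Python 3 and its standard `unicodedata` module. It shows only what each form produces for one accented letter and takes no side on which is preferable.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(composed == decomposed)  # False: same glyph, different code points
print(nfc == composed)         # True
print(nfd == decomposed)       # True
print(len(nfc), len(nfd))      # 1 2
```

Comparing text only after normalizing both sides to the same form avoids the mismatch shown on the first line.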

  21. McDowell Says:

    Shouldn’t…

    6. Thou shalt count and index Unicode characters, not UTF-16 code points.

    …be…

    6. Thou shalt count and index Unicode code points, not UTF-16 code units.

  22. Elliotte Rusty Harold Says:

    Code points are an element of a particular encoding such as UTF-16, not an element of the Unicode character set.

  23. crab dip Says:

    Thanks for the good writeup. It if truth be told used to be a amusement account it.

    Look complicated to more introduced agreeable from you!
    By the way, how could we be in contact?
