The Ten Commandments of Unicode

1. I am Unicode, thy character set. Thou shalt have no other character sets before me.

2. Thou shalt carefully specify the character encoding and the character set whenever reading a text file.

3. Thou shalt not refer to any 8-bit character set as “ASCII”.

4. Thou shalt ensure that all string handling functions fully support characters from beyond the Basic Multilingual Plane. Thou shalt not refer to Unicode as a two-byte character set.

5. Thou shalt plan for additions of future characters to Unicode.

6. Thou shalt count and index Unicode characters, not UTF-16 code points.

7. Thou shalt use UTF-8 as the preferred encoding wherever possible.

8. Thou shalt generate all text in Normalization Form C whenever possible.

9. Thou shalt avoid deprecated characters.

10. Thou shalt not enter the private use area.

This entry was posted on Friday, March 7th, 2008 at 12:17 pm and is filed under Programming.

John Cowan Says:
March 7th, 2008 at 1:22 pm

I think #7 should say “the preferred external encoding”. There are a lot of environments in which UTF-16 is overwhelmingly dominant (the Windows API, Java, C#, JavaScript) and trying to use UTF-8 internally is just much more work than it’s worth.

Corey Says:
March 8th, 2008 at 11:30 am

Related to #2, I wish to god they would deprecate String(byte[]) and String.getBytes(). Not specifying the character set is a bug waiting to happen when dealing with cross-platform applications.

Jerry "these boots" Amayedfore Says:
March 9th, 2008 at 5:16 pm

I wish “they” (i.e. Corey) would not assume that Java is the only programming language.

Brian Says:
March 9th, 2008 at 9:12 pm

Nothing about normalization?

Regarding #2: How can the consumer specify the encoding if the producer didn’t specify it (either externally, or with a BOM)?

John Cowan Says:
March 9th, 2008 at 10:47 pm

Brian: #8 is about normalization. As for #2, I take it to mean “Don’t assume, when opening an input stream, that the system default is suitable: provide an appropriate encoding yourself.”
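
A minimal sketch of that reading of #2, in Python. The file name and sample text are hypothetical; the point is only that the encoding is named on both the write and the read, and that the same bytes decoded under a different assumption come back silently wrong.

```python
import os
import tempfile

text = "naïve café, Привет"

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# The same bytes read under the wrong assumption: no error is raised,
# the text is simply garbled.
with open(path, "r", encoding="latin-1") as f:
    misread = f.read()

print(round_tripped == text)
print(misread == text)
```

The misread case is the dangerous one because nothing fails; the damage only shows up later, when someone looks at the text.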

Oren Says:
March 16th, 2008 at 9:44 am

3.5. Thou shalt not refer to any 8-bit character set as “ANSI”.

Oren Says:
March 16th, 2008 at 9:52 am

IMHO, asking everyone to fully support characters beyond the BMP is a bit of overkill. Unless you are writing software for certain special niche applications you should be fine as long as your code:

Does not choke on surrogate pairs.
Correctly preserves surrogate pairs when saving files.
NEVER emits a lone surrogate character.
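
The last of these rules can be checked mechanically. The following is only a sketch in Python, with a hypothetical helper name `is_well_formed`; it relies on the fact that Python's strict UTF-8 codec refuses to encode surrogate code points (U+D800 to U+DFFF).

```python
def is_well_formed(s: str) -> bool:
    """Return True if s contains no lone surrogate code points."""
    try:
        s.encode("utf-8")  # the strict UTF-8 codec rejects surrogates
        return True
    except UnicodeEncodeError:
        return False


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a character outside the BMP
lone = "\ud834"      # a high surrogate with no low surrogate after it

# A non-BMP character survives a round trip through UTF-16 intact.
preserved = clef.encode("utf-16-le").decode("utf-16-le") == clef

print(is_well_formed(clef), is_well_formed(lone), preserved)
```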

Mark Says:
March 16th, 2008 at 11:21 am

11. Thou shalt prefer to use ASCII if Unicode can be avoided

tim Says:
March 16th, 2008 at 2:56 pm

But in most cases, you don’t know what “niche” your application will be used in. Of all the programs I’ve written in my life, across many industries, the only one which could (today) safely ignore non-BMP was aircraft design software, and that’s because they used upper-case ASCII for everything (in the 21st century … yeah, I don’t get it either).

Supporting all of Unicode isn’t that hard. If your library (or program) doesn’t support it, it’s going to be a royal pain for a lot of us who want to use it.

In 1990 supporting just ASCII was fine. Then the world grew up, and we got Unicode. You get no pity from me for not supporting all of Unicode in 2008.

Slava Pestov Says:
March 16th, 2008 at 11:34 pm

It’s funny that Java violates 2, 3, 4, 6, 7, 8 on that list.

Oren Says:
March 17th, 2008 at 4:13 pm

Tim, I never said non-BMP characters can be ignored. There are lots of things in Unicode that look like a single “character” (glyph) but are actually a sequence of a base character followed by combining characters. Even using fully composed form eliminates only some of them. So what difference does it make if the sequence is a surrogate pair or a combining sequence? In both cases it’s something your program should not be messing with if it doesn’t understand the nuances. The most it can do safely is to concatenate such strings or maybe split them on well-defined separator characters so you know you will not be splitting a multicharacter sequence in the middle.

But that is all most application software really ever does with strings, anyway. Software that really needs to process them as individual codepoints or glyphs is not written very often. Writing such software requires good understanding of Unicode concepts beyond this issue, anyway (e.g. the difference between codepoints and glyphs).
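
The combining-sequence point can be sketched in Python with the standard `unicodedata` module. The sample strings are chosen for illustration: `e` plus a combining acute accent has a precomposed form, while `q` plus the same accent, as far as I know, does not.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, displayed as one glyph
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# No precomposed "q with acute" exists, so NFC leaves two code points.
still_two = unicodedata.normalize("NFC", "q\u0301")

# Slicing by index can cut a sequence in half and drop the accent.
broken = decomposed[:1]

# Splitting on a well-defined separator keeps each sequence intact.
parts = "cafe\u0301,q\u0301".split(",")

print(len(decomposed), len(composed), len(still_two))
```

Concatenation and splitting on a separator such as a comma never land inside a sequence, which is why those operations are safe even for code that knows nothing about combining characters.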

Mark Thornton Says:
March 19th, 2008 at 4:34 am

Remember that UTF-8 is not a very nice encoding for non-Western (e.g. Russian, Chinese, etc.) languages.

Elliotte Rusty Harold Says:
March 19th, 2008 at 10:55 pm

That’s a common misconception. UTF-8 is perfectly fine for all languages supported in Unicode. In fact, it has a number of very nice properties that make it superior for all scripts. See this article for more details.

Mark Thornton Says:
March 27th, 2008 at 10:26 am

UTF-8 roughly doubles the size of Russian text and other scripts which usually have a single byte code page but have all common letters with codes > 127. True, the effect on Chinese is less pronounced.

Elliotte Rusty Harold Says:
March 27th, 2008 at 10:34 am

And that was relevant in 1987. Today, who really cares? Doubling the size of text (and only text), just doesn’t matter any more. It’s not plain text that’s causing network neutrality disputes and filled hard drives.

In fact, in many circumstances, including transmission over HTTP, encoding Russian in UTF-8 does not double its size. Even encoding it in UTF-32 wouldn’t double its size. HTTP and modern HTTP servers and clients are a lot smarter than that. Chances are pure Russian text is going out across the network in less than one byte per character no matter which encoding you use.
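
The size argument on both sides can be tried out with a rough sketch. Assumptions: the Russian sample sentence is arbitrary, windows-1251 stands in for "a single byte code page", and `zlib` stands in for whatever compression an HTTP server might apply. Repeating one sentence makes the text far more compressible than real prose, so the compressed figures show the direction of the effect, not realistic ratios.

```python
import zlib

# Arbitrary sample sentence, repeated to get a few kilobytes of text.
sentence = "Все люди рождаются свободными и равными в своем достоинстве и правах. "
text = sentence * 50

encodings = ["cp1251", "utf-8", "utf-32-le"]
raw = {name: text.encode(name) for name in encodings}
compressed = {name: zlib.compress(data) for name, data in raw.items()}

print("characters:", len(text))
for name in encodings:
    print(name, "raw:", len(raw[name]), "compressed:", len(compressed[name]))
```

Uncompressed, UTF-8 lands somewhere between one and two bytes per character here, because spaces and punctuation still take a single byte.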

SusanJ Says:
April 2nd, 2008 at 8:45 am

I’m not sure I understand the prohibition on the Private Use Area. I have an application where I need to create my own character codes. What should I do?

Elliotte Rusty Harold Says:
April 2nd, 2008 at 9:12 am

All I can say is that in nearly all the cases where I’ve seen developers use the private use area, it’s been a mistake, and caused far more pain than it alleviated. Now that almost all characters in day-to-day use have been encoded, creating your own character codes is rarely the right solution to any problem. What the right solution is, I couldn’t tell you without knowing what your problem is.

Bennett Says:
April 23rd, 2008 at 5:19 pm

I have used the private use area in string processing code. I wanted to process strings, but not to touch certain special substrings that may be present in the string. So I replaced the special substrings with private use codes according to a translation table, then processed the string, then translated the private use codes back to the special substrings. I assumed here that the string did not originally contain any private use codes. My use of the private use area was completely transient. Does that seem reasonable?

Elliotte Rusty Harold Says:
April 2nd, 2009 at 7:14 am

Not really, I’m afraid. There are a lot of things that could go wrong with that process.
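
The reply does not say which things. One plausible failure, offered here as an assumption and not as the author's reasoning, is that the input already contains a private use character, so restoring placeholders afterwards corrupts it. A Python sketch with hypothetical helper names (`protect`, `restore`) that at least detects this case:

```python
import unicodedata

PLACEHOLDER = "\uE000"  # first code point of the BMP Private Use Area


def contains_private_use(s: str) -> bool:
    """True if any character has general category Co (private use)."""
    return any(unicodedata.category(ch) == "Co" for ch in s)


def protect(s: str, special: str) -> str:
    """Swap the special substring for a placeholder, refusing unsafe input."""
    if contains_private_use(s):
        raise ValueError("input already contains private use characters")
    return s.replace(special, PLACEHOLDER)


def restore(s: str, special: str) -> str:
    """Put the special substring back after processing."""
    return s.replace(PLACEHOLDER, special)


# Upper-case everything except the protected "{{x}}" marker.
result = restore(protect("a {{x}} b", "{{x}}").upper(), "{{x}}")
print(result)
```

The check removes the silent-corruption case but not the underlying fragility: any processing step that drops, reorders, or normalizes away the placeholder still breaks the round trip.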

Michael Doran Says:
May 12th, 2009 at 3:08 pm

I’ve tended towards Form D (Canonical Decomposition) as being the more desirable Unicode normalization form. I am curious as to the rationale for recommending Form C (Canonical Decomposition, followed by Canonical Composition).

McDowell Says:
April 19th, 2013 at 10:36 am

Shouldn’t…

…be…

6. Thou shalt count and index Unicode code points, not UTF-16 code units.
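
The difference between the two counts can be seen in a short Python sketch, using the Unicode Standard's terms: a code point is a value in the Unicode codespace, and a code unit is the fixed-size unit of a particular encoding form. Python's `len` on a `str` counts code points; languages whose strings are sequences of UTF-16 code units, such as Java and JavaScript, report the second number.

```python
s = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'

code_points = len(s)                            # one per character here
utf16_units = len(s.encode("utf-16-le")) // 2   # the clef needs a surrogate pair
utf8_units = len(s.encode("utf-8"))             # the clef needs four bytes

print(code_points, utf16_units, utf8_units)
```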

Elliotte Rusty Harold Says:
May 11th, 2013 at 7:39 am

Code points are an element of a particular encoding such as UTF-16, not an element of the Unicode character set.

What do I need to know about Unicode? | ASK AND ANSWER Says:
December 19th, 2015 at 4:16 am

[…] I also like Elliotte Rusty Harold’s Ten Commandments of Unicode. […]

GZIPInputStream reading line by line | ASK AND ANSWER Says:
January 12th, 2016 at 4:15 pm

[…] to explicitly specify the encoding is against the second commandment. Use the default encoding at your […]

What is XML BOM and how do I detect it? - QuestionFocus Says:
December 1st, 2017 at 9:51 am

[…] advocate encoding as Unicode wherever possible (see also the 10 Commandments of Unicode). That said, XML allows the representation of any Unicode character via escape entities (e.g. […]
