Comparing Strings For Equality

Java’s slogan is “Write once, run anywhere”; but perhaps it should be, “Write once, run anywhere except Turkey.” Java is a wonderful programming language that’s loved and adored around the world, but not in Turkey, a nation of more than 60 million people. Nor is Java all that popular with the millions of Turkish speakers outside of Turkey. Sun didn’t use the Turkish flag to diaper the tiger cub at JavaOne 2004, but for all the adoption Java’s seen in Turkey, they might as well have. It’s not just the Java language that’s unpopular, either. Most programs written in Java that run just fine in the United States exhibit intolerable bugs in Turkey.

The problem is the Turkish alphabet. Unlike most other languages, in Turkish the upper case form of the letter i is not I. Rather it’s İ, the dotted upper case i. In reverse, the lower case of I is not i as it is in English. Rather it’s ı, the dotless i. I with a dot and I without a dot are two separate letters in Turkish, and the upper and lower case forms reflect this.

To be fair, the problem is not really the Turkish alphabet, which is just fine, especially if you’re writing Turkish. The problem is that Java is over-localized: it insists on using the local alphabet even when the English alphabet is required. In particular, the String class’s no-argument toUpperCase() method turns i into İ when run in a Turkish locale, but into I in the rest of the world. Similarly, "I".toLowerCase() returns "i" in most of the world, except in Turkey, where it returns "ı" instead.
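The difference is easy to see without reconfiguring your machine, because both methods also accept an explicit Locale. The following is a minimal sketch; the class name TurkishCaseDemo and the TURKISH constant are illustrative names, not part of any library. It compares against Unicode escapes rather than printing the letters, so the result does not depend on your console encoding.

```java
import java.util.Locale;

// Hypothetical demo class; the name is illustrative only.
class TurkishCaseDemo {
    // Turkish locale built from a BCP 47 language tag.
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    public static void main(String[] args) {
        // English rules: i <-> I
        System.out.println("i".toUpperCase(Locale.ENGLISH).equals("I")); // true
        System.out.println("I".toLowerCase(Locale.ENGLISH).equals("i")); // true

        // Turkish rules: i -> U+0130 (dotted capital I), I -> U+0131 (dotless small i)
        System.out.println("i".toUpperCase(TURKISH).equals("\u0130"));   // true
        System.out.println("I".toLowerCase(TURKISH).equals("\u0131"));   // true
    }
}
```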

This is exactly what should happen if you’re using toLowerCase() and toUpperCase() to format Turkish strings for display to end users; for instance, uppercasing a headline on a web page. However, this is not why programmers usually call toLowerCase() and toUpperCase(). The common use of these two methods is to compare strings for equality in a case-insensitive fashion. For instance, when comparing domain names, a programmer might write something like this:

String domain1 = "www.infinitecat.com";
String domain2 = "www.InfiniteCat.com";
if (domain1.toLowerCase().equals(domain2.toLowerCase())) {
  System.out.println("The domains are the same.");
}
else {
  System.out.println("The domains are different.");
}

If you run this code on a computer running a Turkish version of Windows or Linux or Mac OS X, the lowercase form of www.InfiniteCat.com is not www.infinitecat.com. Rather it’s www.ınfinitecat.com. It’s off by only one dot, but that dot is enough to introduce a bug into the program. Gotcha!
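You can reproduce this on any machine by making the locale explicit. The sketch below assumes a hypothetical helper named sameDomain that performs the same comparison as the code above, but with the locale passed in as a parameter instead of taken from the system default.

```java
import java.util.Locale;

// Hypothetical demo class; sameDomain is an illustrative helper, not a library method.
class DomainCompareDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Same comparison as in the article, with the locale made explicit.
    static boolean sameDomain(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        String domain1 = "www.infinitecat.com";
        String domain2 = "www.InfiniteCat.com";

        // English rules: both lowercase to www.infinitecat.com
        System.out.println(sameDomain(domain1, domain2, Locale.ENGLISH)); // true

        // Turkish rules: the capital I in domain2 becomes dotless U+0131
        System.out.println(sameDomain(domain1, domain2, TURKISH));        // false
    }
}
```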

There are many other examples of strings where case doesn’t matter. URL schemes like mailto and MAILTO, HTML element names like div and DIV, and DOS and Macintosh file names like README.txt are just a few of the many uses of case folding in algorithms where Turkish rules can trip you up.

Comparing strings for equality in a case-insensitive fashion requires a little more than simply converting both to upper case or both to lower case. Because case conversion is a language-sensitive operation, the comparison must be done in the language the strings are written in. Strings that are not meant for display to the end user are almost never written in Turkish. More often than not they’re written in English, but whatever language they’re written in, they need to be compared in that language, not Turkish or whatever other language the local system is configured to use.

The irony is that Sun was trying to do the right thing here. They thought that by making toLowerCase() and toUpperCase() locale-aware, they’d improve the internationalization of Java code. Sadly the effect was the opposite. Strings shown to the user (and functions that operate on those strings) do need to be localized. However, strings that exist purely in the code should not be. toLowerCase() and toUpperCase() are much more commonly used for code logic than end user presentation.

The fix is simple: don’t use the no-args versions of toLowerCase() and toUpperCase() as part of your program logic. Instead, pass in a Locale object that matches the string’s language. For example, if the language is English, you can compare strings like this:

if (domain1.toLowerCase(Locale.ENGLISH)
  .equals(domain2.toLowerCase(Locale.ENGLISH))) {
//…

The strings can be in another language besides English, of course. Just make sure that the locale matches whatever language you’re using for the strings (even Turkish). That way the code will give the same results no matter which country it’s run in. Always specifying a locale when comparing strings will open up a whole new market for your program. The Turks will thank you.
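Here is a fuller sketch of that advice. The helper equalsIgnoringCase is a hypothetical name introduced for illustration. The main method temporarily switches the JVM default locale to Turkish to simulate a Turkish machine, then restores it. For strings that belong to no particular language, Locale.ROOT is another constant you could pass instead of Locale.ENGLISH.

```java
import java.util.Locale;

// Hypothetical demo class; equalsIgnoringCase is an illustrative helper.
class LocaleSafeCompare {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Case-insensitive equality under the rules of an explicitly chosen locale.
    static boolean equalsIgnoringCase(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        String domain1 = "www.infinitecat.com";
        String domain2 = "www.InfiniteCat.com";

        Locale original = Locale.getDefault();
        try {
            // Simulate a machine configured for Turkish.
            Locale.setDefault(TURKISH);

            // The no-args version picks up the Turkish default and fails.
            System.out.println(domain1.toLowerCase().equals(domain2.toLowerCase()));  // false

            // The explicit-locale version is unaffected by the default.
            System.out.println(equalsIgnoringCase(domain1, domain2, Locale.ENGLISH)); // true
        } finally {
            Locale.setDefault(original); // always restore global state
        }

        // Turkish strings compare correctly when the locale matches the language:
        // "ISPARTA" lowercases to "\u0131sparta" (dotless i) under Turkish rules.
        System.out.println(equalsIgnoringCase("ISPARTA", "\u0131sparta", TURKISH));        // true
        System.out.println(equalsIgnoringCase("ISPARTA", "\u0131sparta", Locale.ENGLISH)); // false
    }
}
```

Passing the locale as a parameter keeps the choice visible at every call site, which is the point of the fix: the result depends on the language of the strings, not on where the program happens to run.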

13 Responses to “Comparing Strings For Equality”

  1. Oliver Says:

    toLowerCase()

    It’s a bit like re-inventing the wheel, but Peter Norvig has written a version of toLowerCase() which avoids the Turkish-i/I problem and is also a lot faster than the standard implementation. Saves one from having to specify the locale each time. It can be found here:

    http://www.cs.biu.ac.il/~yakira/goodies/ifaq.html#tolowe

  2. edavies Says:

    Trivia

    Both branches of the if in your code fragment print “The domains are the same.”.

  3. Neil Says:

    Case-insignificant comparison

    Does the compareToIgnoreCase() method have the same issues, or does that always work?

  4. Elliotte Rusty Harold Says:

    Re: Case-insignificant comparison

    The JavaDoc for compareToIgnoreCase() says:

    This method returns an integer whose sign is that of calling compareTo with normalized versions of the strings where case differences have been eliminated by calling Character.toLowerCase(Character.toUpperCase(character)) on each character.

    Assuming the algorithm is implemented as specified, (and a quick peek at the source code shows it is, at least in the version of Sun’s JDK I have handy) it would have the same issue.

  5. jdf Says:

    Don’t create new strings just to compare them! You’d be better served by String.regionMatches().

  6. Tim Says:

    equalsIgnoreCase(String) also suffers the same problem. I’ve always used string1.equalsIgnoreCase(string2) thinking that it would take care of the messiness of case comparison. A quick look at the javadoc and source code suggests that it suffers the same problem. All it does is compare both the uppercase and lowercase version of each character. I wonder why they wouldn’t have added an equalsIgnoreCase(String, Locale) method? Also with regards to the link to Peter Norvig’s toLowerCase() implementation: I don’t think it solves the Turkish problem, but merely ignores it. In fact the javadoc of the given source says:

    Warning: Don’t use this method when your default locale is Turkey.

  7. edavies Says:

    Different letters in Unicode

    This problem also illustrates the uneasy issue of when two symbols are regarded as the “same” in Unicode. For example, the Cyrillic capital letter es (U+0421) looks like (is a homoglyph of) the Latin capital letter c (U+0043) but the two are, quite reasonably, regarded as different. Given the distinction between the dotted and undotted letter i in Turkish maybe even the dotted lower case form and the undotted upper case form should have been regarded as distinct from their Latin homoglyphs. Of course, it’s difficult to know where to draw the line – an English letter a is presumably the “same” as a French one, isn’t it? However, different case conversion rules seem to me to be enough to trigger a separation. Except then, of course, we need to worry whether accented lower case letters in Canadian French are different from those in European French. Hmm, not easy. Now off to read RFCs 3490, 3491 and 3492 on IDNs to see what all that fuss is about.

  8. John Cowan Says:

    edavies, that was thought about when Unicode was set up and repeatedly since. The trouble is that there is just too much legacy data (mostly in 8859-9 encoding) that doesn’t distinguish between Turkish i and non-Turkish i. There is simply no hope of getting people to make such a distinction systematically and correctly, so the problem won’t go away.

    Also, hexagonal French is moving back toward preserving accents in uppercase letters. They were basically dropped just to accommodate typewriters, which didn’t have enough keys to provide them.

  10. danforth Says:

    Hey!

    Why don’t people use the Collator class?!?

    http://java.sun.com/j2se/1.4.2/docs/api/java/text/Collator.html

    “The Collator class performs locale-sensitive String comparison. You use this class to build searching and sorting routines for natural language text.”

  11. helle Says:

    Marvelous. Thanks, will spread this among my friends!

  12. Doug Held Says:

    I’ve recently received a lecture over dinner about how the Turkish i problem will never be solved. I suggested the same solution as above: “Different letters in Unicode” by edavies; but was reminded that the non Turkish, latin i is also in use in Turkey. For example, in product names and borrowed European names.

    When the European i is borrowed, Turkish users case it according to the Latin rules: i->I.

    My only suggestion is for Turks to remove ı and i from the keyboards, and just use the pipe character :-(

  13. cem Says:

    A true story…
    A small dot can change the meaning of certain words a lot in Turkish… from “get bored” to “get f**ked”.

    Such an incident occurred in the past and resulted in two deaths: one homicide (a woman stabbed to death by her husband) and one suicide (the husband killed himself in prison later).

    Cause: an SMS text was first misprinted by the hardware (which turned the letter “dotless i” into a regular English “i”); then the woman misinterpreted the message as a serious insult against her and her family’s honor in the heat of an ongoing argument with her husband… This led to a dispute in which the woman’s family attacked the husband and stabbed him in the chest with a knife… Then the wounded husband managed to seize the knife and stab his wife, leaving her severely wounded; she eventually died at the hospital.

    The SMS message which the husband sent from his phone, which properly supported the Turkish character set, was…
    “sıkışınca konuyu değiştiriyorsun”
    with the proper dotless i in the word “sıkışınca”; meaning “you change the topic when you are cornered”

    The SMS message which the wife received on her sub-par phone, which misinterpreted or printed the dotless i’s as regular English “i”s, giving the words a whole different meaning, was…
    “s*kisinca konuyu degistiriyorsun” meaning “you change the topic when you f**ked [with someone else]”

    The original turkish news link is (http://www.hurriyet.com.tr/gundem/8748359.asp?top=1).
    PS: I tried to translate the page with Google Translate (to English) first… but ended up rolling on the floor laughing. I wonder when Google Translate will stop translating Turkish to gibberish (or vice versa).
