Comments on: 1% Problems

By: Martin Valjavec

Martin Valjavec — Tue, 29 Jan 2013 16:19:30 +0000

Why is getBytes(Charsets.UTF_8) better than getBytes()? In my experience the real problem is that the developer often does not know what’s correct: often it is not even defined what should be correct. In that case getBytes() (the system default, whatever it is at execution time) might be the correct encoding to use and UTF-8 might be wrong … or maybe not. Who defines this? Perhaps not the programmer. And who defines how to find out who defines this? In some organizations: nobody.

So this is not a “programmers only” problem.

By: John Cowan

John Cowan — Sat, 11 Aug 2012 19:18:43 +0000

Explicit column sizes are a database smell. I don’t know of a single RDBMS that stores fixed-length strings differently from variable-length strings, so the only effect of declaring a fixed-length string as a table column is that you will sooner or later get into trouble. Just say VARCHAR(65535) or whatever the upper limit of your database is, and save yourself a world of annoyance.

By: Werner Slosse

Werner Slosse — Fri, 03 Aug 2012 10:48:08 +0000

Developing client-server applications, whether RMI, CORBA, SOAP, etc. is used, and/or uploading/downloading files between client and server, I encounter the getBytes() and new String(byte[]) problem regularly. When the character set encoding is not specified (and not set or overridden in JRE settings or command-line arguments), the JRE takes the default system encoding. The default on a (Western-European) WinXP is windows-1252, on Unix/Linux it often is ISO-8859-1 (to name two, in no particular order). Files make matters worse, as reading/writing files in text mode (Reader/Writer) has the same “problem”: the default encoding.
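
To make the getBytes() / new String(byte[]) pitfall described above concrete, here is a minimal sketch. It is not from the original comment; the class and method names are made up for illustration. It encodes with an explicitly named charset and shows what happens when the receiving side assumes a different one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: sender and receiver must agree on a named charset.
class ExplicitCharsetDemo {
    // Encode with a named charset instead of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same named charset the sender used.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Simulates a receiver that wrongly assumes ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Mueller" spelled with u-umlaut, written as an escape so the
        // encoding of this source file does not matter.
        String original = "M\u00FCller";
        byte[] wire = encode(original);

        System.out.println(wire.length);                           // 7: the umlaut takes two bytes in UTF-8
        System.out.println(decode(wire).equals(original));         // true
        System.out.println(decodeAsLatin1(wire).equals(original)); // false: the two bytes become two characters
    }
}
```

The bytes on the wire are identical in both cases; only the interpretation differs, which is why the mismatch tends to go unnoticed until non-ASCII text shows up.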

Java String is Unicode, for sure. But whenever converting from bytes to text (and vice versa), it’s a good idea to know what the encoding should be, and to apply it. A simple test: write a file on WinXP (or Vista/7) in Notepad, save it, and open it in IE (or another browser). In the browser one can often explicitly change the encoding, and check what text is shown. Do you actually know what your default OS encoding is? “Latin 1” is not the same on all systems, and ISO-8859-1 is not the same as windows-1252: note the differences in the C1 range (code points 0x80 to 0x9F).
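
The C1-range difference mentioned above can be checked directly. The following is a rough sketch with hypothetical names. It assumes the JRE ships a charset named "windows-1252", which is common but not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decode the same single byte under two "Latin 1" variants.
class C1RangeDemo {
    // Decodes one byte value (0..255) with the given charset and
    // returns the resulting Unicode code point.
    static int decodeSingleByte(int value, Charset charset) {
        String s = new String(new byte[] {(byte) value}, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JRE.
        Charset cp1252 = Charset.forName("windows-1252");

        // Print how each byte in the C1 range is interpreted by both charsets.
        for (int b = 0x80; b <= 0x9F; b++) {
            System.out.printf("0x%02X  ISO-8859-1: U+%04X  windows-1252: U+%04X%n",
                    b,
                    decodeSingleByte(b, StandardCharsets.ISO_8859_1),
                    decodeSingleByte(b, cp1252));
        }
    }
}
```

For example, byte 0x80 is the control character U+0080 in ISO-8859-1 but the euro sign U+20AC in windows-1252, while bytes outside the C1 range such as 0xFC (ü) decode identically in both.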

Besides the byte/text problem, there’s an additional layer: the font. Even if the bytes/text/encoding are correct, the text may not show correctly on screen, because the font used by the (G)UI does not know the [glyph for the] character. On the other side of the spectrum, there may be a database. And sure enough, a database engine/database/table/column has a character set encoding. E.g. running with a default of windows-1252, storing a Czech character in a database may not be possible. When UTF-8 is used as the encoding on the DB, the column size may not suffice (some databases will interpret a VARCHAR(16) as 16 bytes, not 16 characters; so if you want to store 16 non-US-ASCII chars in a varchar column, the size should be around 3 to 4 times 16).
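
The bytes-versus-characters sizing issue above is easy to measure before data reaches the database. This is a minimal sketch with made-up helper names. It only counts UTF-8 bytes and makes no claim about how any particular database interprets a column length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compare character count with UTF-8 byte count.
class ColumnSizeDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the UTF-8 form of the string fits in a column limited to maxBytes bytes.
    static boolean fitsInBytes(String s, int maxBytes) {
        return utf8Length(s) <= maxBytes;
    }

    // Small helper so the example also compiles on older Java versions.
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = repeat("A", 16);      // 16 US-ASCII letters
        String czech = repeat("\u010D", 16); // 16 times c-with-caron
        String euros = repeat("\u20AC", 16); // 16 euro signs

        System.out.println(utf8Length(ascii)); // 16
        System.out.println(utf8Length(czech)); // 32
        System.out.println(utf8Length(euros)); // 48
        System.out.println(fitsInBytes(czech, 16)); // false
    }
}
```

All three strings are 16 characters long, yet only the US-ASCII one fits into 16 bytes.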

All of the above leads me to often request Wireshark logs instead of application logs or UI screenshots. I want to see the raw bytes of the data between client and server. If the bytes are correct, the data is correct (e.g. payload in a CORBA message or an XML message). If there’s still a problem, it’s in the interpretation of those bytes.
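
When a network capture is not at hand, the same "look at the raw bytes" idea can be approximated in application logging. A small sketch, with a hypothetical helper name, that renders a payload as hex:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: log payload bytes as hex instead of as decoded text.
class HexDumpDemo {
    // Formats bytes as space-separated, two-digit uppercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00FC"; // u-umlaut, written as an escape

        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));      // C3 BC
        System.out.println(toHex(text.getBytes(StandardCharsets.ISO_8859_1))); // FC
    }
}
```

Seeing C3 BC versus FC in a log settles which encoding the sender actually used, independent of how a console or log viewer renders the text.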

Indeed, in that 1% where there’s an error in that area, it’s difficult to track down. One assumes an ID (input text) is US-ASCII, and all of a sudden an ID contains ü, causing trouble in the client or server application.
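
If an ID really is supposed to be US-ASCII, that assumption can be checked at the boundary instead of discovered later. A minimal sketch, with a made-up method name, using the standard CharsetEncoder.canEncode check:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: validate the "IDs are US-ASCII" assumption on input.
class AsciiIdCheck {
    // True if every character of the ID can be represented in US-ASCII.
    // A fresh encoder is created per call because encoders are not thread-safe.
    static boolean isUsAscii(String id) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(id);
    }

    public static void main(String[] args) {
        System.out.println(isUsAscii("ABC-123"));      // true
        System.out.println(isUsAscii("M\u00FCller"));  // false: contains u-umlaut
    }
}
```

Whether to reject such an ID or to widen the encoding is a policy decision; the check only makes the assumption visible.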

By: Sony Mathew

Sony Mathew — Thu, 26 Jul 2012 17:19:04 +0000

Perhaps you could elaborate on why Calendar and Date are error prone? Also, why must UTF-8 be specified? Doesn’t Java maintain all Strings as UTF-16?

By: Robert Hahn

Robert Hahn — Mon, 23 Jul 2012 00:02:00 +0000

Good points.

But it looks to me like so many of these problems can and should be automated away. getBytes() should default to UTF-8 encoding. Strings should automatically be escaped, so that you do extra work only if you DON’T want them escaped. The conventions of yesterday should not be the conventions of today.

By: John Cowan

John Cowan — Sun, 22 Jul 2012 20:30:27 +0000

I entirely agree, but my experience is that bugs like this always fall to the bottom of triage lists, since existing tests aren’t catching them and they produce no obvious bad behavior. So even though it would take only a brief time to get them fixed, they never have the priority.