1% Problems

I hate 1% problems. No, this isn’t an OWS slogan. I’m thinking of those code issues that really aren’t a problem 99% of the time, but when they bite, they’re really hard to debug and they cause real pain. Several common cases in Java:

1. Using java.util.Date or java.util.Calendar instead of JodaTime.
2. Not specifying a Locale when doing language-sensitive operations such as toLowerCase() and toUpperCase().
3. Not escaping strings passed to SQL, XML, HTML or other external formats.
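
The Locale case is the easiest to reproduce. Here is a minimal sketch, assuming the code ends up running under a Turkish default locale; the class and method names are made up for illustration. The input is pure ASCII, and the result still changes.

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Lowercases with an explicit locale. The no-argument toLowerCase()
    // behaves the same way, but with Locale.getDefault() as the locale.
    public static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // In Turkish, capital I lowercases to dotless i (U+0131),
        // so the result is no longer the ASCII string "title".
        System.out.println(lower("TITLE", turkish).equals("title"));     // false

        // Locale.ROOT gives locale-neutral behavior for identifiers,
        // keys, protocol tokens and the like.
        System.out.println(lower("TITLE", Locale.ROOT).equals("title")); // true
    }
}
```

For strings that are keys or identifiers rather than text shown to a person, passing Locale.ROOT (or Locale.ENGLISH) makes the result independent of the machine the code runs on.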

What I hate most is that it’s really, really hard to convince other developers that these are problems they should take seriously. The excuses are common:

“No, I don’t have to specify a locale here because the strings are ASCII.”

“I’m only getting a timestamp; I don’t need a proper timezone.”

“The data we’re encoding is coming from a web service we control, and we know it’s not going to send us any formfeeds or null characters.”

“This string is a constant so we clearly don’t need to escape it”, and so on.

All these answers reduce to, “yes, there’s sort of a theoretical problem here; and maybe FindBugs is complaining; but it doesn’t really matter in this case, and I’ve got more important things to spend my time on.”

And you know what? The naysayers are right, 99% of the time. The problem is that every one of these issues can bite badly that 1% of the time, and it’s usually not obvious when you’re in a 1% case. For instance, just because the string being used to construct an HTML attribute value today is a literal doesn’t mean it won’t be refactored into a variable next year, and then a variable built from user input a year later. Suddenly there’s an XSS vulnerability in your code that two years ago everyone agreed clearly couldn’t happen, and thus no effort was put into preventing it.
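
To make the attribute example concrete, here is a rough sketch. The escapeAttribute and titledSpan helpers are hypothetical and hand-rolled purely for illustration; in real code a vetted encoding library is the safer choice. The point is that the escaping call costs nothing while the value is a literal, and is already in place when it stops being one.

```java
public class HtmlAttributeDemo {
    // Escapes the characters that could end a quoted attribute value
    // or start new markup. Illustrative only, not a complete encoder.
    public static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Builds an element whose title attribute is always escaped,
    // whether the caller passes a literal or user input.
    public static String titledSpan(String title) {
        return "<span title=\"" + escapeAttribute(title) + "\">...</span>";
    }

    public static void main(String[] args) {
        // Today: a harmless literal, output is unchanged by escaping.
        System.out.println(titledSpan("Help"));
        // Two refactorings later: user input that tries to break out
        // of the attribute is neutralized instead of injected.
        System.out.println(titledSpan("\" onmouseover=\"alert(1)"));
    }
}
```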

Worse yet, although these problems are very easy to spot at the source code level (indeed, they can often be detected automatically by tools such as PMD or FindBugs), it’s usually not obvious what the cause of the problem is once it does manifest itself. For instance, out of all the myriad reasons a SOAP call might be consistently failing, is the possibility that the data contains an invisible form feed character the first thing that comes to mind?

I have seen major production problems caused by every one of these (#2 just this past week, and #3 the week before), and every one of them more than once. In the case of the failure to properly escape web service input before generating XML, the bug had lived in the code for years before an errant form feed showed up in the data stream and cost several engineer-days of effort to understand and fix.
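
The form feed failure can be reproduced in a few lines. This is a sketch under my own assumptions, not the code from that incident: the class, the method names and the sample data are invented. A form feed (U+000C) is not a legal character in XML 1.0, so a document built by pasting it in is rejected by the JDK's parser; the hypothetical stripInvalidXmlChars helper shows one possible repair, which is to drop such characters before generating the XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class FormFeedDemo {
    // Returns true if the JDK's XML parser accepts the text as well-formed.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Keeps only code points allowed by the XML 1.0 Char production.
    public static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
            .filter(cp -> cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String data = "Smith\fJones"; // looks fine in most logs and UIs
        System.out.println(isWellFormed("<name>" + data + "</name>")); // false
        System.out.println(
            isWellFormed("<name>" + stripInvalidXmlChars(data) + "</name>")); // true
    }
}
```

The parser may also print a fatal error message to standard error for the first document; that is the default error handler at work, not a failure of the sketch.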

These aren’t hard or costly problems to prevent or fix, if we just develop good coding habits. Anytime you see a SQL statement built by string concatenation, alarm bells ought to be sounding. Anytime you see getBytes() invoked on a string without specifying a character set, you shouldn’t have to think twice about changing it to getBytes(Charsets.UTF_8). Anytime you see java.util.Date or java.util.Calendar in code, you should know that something is likely to go wrong, and probably at the worst possible time.
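
A small sketch of two of these habits follows. It makes some assumptions: I use the JDK's own java.nio.charset.StandardCharsets (Java 7 and later) rather than a third-party Charsets class, the class and method names are invented, and the Calendar behavior shown is only one of the commonly cited traps, not a full catalog.

```java
import java.nio.charset.StandardCharsets;
import java.util.Calendar;
import java.util.GregorianCalendar;

public class HabitsDemo {
    // Encodes with an explicit charset, so the bytes do not depend on
    // the platform default the way the no-argument getBytes() can.
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns {year, month} as Calendar reports them. Calendar months
    // are zero-based and calendars are lenient by default, so month 12
    // silently rolls over into January of the following year.
    public static int[] yearAndMonth(int year, int month, int day) {
        Calendar c = new GregorianCalendar(year, month, day);
        return new int[] { c.get(Calendar.YEAR), c.get(Calendar.MONTH) };
    }

    public static void main(String[] args) {
        byte[] bytes = utf8("\u00fc");     // u with diaeresis
        System.out.println(bytes.length);  // 2 (0xC3 0xBC)

        // Someone meant "December 25, 2011" and typed 12 for the month.
        int[] ym = yearAndMonth(2011, 12, 25);
        System.out.println(ym[0] + " " + ym[1]); // 2012 0 (January 2012)
    }
}
```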

It’s like seeing a large stack of heavy boxes piled in front of an emergency exit. You don’t have to think about it, estimate the risk of fixing it compared to the risk of leaving it as is, file bug reports, or prioritize it compared to everything else you have to do. You just fix it as quickly as you can. These are dangerous situations; they’re easy to spot; and as professionals we have a duty to fix them when we find them and not to cause them in the first place.

6 Responses to “1% Problems”

1. John Cowan Says:

I entirely agree, but my experience is that bugs like this always fall to the bottom of triage lists, since existing tests aren’t catching them and they produce no obvious bad behavior. So even though it would take only a brief time to get them fixed, they never have the priority.

2. Robert Hahn Says:

Good points.

But it looks to me like so many of these problems can and should be automated away. getBytes() should default to UTF-8 encoding. Strings should be escaped automatically, with extra work required only if you DON’T want them escaped. The conventions of yesterday should not be the conventions of today.

3. Sony Mathew Says:

Perhaps you could elaborate on why Calendar and Date are error prone? Also why UTF-8 must be specified? Doesn’t Java maintain all Strings as UTF-16?

4. Werner Slosse Says:

Developing client-server applications, whether RMI, CORBA, SOAP, etc is used, and/or uploading/downloading files in between client and server, I encounter the getBytes() and new String(byte[]) problem regularly. When not specifying the character set encoding (and not specified/overridden in JRE or cmd-line settings or args), the JRE takes the default system encoding. The default on a (Western-European) WinXP is windows-1252, on Unix/Linux it often is ISO-8859-1 (to name two, not in a particular order). Having files even worsens the case, as reading/writing files in text mode (Reader/Writer) has the same “problem”: the default encoding.

Java String is Unicode, for sure. But, whenever converting from bytes to text (and vice versa), it’s a good idea to know what the encoding should be, and to apply it. A simple test: write a file on WinXP (or Vista/7) in Notepad, save it, and open in IE (or another browser). In the browser one can often explicitly change the encoding, and check what text is shown. Do you actually know what your default OS encoding is? “Latin 1” is not the same on all systems; ISO-8859-1 is not the same as windows-1252. Note the differences in the C1 range (codepoints 0x80 to 0x9F).
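
The C1 difference can be seen by decoding a single byte both ways. This is only a sketch with made-up class and method names; it assumes the windows-1252 charset is available in the running JRE, which is normally the case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class C1RangeDemo {
    // Decodes one byte using the given charset.
    public static String decode(int unsignedByte, Charset charset) {
        return new String(new byte[] { (byte) unsignedByte }, charset);
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");

        // 0x80 is the euro sign (U+20AC) in windows-1252 ...
        System.out.println(decode(0x80, windows1252).equals("\u20ac"));

        // ... but an invisible C1 control character (U+0080) in ISO-8859-1.
        System.out.println(
            decode(0x80, StandardCharsets.ISO_8859_1).equals("\u0080"));

        // Outside the C1 range the two agree, e.g. 0xFC is u-umlaut in both.
        System.out.println(decode(0xFC, windows1252)
            .equals(decode(0xFC, StandardCharsets.ISO_8859_1)));
    }
}
```

All three lines print true, which is exactly why the mismatch goes unnoticed until a byte in the 0x80 to 0x9F range turns up.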

Besides the byte/text problem, there’s an additional layer: the font. Even if the bytes/text/encoding are correct, it may not show correctly on screen, because the font used by the (G)UI does not know the [glyph for the] character. On the other side of the spectrum, there may be a database. And sure enough, a database engine/database/table/column has a character set encoding. E.g. running with a default windows-1252, storing a Czech character into a database may not be possible. When UTF-8 is used as encoding on the DB, the column size may not suffice (some databases will interpret a varchar 16 as 16 bytes, not 16 characters; so if you want to store 16 non-US-ASCII chars in a varchar column, size should be around 3..4 times 16).
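
The column size arithmetic is easy to check. The sketch below uses hypothetical names and sample strings and simply counts UTF-8 bytes for 16-character values; whether a given database counts a varchar size in bytes or in characters is something to verify for that database.

```java
import java.nio.charset.StandardCharsets;

public class ColumnSizeDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Repeats a string n times (avoids String.repeat, which needs Java 11).
    public static String repeat(String s, int n) {
        StringBuilder out = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            out.append(s);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String ascii = repeat("A", 16);       // 16 US-ASCII letters
        String czech = repeat("\u010d", 16);  // 16 times c with caron
        String euros = repeat("\u20ac", 16);  // 16 euro signs

        System.out.println(utf8Length(ascii)); // 16
        System.out.println(utf8Length(czech)); // 32
        System.out.println(utf8Length(euros)); // 48
    }
}
```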

All of the above results in my often requesting Wireshark logs instead of application logs or UI screenshots. I want to see the raw bytes of the data in between client and server. If the bytes are correct, the data is correct (e.g. payload in a CORBA message or an XML message). If there’s still a problem, it’s in the interpretation of those bytes.

Indeed, in that 1% where there’s an error in that area, it’s difficult to track down. You assume an ID (input text) is US-ASCII, and all of a sudden an ID contains ü, causing trouble in the client or server application.

5. John Cowan Says:

Explicit column sizes are a database smell. I don’t know of a single RDBMS that stores fixed-length strings differently from variable-length strings, so the only effect of declaring a fixed-length string as a table column is that you will sooner or later get into trouble. Just say VARCHAR(65535) or whatever the upper limit of your database is, and save yourself a world of annoyance.

6. Martin Valjavec Says:

Why is getBytes(Charsets.UTF_8) better than getBytes()? In my experience the real problem is that the developer often does not know what’s correct: it often is not even defined what should be correct. In that case getBytes() – the system default, whatever it is at execution time – might be the correct encoding to be used and UTF-8 might be wrong … or maybe not. Who defines this? Perhaps not the programmer. And who defines how to find out who defines this? In some organizations: nobody.

So this is not a “programmer’s only” problem.