In Praise of Draconian Error Handling, Part 1

I’m doing a bit of work on XOM, trying to optimize and improve some of the Unicode normalization code. A lot of this is autogenerated from the Unicode data files, and I’m actually working on the meta-code that parses those files and then generates the actual shipping code. In this code, I’m setting up a switch statement like this one:

       switch(i) {
          case 0:
            return result + "NOT_REORDERED";
          case 1:
            return result + "OVERLAY";
          case 7:
            return result + "NUKTA";
          case 8:
            return result + "KANA_VOICING";
          case 9:
            return result + "VIRAMA";
          case 202:
            return result + "ATTACHED_BELOW";
          case 216:
            return result + "ATTACHED_ABOVE_RIGHT";
          case 218:
            return result + "BELOW_LEFT";
          case 220:
            return result + "BELOW";
          case 222:
            return result + "BELOW_RIGHT";
          case 224:
            return result + "LEFT";
          case 226:
            return result + "RIGHT";
          case 228:
            return result + "ABOVE_LEFT";
          case 230:
            return result + "ABOVE";
          case 232:
            return result + "ABOVE_RIGHT";
          case 233:
            return result + "DOUBLE_BELOW";
          case 234:
            return result + "DOUBLE_ABOVE";
          case 240:
            return result + "IOTA_SUBSCRIPT";
          default:
            return result + "NOT_REORDERED";
        }

And then I stop myself. Do you see the bug? Actually it’s a meta bug that leads to the true bug.

The problem I’ve noticed is that I really have no justification for supplying a default value. If I parse the Unicode data files, and the Unicode consortium has added a new combining class, this code should fail instead of merely assigning that new class to an existing class. So I change the default block like so to throw an exception instead; not that this will ever happen of course. ๐Ÿ™‚

          default:
            throw new RuntimeException("New case! " + i);
        }

You can probably guess what happens next: I run the code and the very first time, it throw the exception. It seems I forgot to handle the case of 560. OK. Simple enough to fix. I check the data files to see what case 560 means. Strange, it doesn’t seem to be there. Where’s 560 coming from? And that’s when I realize the problem is much, much worse than I thought.

The real bug is that I’ve been treating hexadecimal constants from the Unicode data as decimal constants in my Java code without conversion. Almost everything the code is doing is wrong. It should look like this:

       switch(i) {
          case 0x0:
            return result + "NOT_REORDERED";
          case 0x1:
            return result + "OVERLAY";
          case 0x7:
            return result + "NUKTA";
          case 0x8:
            return result + "KANA_VOICING";
          case 0x9:
            return result + "VIRAMA";
          case 0x202:
            return result + "ATTACHED_BELOW";
          case 0x216:
            return result + "ATTACHED_ABOVE_RIGHT";
          case 0x218:
            return result + "BELOW_LEFT";
          case 0x220:
            return result + "BELOW";
          case 0x222:
            return result + "BELOW_RIGHT";
          case 0x224:
            return result + "LEFT";
          case 0x226:
            return result + "RIGHT";
          case 0x228:
            return result + "ABOVE_LEFT";
          case 0x230:
            return result + "ABOVE";
          case 0x232:
            return result + "ABOVE_RIGHT";
          case 0x233:
            return result + "DOUBLE_BELOW";
          case 0x234:
            return result + "DOUBLE_ABOVE";
          case 0x240:
            return result + "IOTA_SUBSCRIPT";
          default:
            throw new RuntimeException("New case! " + Integer.toHexString(i));
        }

Now the code runs and no classes are missing. But then I get a sinking feeling in the pit of my stomach. What about the other XOM file where I’m using these values? Are those hexadecimal or decimal? Damn it, they’re decimal!

OK. Don’t panic yet. They’re private named constants. The code has been shipping in production for 3+ years now, and no one’s reported a bug here. Maybe it doesn’t matter? It looks like most of the uses of these constants just involve equality comparisons of named constants, so the bug tends to cancel itself out. There are a couple of places where I compare a class to zero, but 0x0 == 0 so that’s safe. And there’s one line that gives me the willies where I compare to classes for relative size, but again if 0x230 > 0x154, then it’s also true that 230 > 154. Whew, that was a close one. I probably would have noticed this years ago except that by freak accident I was working with over 70 different hexadecimal constants, none of which used the digits A-F.

Wait a minute. Is that really plausible? Either someone deliberately set up those constants that way, or the bug isn’t what I thought it was. Maybe they aren’t really hexadecimal constants after all? But then where was that 560 coming from when the data files have 230? Wouldn’t you know it? I was parsing two values out of a line, treating both as hexadecimal, when one is hexadecimal and one isn’t. Does it really make sense to include both hexadecimal and decimal constants in the same line? But there’s not a lot I can do about that, so I change Integer.parseInt(m.group(2), 16) to Integer.parseInt(m.group(2)) and run the program again. At least this means the existing code is safe with its decimal constants.

The problem turns out not to have been as bad as I thought it was. However, it was still a real problem, and one I would not have noticed or found had I simply filled in a default value. The moral of the story is, don’t try to account for code errors you don’t expect. If something “can’t happen”, make sure you throw a RuntimeException when it does. Don’t just silently ignore it. Your intuition about what a program will do is vastly inferior to the experimental observation of what it does do.

Some programmers hate adding uncaught, unhandled runtime exceptions that may surface at unexpected times. (This one, at least, surfaced immediately.) The usual claim is that it’s unnecessary because the problem “can’t happen”. However, if it’s really true that the problem can’t happen, then it won’t happen, and the exception won’t harm anyone. And if they’re wrong and the problem does happen anyway, well then, at least the debugging will be a lot easier. Silent failure is still failure, and is much harder to diagnose and cure than screaming program crash due to an unexpected runtime exception.

I did have one advantage in this case. My program was a one-off piece of throw-away code run only by me. It has to give me the answer once, and then I don’t have to bother with it again. The only user is me, and I know what to do if it breaks. Code that has clients other than the programmer who wrote it, and especially non-programmer end users, may need to be more careful than this. It’s often wise to put a generic catch (RuntimeException ex) block or even a catch (Throwable ex) block in your main() method to more carefully handle unexpected exceptions. Such a block can log and report the problem, while still shielding the end user from the issue.

However an exception caused by a case that “can’t happen” but does is no different in this respect than a library bug or a null pointer exception from your own code. And it is still preferable to have the application fail as early, as fast, and as close to the actual bug as possible rather than silently ignore the failure until some other block of code fails as a result further down the line. Worst of all, the code could well run but generate incorrect data that might even be passed through multiple systems in multiple continents in multiple time zones and organizations before someone notices. Try debugging that some time.

8 Responses to “In Praise of Draconian Error Handling, Part 1”

  1. ed Says:

    For the record, I think it’s better to throw an Error in these situations, rather than a RuntimeException. YMMV.

    Sometimes users will have code with a ‘catch (Exception e)’ and an empty block, effectively ignoring all Exceptions, of which RuntimeException is a subclass. Throwing Error avoids this, although it can still get caught with ‘catch (Throwable t)’, so no amount of fool-proofing can avoid a truly dedicated fool.

    And I also suggest prefixing the toHexString() value with “0x”, to avoid propagating the ambiguity you fell prey to.

  2. Gareth Rees Says:

    On the general question of what to do in the event of a unanticipated error, surely the right approach depends on the application environment and intended group of users? In a development or testing environment, or when the users are sophisticated (e.g. your own test department, or programmers using your product), it’s absolutely the right thing to raise an exception in response to any unanticipated event. But in a production environment with ordinary users it’s essential that programs cope with unanticipated events. The archetypal case is that of embedded systems: when your car’s engine management system gets a bogus value from the fuel injection subsystem, it would be a bad idea to throw an exception and abort.

    I work in video games, where there are similar issues. If you get a bad result from a subsystem, you have to do something: it’s not acceptable just to crash or report a fatal error. Maybe the networking subsystem has gone wrong, but if the game continues to run then at least the player may be able to save their game.

    During testing, of course, it’s vital to be informed about unanticipated events, so the general technique we use looks like this:

    if (unanticipated_event) {
    if (testing) {
    assert(false, “Unexpected result: … details here …”);
    }
    // Do something sensible,
    // e.g. drop into the handling code for a different error,
    // or as a last resort just carry on as if nothing happened.
    }

    In the specific case of your XOM Unicode program, the highly repetitive structure looks really unfortunate. Is there nothing in Java to help you with this? In a language like Python or JavaScript with built-in hash tables you’d write something like:

    data = {
    0x0: “NOT_REORDERED”,
    0x1: “OVERLAY”,
    0x7: “NUKTA”,
    0x8: “KANA_VOICING”,
    0x9: “VIRAMA”,
    # etc …
    }
    try:
    return result + data[i]
    except KeyError:
    # Handle the unexpected key here.

    And in C or C++ you could use the relational macros technique.

  3. Gareth Rees Says:

    It would be nice if <pre> worked in comments. Also, it would be nice to have comment preview.

  4. John Cowan Says:

    Nothing in the Unicode data files is hexadecimal except the codepoint itself. In particular, the values of Canonical Combining Codes are definitely decimal.

  5. AndyT (lordpixel) Says:

    Gareth, I believe ERH’s overriding concern is parsing performance, and the switch wins.

    There are lots of alternatives in Java, but they all run more slowly.

  6. Elliotte Rusty Harold Says:

    Gareth, there’s an important distinction between problems caused by external conditions and problems caused by internal bugs. Programs need to be robust in the face of all possible external conditions. However they need to fail fast in the face of internal bugs. In Java this is the difference between checked and runtime exceptions.

    I may eventually move to a map structure if it seems necessary. The problem is the map is too large to embed in source code, close to 100,000 entries. Perhaps some other data structure may prove useful. If this isn’t fast enough, I’ll investigate. However the biggest concern right now is not code speed but code size. the current method skirts close to the boundaries of the maximum method size in Java, and not all tools can handle. This revision should reduce the method down to maybe a fifth of its current size and probaboly be faster to boot.

  7. Gareth Rees Says:

    they need to fail fast in the face of internal bugs

    I disagree with this claim. It depends on the environment, on the expertise of the users, and on the consequences of the various kinds of failure.

    Video games is an application area I know fairly well. Here it’s vital to “fail fast” in development, when you have a chance to analyze and fix the problem. But on a console, failing fast is almost always wrong. There’s nothing the player can do about it, and nothing much you could do about it even if the player could report the bug to you; the game completed its manufacturing run long ago and you have no way of issuing a patch.

    There’s almost always some way to handle an error that allows the game to continue running, or which minimizes the consequence of the error for the player. Even when there’s nothing you can do, often the game will be able to limp along despite the error. For example, in development we perform fencepost checking on all allocated memory blocks to detect buffer overruns and other memory corruption errors. But in release we don’t bother: there’s a chance that the undetected memory corruption is harmless, or merely overwrites part of a texture, resulting in graphic corruption but leaving the game still playable. And the worst that can happen is that the game crashes, which from the player’s point of view is no worse than the game throwing an exception.

    The decision might be different for PC games because there you do have the possibility of issuing a patch. But that’s exactly my point: the environment matters.

  8. Oleg Says:

    If you think that something will not happen than throw AssertionError instead of RuntimeException.

    The probability that you will accidentally mask AssertionError somewehre else in your code is much lower than the probability to accidentally mask RuntimeException.