Yes, You Still Need to Specify a Character Set in Java 18+

Lately I’ve heard developers claim that it’s now OK to omit the character set when creating an InputStreamReader or String, or otherwise converting bytes into characters, because Java (JDK 18 and later) now uses UTF-8 as its default character encoding regardless of platform.

Except we do still need to do it, for two independent reasons:

1. UTF-8 is still not guaranteed to be the character set these methods use at runtime. A JDK can be configured to use a different default character set, for example by setting the file.encoding system property when the JVM is launched. Bugs from an incorrect default character set will now be even harder to find, since they won’t reproduce on every system running a particular JDK.

2. Even if UTF-8 were guaranteed to be the character set these methods use at runtime, that wouldn’t make UTF-8 correct. The right encoding depends on the input you’re reading and the specification that governs it. Some formats and protocols specify UTF-8. Some specify ASCII or ISO 8859-1. A few specify UTF-16 or something else. A UTF-8 default does not make any particular file or stream magically UTF-8. You have to consider the context of each input source and choose the character encoding that is appropriate for that source.
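To make both points concrete, here is a minimal sketch. The class name ExplicitCharsetDemo and the helper readAll are made up for illustration; the only real APIs used are InputStreamReader, Charset, and StandardCharsets from the JDK. It decodes the same bytes twice, once with the charset that matches the data and once with UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    // Hypothetical helper: decodes the whole stream with the charset the
    // caller names explicitly. It never falls back to the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Reason 1: the default is whatever this JVM was configured with.
        // It is UTF-8 out of the box on JDK 18+, but it can be overridden,
        // for example with -Dfile.encoding=COMPAT.
        System.out.println("Default charset here: " + Charset.defaultCharset());

        // Reason 2: these bytes are ISO 8859-1, so UTF-8 is simply wrong
        // for them. The e-acute (written as an escape) is the single byte 0xE9.
        byte[] latin1Bytes = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        String right = readAll(new ByteArrayInputStream(latin1Bytes),
                StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(latin1Bytes),
                StandardCharsets.UTF_8);

        // Only the decode that matches the data round-trips the text.
        System.out.println("ISO 8859-1 decode matches: " + right.equals("caf\u00E9"));
        System.out.println("UTF-8 decode matches:      " + wrong.equals("caf\u00E9"));
    }
}
```

If readAll took no Charset argument and fell back to the default, which of the two results you got would depend on how the JVM was launched rather than on the data.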

We know from decades of experience that default character sets are unsafe and buggy. The safest approach is to provide higher-level libraries that accept only byte streams as input and perform character set conversion themselves, according to the relevant spec. This is how JSON and XML parsers usually operate. But that’s not always possible, and when it isn’t, the most secure and bug-resistant API requires developers to think about their choice of character encoding and make it explicit.
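As a rough sketch of the byte-streams-only idea, imagine a reader for some format whose spec mandates UTF-8. The class Utf8OnlyFormatReader and its read method are hypothetical, not part of any real library. The point is the shape of the API: it accepts an InputStream, offers no Reader or String overload, and decodes strictly so that input in the wrong encoding is rejected instead of silently corrupted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for a format that (by assumption) requires UTF-8.
final class Utf8OnlyFormatReader {

    private Utf8OnlyFormatReader() {}

    // Accepts bytes only. The caller cannot pick a charset, and the
    // platform default is never consulted.
    static String read(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Report bad input as an exception instead of substituting
                // replacement characters.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        byte[] bytes = in.readAllBytes();
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        String text = read(new ByteArrayInputStream(utf8));
        System.out.println("Valid UTF-8 accepted: " + text.equals("caf\u00E9"));

        // The same text in ISO 8859-1 is not valid UTF-8 and is rejected.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        try {
            read(new ByteArrayInputStream(latin1));
        } catch (CharacterCodingException e) {
            System.out.println("Rejected non-UTF-8 input: " + e);
        }
    }
}
```

A real parser would usually decode incrementally rather than buffering the whole stream, and a format like XML would first inspect the bytes to detect the declared encoding. The sketch only shows where the decision lives: inside the library, driven by the spec.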
