In my Human Factors in API Design presentation at Architecture & Design World this past week, I claimed that classic optimization is rarely necessary. Pulling operations outside of loops or reducing the number of operations in a method rarely has any noticeable effect on performance.* Most real performance problems come from doing too many fundamentally slow operations; for instance, writing to the disk or reading from the network.
For example, you don’t want to open and close a database connection for every operation. Even on a LAN, that can easily hit you with one or two seconds (not milliseconds but seconds) of overhead per call. Do that a few hundred times and suddenly you’ve got an unusably slow application. Instead you need to:
- Cache and reuse the database connection(s) rather than constantly opening and closing new connections.
- Figure out how to reduce the number of database queries your application makes.
Most programmers who write database facing applications already know all this. There are numerous frameworks designed to make this sort of optimization automatically. That’s what a lot of middleware is about. Programmers who work with databases have either learned this lesson or involuntarily changed careers. It’s that important.
However, recently I’ve realized that another field has just as big a problem with network overhead as do database apps. However in this field the lesson does not seem to have been as widely learned. That field is backup software.
The problem, I suspect, is that much (not all) backup software was originally designed for local devices only. Thus there’s an implicit assumption that device access is pretty fast. In particular, disks are fast. Some optimization is done for tape drives, but very little for disks. Network backup is added as an afterthought. However, the network disks are treated as just local disks; and that’s what kills performance.
This software makes the classic mistake of abstracting out the network. It assumes you can pretend that a network mounted disk is just the same as a local disk, and that isn’t true. While you can get away with this simplification in programs that don’t do a lot of network transfer, backup software moves gigabytes of data and hundreds of thousands of files. It needs to understand and optimize for what it’s doing. In particular it needs to do two things:
- Cache and reuse the network sockets rather than constantly opening and closing new connections.
- Figure out how to reduce the number of network queries the application makes.
Making efficient use of finite bandwidth
The importance of number 1 depends heavily on how the network disks are mounted. For instance, NFS uses UDP rather than TCP and is thus more efficient for large operations over reliable LANs. However it’s still not an ideal protocol for the transfer of large volumes of data. Neither are FTP, HTTP, or most other common network protocols that are designed to transfer a few files at a time. If you’re going transfer the entire contents of disks, then you need to start thinking about buying, borrowing, or inventing a network protocol designed to do exactly that. You can’t just piggyback on top of the operating system’s built-in file sharing and expect decent performance. You need something that’s going to run close to the network’s theoretical maximum speed, not a protocol that’s going to get tied up in setting up and tearing down connections.
Because protocols like this aren’t built into most operating systems, you must install software on both ends of the connection. A backup server alone is insufficient. You need to run the backup protocol on both the clients and the server. Any backup software that is only installed on one end of the connection is guaranteed to be much slower than it could be.
Reducing network connections
The second task is to reduce the number of network queries the application makes. The first thing this means is that you reuse sockets where possible. Don’t open a new socket for each file. (Better yet: don’t use sockets at all. UDP is probably the better protocol for this purpose, at least if you’re planning to backup over a LAN. Over the public Internet, TCP may perform better.) Most importantly don’t open a new socket for every bit of metadata you collect, be it file name, directory contents, file modification dates, or anything else. All necessary metadata about the client should be collected on the client, bundled up into a single document, and sent to the server with one network connection.2
Also very important is to perform everything as locally as possible. Do not transfer data you don’t have to. For example, critical backups need to be verified. The data in the backup should be compared to the data in the source file. The user should be warned of any differences. Most backup software offers this option, but most backup software does it by comparing the files bit-by-bit. This means the entire file contents have to be transferred twice: once for the backup and once for the verification. This doubles the network traffic. Not good.
The right way to do network verification is to have the client calculate a checksum for its files locally, and send only the checksum to the server. The server can compare that checksum to what it sees on its local disks, and request a resend of the file or alert the user only if the checksums don’t match. For the vast majority of the cases where the file was transferred and stored sucessfully, the file is only sent once.
Now notice: this only applies to network backups. When you’re backing up local files to local devices, there’s no major reason to do this; but it’s critical for network backups. It’s a clear example of a case where abstracting out the network, and pretending network and local files are the same kills performance.
Some software does get this right. Usually this is the high end stuff. For instance, Dantz’s Retrospect operates this way. However, a lot of backup software doesn’t. In particular, you need to be wary of any software that uses the operating system’s built-in network mount points. That’s just not good enough, and no software that does that can perform adequately.
There is a third optimization that backup software can use, though this is really more of a user perception trick than a true speed up. Nonetheless this may be important, especially for synchronization software such as Chronologic. Don’t use multistage backups. Most software operates in this order:
- Figure out which files need to be backed up
- Back them up
- Verify them
For large backups, step one alone can take several minutes to an hour or more. If either the server or the client shuts down or goes to sleep before the initial scanning is complete, nothing has been accomplished.
To some extent, this reflects the linear access of traditional tape drives. However, relatively few of us are still backing up to tape. Random access backups to disk are much more common. In this case, it makes sense to interleave these three tasks. As soon as the software notices a file needs to be backed up, it should back it up. Then it should verify it. Usually you can run these three tasks in separate threads. Since the backup thread is network-bound and the other two threads are disk bound, there’s enough CPU speed to go around; and this might even be an absolute increase in performance. However, even if it’s not, there’s still a visible improvement in user perceived performance.
Network backups are a slow and painful process. Programmers of backup software should do everything they can to alleviate users’ pain, and thus encourage more frequent backups. In particular programmers should not put their own convenience ahead of the users’. Network mounted file systems are not the same as local file systems, and should not be treated as such. When writing software that transfers gigabytes of data and hundred of thousands of files, don’t reuse protocols that were never designed to handle this. Distinguish between network and local backups, and optimize the software to perform well with the task it’s given.
1 Of course that’s not all you need to do. You should also make sure the database itself is properly indexed and optimized for the common queries. You also need to make sure that logic is properly divided between the database and the application so that Java/C/C# does what a classic application is good at, and SQL does what SQL is good at.
2 After the presentation, Microsoft’s Krzysztof Cwalina told me low level optimization still matters in system software. I can believe that if you’re writing operating system software that will be used by billions of people; but let’s face it. Most of us aren’t writing software like that.
3 That’s probably a little too extreme a position. If you send everything with one network connection, you can’t start backing up the first file until all metadata has been collected. Instead I’d suggest collecting data on a couple of thousand or so files at a time, and sending that in one connection. Alternately you can wait until the server has run out of data to back up, and then let it ask the client to send as much as the client has accumulated up to that point. However, any way you slice it, you have to be very careful about each call to methods like
isReadable(). In some naive network file systems, each one of these is likely to result in a new network request.