How to write network backup software: a lesson in practical optimization
In my Human Factors in API Design presentation at Architecture & Design World this past week, I claimed that classic optimization is rarely necessary. Pulling operations outside of loops or reducing the number of operations in a method rarely has any noticeable effect on performance.2 Most real performance problems come from doing too many fundamentally slow operations; for instance, writing to the disk or reading from the network.
For example, you don’t want to open and close a database connection for every operation. Even on a LAN, that can easily hit you with one or two seconds (not milliseconds but seconds) of overhead per call. Do that a few hundred times and suddenly you’ve got an unusably slow application. Instead you need to:
- Cache and reuse the database connection(s) rather than constantly opening and closing new connections.
- Figure out how to reduce the number of database queries your application makes.1
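As a minimal sketch of the first point, assuming plain JDBC (the class name and query here are hypothetical, not from any particular framework), the connection is opened once and reused for every lookup instead of being opened and closed per call:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical DAO that caches one connection instead of opening a new one per query.
public class CustomerLookup implements AutoCloseable {

    private final Connection connection; // opened once, reused for every call

    public CustomerLookup(String jdbcUrl, String user, String password) throws SQLException {
        this.connection = DriverManager.getConnection(jdbcUrl, user, password);
    }

    // Each call reuses the cached connection; only the statement is per-call.
    public String findName(int customerId) throws SQLException {
        try (PreparedStatement stmt =
                 connection.prepareStatement("SELECT name FROM customers WHERE id = ?")) {
            stmt.setInt(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }

    @Override
    public void close() throws SQLException {
        connection.close(); // pay the connection-setup cost once, not per query
    }
}
```

A connection pool generalizes this to several cached connections; that’s part of what the middleware mentioned below does for you.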
Most programmers who write database-facing applications already know all this. There are numerous frameworks designed to handle this sort of optimization automatically. That’s what a lot of middleware is about. Programmers who work with databases have either learned this lesson or involuntarily changed careers. It’s that important.
However, I’ve recently realized that another field has just as big a problem with network overhead as database apps do, yet there the lesson does not seem to have been as widely learned. That field is backup software.
The problem, I suspect, is that much (not all) backup software was originally designed for local devices only. Thus there’s an implicit assumption that device access is pretty fast; in particular, that disks are fast. Some optimization is done for tape drives, but very little for disks. Network backup is added as an afterthought, and the network disks are treated as just local disks; that’s what kills performance.
This software makes the classic mistake of abstracting out the network. It assumes you can pretend that a network mounted disk is just the same as a local disk, and that isn’t true. While you can get away with this simplification in programs that don’t do a lot of network transfer, backup software moves gigabytes of data and hundreds of thousands of files. It needs to understand and optimize for what it’s doing. In particular it needs to do two things:
- Cache and reuse the network sockets rather than constantly opening and closing new connections.
- Figure out how to reduce the number of network queries the application makes.
Sound familiar?
Making efficient use of finite bandwidth
The importance of number 1 depends heavily on how the network disks are mounted. For instance, NFS has traditionally used UDP rather than TCP and is thus more efficient for large operations over reliable LANs. However, it’s still not an ideal protocol for the transfer of large volumes of data. Neither are FTP, HTTP, or most other common network protocols that are designed to transfer a few files at a time. If you’re going to transfer the entire contents of disks, then you need to start thinking about buying, borrowing, or inventing a network protocol designed to do exactly that. You can’t just piggyback on top of the operating system’s built-in file sharing and expect decent performance. You need something that’s going to run close to the network’s theoretical maximum speed, not a protocol that’s going to get tied up in setting up and tearing down connections.
Because protocols like this aren’t built into most operating systems, you must install software on both ends of the connection. A backup server alone is insufficient. You need to run the backup protocol on both the clients and the server. Any backup software that is only installed on one end of the connection is guaranteed to be much slower than it could be.
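To make that concrete, here is a very rough sketch (not any real product’s protocol) of a client that streams many files over one long-lived TCP connection, with an invented length-prefixed framing, so connection setup is paid once per batch rather than once per file:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: stream many files over ONE long-lived connection instead of one connection per file.
// The framing (name length, name, payload length, payload) is invented for illustration.
public class BulkSender {

    public static void sendAll(String host, int port, List<Path> files) throws IOException {
        try (Socket socket = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream())) {
            for (Path file : files) {
                byte[] name = file.toString().getBytes(StandardCharsets.UTF_8);
                byte[] data = Files.readAllBytes(file); // fine for a sketch; stream in real code
                out.writeInt(name.length);
                out.write(name);
                out.writeLong(data.length);
                out.write(data);
            }
            out.writeInt(-1); // end-of-stream marker so the server knows the batch is complete
        }
    }
}
```

The receiving end would need matching software that understands the same framing, which is exactly why a backup server alone is insufficient.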
Reducing network connections
The second task is to reduce the number of network queries the application makes. The first thing this means is that you should reuse sockets where possible. Don’t open a new socket for each file. (Better yet: don’t use TCP sockets at all. UDP is probably the better protocol for this purpose, at least if you’re planning to back up over a LAN. Over the public Internet, TCP may perform better.) Most importantly, don’t open a new socket for every bit of metadata you collect, be it file name, directory contents, file modification dates, or anything else. All necessary metadata about the client should be collected on the client, bundled up into a single document, and sent to the server with one network connection.3
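A minimal sketch of that client-side pass, assuming the standard java.nio.file API; the FileMeta fields and the ManifestBuilder name are mine, not any product’s:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: collect all file metadata locally in one pass, then send it as one document.
// One local walk replaces thousands of per-file round trips to the server.
public class ManifestBuilder {

    // Minimal per-file record; real software would add permissions, ownership, checksums, etc.
    public record FileMeta(String path, long size, long lastModifiedMillis) {}

    public static List<FileMeta> collect(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> {
                            try {
                                return new FileMeta(
                                    p.toString(),
                                    Files.size(p),
                                    Files.getLastModifiedTime(p).toMillis());
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .collect(Collectors.toList());
        }
    }
    // The resulting list would then be serialized (XML, JSON, whatever the protocol uses)
    // and sent to the server in a single request.
}
```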
Also very important is to do as much of the work as possible locally. Do not transfer data you don’t have to. For example, critical backups need to be verified: the data in the backup should be compared to the data in the source file, and the user should be warned of any differences. Most backup software offers this option, but most backup software does it by comparing the files bit by bit. This means the entire file contents have to be transferred twice: once for the backup and once for the verification. That doubles the network traffic. Not good.
The right way to do network verification is to have the client calculate a checksum for its files locally, and send only the checksum to the server. The server can compare that checksum to what it sees on its local disks, and request a resend of the file or alert the user only if the checksums don’t match. For the vast majority of cases where the file was transferred and stored successfully, the file is only sent once.
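For instance, the client-side half might look something like this sketch; the choice of SHA-256 and the streaming loop are assumptions for illustration, since the article doesn’t prescribe an algorithm:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch: compute a checksum locally so only a few dozen bytes cross the network for
// verification, instead of the whole file a second time.
public class LocalChecksum {

    public static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        // The server computes the same value over its copy; a mismatch triggers a resend.
        return HexFormat.of().formatHex(digest.digest());
    }
}
```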
Now notice: this only applies to network backups. When you’re backing up local files to local devices, there’s no major reason to do this; but it’s critical for network backups. It’s a clear example of a case where abstracting out the network, and pretending that network and local files are the same, kills performance.
Some software does get this right. Usually it’s the high-end stuff. For instance, Dantz’s Retrospect operates this way. However, a lot of backup software doesn’t. In particular, you need to be wary of any software that uses the operating system’s built-in network mount points. That’s just not good enough, and no software that does that can perform adequately.
Multithreading
There is a third optimization that backup software can use, though this is really more of a user-perception trick than a true speed-up. Nonetheless this may be important, especially for synchronization software such as ChronoSync. Don’t use multistage backups. Most software operates in this order:
- Figure out which files need to be backed up
- Back them up
- Verify them
For large backups, step one alone can take several minutes to an hour or more. If either the server or the client shuts down or goes to sleep before the initial scanning is complete, nothing has been accomplished.
To some extent, this reflects the linear access of traditional tape drives. However, relatively few of us are still backing up to tape. Random-access backups to disk are much more common. In this case, it makes sense to interleave these three tasks. As soon as the software notices a file needs to be backed up, it should back it up. Then it should verify it. Usually you can run these three tasks in separate threads. Since the backup thread is network-bound and the other two threads are disk-bound, there’s enough CPU speed to go around; and this might even be an absolute increase in performance. However, even if it’s not, there’s still a visible improvement in user-perceived performance.
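One way to wire up such a pipeline is sketched below with plain threads and bounded queues; the method names scan, copyToServer, and verify are placeholders for the real work:

```java
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the interleaved pipeline: scanner, copier, and verifier run concurrently,
// handing work to each other through bounded queues. A file can be backed up (and verified)
// as soon as the scanner finds it, instead of waiting for the full scan to finish.
public class BackupPipeline {

    private static final Path POISON = Path.of(""); // sentinel marking end of stream

    private final BlockingQueue<Path> toCopy   = new LinkedBlockingQueue<>(1000);
    private final BlockingQueue<Path> toVerify = new LinkedBlockingQueue<>(1000);

    public void run(Path root) throws InterruptedException {
        Thread scanner = new Thread(() -> {
            try {
                scan(root, toCopy);          // enqueue files that need backing up
                toCopy.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread copier = new Thread(() -> {
            try {
                for (Path p; (p = toCopy.take()) != POISON; ) {
                    copyToServer(p);         // network-bound
                    toVerify.put(p);
                }
                toVerify.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread verifier = new Thread(() -> {
            try {
                for (Path p; (p = toVerify.take()) != POISON; ) {
                    verify(p);               // disk-bound checksum comparison
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        scanner.start(); copier.start(); verifier.start();
        scanner.join(); copier.join(); verifier.join();
    }

    // Placeholders for the real work; these names are invented for the sketch.
    private void scan(Path root, BlockingQueue<Path> out) throws InterruptedException { /* ... */ }
    private void copyToServer(Path file) { /* ... */ }
    private void verify(Path file) { /* ... */ }
}
```

Because the three stages are bound by different resources, interleaving them this way wastes far less time than running them strictly one after another.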
In conclusion
Network backups are a slow and painful process. Programmers of backup software should do everything they can to alleviate users’ pain, and thus encourage more frequent backups. In particular, programmers should not put their own convenience ahead of the users’. Network-mounted file systems are not the same as local file systems, and should not be treated as such. When writing software that transfers gigabytes of data and hundreds of thousands of files, don’t reuse protocols that were never designed to handle this. Distinguish between network and local backups, and optimize the software to perform well with the task it’s given.
1 Of course that’s not all you need to do. You should also make sure the database itself is properly indexed and optimized for the common queries. You also need to make sure that logic is properly divided between the database and the application so that Java/C/C# does what a classic application is good at, and SQL does what SQL is good at.
2 After the presentation, Microsoft’s Krzysztof Cwalina told me that low-level optimization still matters in system software. I can believe that if you’re writing operating system software that will be used by billions of people; but let’s face it: most of us aren’t writing software like that.
3 That’s probably a little too extreme a position. If you send everything with one network connection, you can’t start backing up the first file until all metadata has been collected. Instead I’d suggest collecting data on a couple of thousand or so files at a time, and sending that in one connection. Alternately you can wait until the server has run out of data to back up, and then let it ask the client to send as much as the client has accumulated up to that point. However, any way you slice it, you have to be very careful about each call to methods like getFileName() or isReadable(). In some naive network file systems, each one of these is likely to result in a new network request.
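A rough sketch of that chunked approach, with an invented BackupConnection interface standing in for whatever transport the real backup protocol uses:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Sketch: gather file metadata in batches of a few thousand entries and ship each batch
// in one request, so the server can start backing up before the full scan finishes.
public class BatchedManifestSender {

    private static final int BATCH_SIZE = 2000;

    // Invented stand-in for the backup protocol's transport.
    public interface BackupConnection {
        void sendMetadataBatch(List<Path> files) throws IOException;
    }

    public static void send(Path root, BackupConnection server) throws IOException {
        List<Path> batch = new ArrayList<>(BATCH_SIZE);
        try (Stream<Path> paths = Files.walk(root)) {
            Iterator<Path> files = paths.filter(Files::isRegularFile).iterator();
            while (files.hasNext()) {
                batch.add(files.next());
                if (batch.size() == BATCH_SIZE) {
                    server.sendMetadataBatch(batch);   // one round trip per ~2000 files
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
        }
        if (!batch.isEmpty()) {
            server.sendMetadataBatch(batch);           // final partial batch
        }
    }
}
```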
July 22nd, 2006 at 12:43 pm
UDP? Just Say No!
Really, UDP is a terrible protocol for bulk transfer. You basically have three choices:
1) Send out everything as fast as you can with no waiting for acknowledgements. This is fine over a dedicated LAN or serial link, where you don’t care if you saturate it. It is unbelievably awful over an ordinary LAN, because other traffic will not be able to get through.
2) Ping-pong protocol: send a packet, wait for an ack, send a packet …. That’s way too slow. The application gets involved too often, especially at the receiving end.
3) Try to do better than either of these by being clever. You will end up reinventing TCP, badly.
All of these things were discovered almost as soon as Ethernet was invented, before there was even a way to do TCP/IP over Ethernet, using simple MAC-level protocols. TCP is a subtle and clever protocol that has been designed for this job over many iterations.
UDP is meant for one-shot notification or request/reply actions. Even DNS, the prototypical UDP protocol, uses TCP in order to do bulk transfer between primary and secondary servers. The only time you want to do bulk transfer over UDP is when you are trying to boot over the network using a crowded boot ROM that simply can’t implement the whole TCP stack, and then you use a ping-pong UDP protocol (TFTP) because you don’t care very much about performance when booting.
July 25th, 2006 at 7:36 pm
There is existing software that uses essentially a checksum comparison to test for data equality across a network (rather than moving and comparing the actual data). But what about those instances, rare as they might be, when the checksums of two different pieces of data are equal (e.g., MD5 collisions)? Checksums are good for catching accidental data corruption, but can they be a truly robust means of comparing data?
(ps: I have a practical interest in this, since I do exactly this.)
July 26th, 2006 at 10:29 am
There is a famous paper called “End-to-end arguments in system design” that addresses this very issue. Wikipedia has a nice summary and links: http://en.wikipedia.org/wiki/End-to-end_principle
July 30th, 2006 at 10:30 am
Michael,
The rare instances of equal checksums but unequal data are covered under the collision statistics of your checksum algorithm. For MD5, SHA-1, SHA-256, CRC, etc., refer to the relevant RFCs.
If a collision randomly occurs once in 2**80 operations, that’s still many orders of magnitude larger than the block-error rate of your hard drive. Since hard drives have integrity checks (i.e. CRCs) on the data of each raw block, a bad block (incorrect data) that passes the CRC integrity check (appears good) is a far more likely occurrence. If such a block randomly appeared in a data comparison, you’d back it up as the latest change to the file.
Even though MD5 or other algorithms may be marginal or not recommended for new crypto systems, remember that those systems have to withstand an attacker INTENTIONALLY trying to create different data with the same hash. That’s a completely different situation.
Frankly, you’re far more likely to get a bad block (unrecoverable data), or a bad block that looks good (as above), than a random hash collision.
August 1st, 2006 at 5:14 pm
UDP as a bulk data transfer mechanism has been commercialized by Digital Fountain (http://www.digitalfountain.com/). Basically (I believe) they break the transfer up into buckets and the buckets into a lot of little chunks and keep sending chunks until they get a message that a full bucket has been created. Research results from so-called byzantine protocols are used in the chunking so that each chunk carries information from the entire bucket so that each piece that gets through can contribute to the reassembly without all having to get through. This is both faster and more efficient than TCP’s implementation (send-ack-resend if needed). John’s comment that you would just end up re-creating TCP shows a lack of imagination. There are other reliable protocols that can run atop IP than just TCP.
September 7th, 2006 at 1:48 pm
[…] After evaluating Chronosync for a month, the evaluation period is up and it’s time to make a decision. To buy or not to buy, that is the question. I think the answer is no. Chronosync is too slow and too complex to justify paying for. […]