Comments on: How to write network backup software: a lesson in practical optimization

By: Mokka mit Schlag » Chronosync: Final Answer

Mokka mit Schlag » Chronosync: Final Answer — Thu, 07 Sep 2006 18:48:37 +0000

[…] After evaluating Chronosync for a month, the evaluation period is up and it’s time to make a decision. To buy or not to buy, that is the question. I think the answer is no. Chronosync is too slow and too complex to justify paying for. […]

By: verisimilidude

verisimilidude — Tue, 01 Aug 2006 22:14:23 +0000

UDP as a bulk data transfer mechanism has been commercialized by Digital Fountain (http://www.digitalfountain.com/). Basically (I believe) they break the transfer up into buckets and the buckets into a lot of little chunks, and keep sending chunks until they get a message that a full bucket has been reassembled. Research results on erasure codes are used in the chunking, so that each chunk carries information from the entire bucket and every chunk that gets through contributes to the reassembly without all of them having to arrive. This is both faster and more efficient than TCP’s approach (send, ack, resend if needed). John’s comment that you would just end up re-creating TCP shows a lack of imagination. TCP is not the only reliable protocol that can run atop IP.
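The idea that any chunk that gets through helps with reassembly can be sketched with a random linear fountain code over GF(2). This is only an illustration and is not claimed to be Digital Fountain's actual scheme, which the comment does not describe in detail. The names `make_chunk`, `Decoder`, and `transfer` are hypothetical, blocks are represented as Python integers, and the chunk carries its subset mask directly (a real protocol would more likely send a seed).

```python
import random

def make_chunk(blocks, seed):
    """Encode one chunk: the XOR of a pseudo-random subset of the blocks."""
    k = len(blocks)
    # Bit i of mask set => block i is XORed into this chunk. Never all-zero.
    mask = random.Random(seed).getrandbits(k) or 1
    payload = 0
    for i, block in enumerate(blocks):
        if (mask >> i) & 1:
            payload ^= block
    return mask, payload

class Decoder:
    """Collects chunks and solves for the blocks by Gaussian elimination."""

    def __init__(self, k):
        self.k = k
        self.basis = {}  # highest set bit of mask -> (mask, payload)

    def add(self, mask, payload):
        """Return True if the chunk added new information."""
        while mask:
            pivot = mask.bit_length() - 1
            if pivot not in self.basis:
                self.basis[pivot] = (mask, payload)
                return True
            bmask, bpayload = self.basis[pivot]
            mask ^= bmask
            payload ^= bpayload
        return False  # chunk was a combination of ones already seen

    def done(self):
        return len(self.basis) == self.k

    def blocks(self):
        """Back-substitute from the lowest pivot upward."""
        out = [0] * self.k
        for pivot in range(self.k):
            mask, payload = self.basis[pivot]
            for j in range(pivot):
                if (mask >> j) & 1:
                    payload ^= out[j]
            out[pivot] = payload
        return out

def transfer(blocks, drop_every=3):
    """Keep sending chunks, losing every drop_every-th one, until decoded."""
    decoder = Decoder(len(blocks))
    sent = 0
    while not decoder.done():
        chunk = make_chunk(blocks, seed=sent)
        sent += 1
        if sent % drop_every == 0:
            continue  # simulated packet loss; no retransmission needed
        decoder.add(*chunk)
    return decoder.blocks(), sent
```

The sender never needs to know which chunks were lost: it keeps emitting fresh combinations until the receiver reports that the bucket is complete.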

By: Randall

Randall — Sun, 30 Jul 2006 15:30:33 +0000

Michael,

The rare instances of equal checksums but unequal data are covered by the collision statistics of your checksum algorithm. For MD5, SHA-1, SHA-256, CRC, etc., refer to the relevant RFC or standard.

If a collision randomly occurs once in 2**80 operations, that’s still many orders of magnitude rarer than a block error on your hard drive. Since hard drives have integrity checks (i.e. CRCs) on the data of each raw block, getting a bad block (incorrect data) that passes the CRC integrity check (appears good) is a far more likely occurrence. If such a block randomly appeared in a data comparison, you’d back it up as the latest change to the file.

Even though MD5 or other algorithms may be marginal or not recommended for new crypto systems, remember that those systems have to withstand an attacker INTENTIONALLY trying to create different data with the same hash. That’s a completely different situation.

Frankly, you’re far more likely to get a bad block (unrecoverable data), or a bad block that looks good (as above), than a random hash collision.
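To put rough numbers on the random-collision argument, here is a small sketch using the standard birthday approximation. It assumes hash outputs behave like uniformly random values (no attacker), and the function name is hypothetical. A figure like 2**80 corresponds roughly to the point where a 160-bit hash reaches even odds of some collision.

```python
import math

def random_collision_probability(num_blocks, hash_bits):
    """Approximate chance that any two of num_blocks random hashes collide.

    Birthday approximation: p ~= 1 - exp(-n(n-1)/2 / 2**bits).
    """
    pairs = num_blocks * (num_blocks - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2.0 ** hash_bits)

# 2**40 blocks (about a trillion) hashed with a 128-bit function such as MD5
p_md5 = random_collision_probability(2 ** 40, 128)
print(f"128-bit hash, 2**40 blocks: p ~= {p_md5:.2e}")
```

Even at a trillion blocks the estimate for a 128-bit hash stays far below one in a trillion, which is the sense in which accidental collisions can be ignored next to hardware errors.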

By: Adam Rosien

Adam Rosien — Wed, 26 Jul 2006 15:29:22 +0000

There is a famous paper called “End-to-end arguments in system design” that addresses this very issue. Wikipedia has a nice summary and links: http://en.wikipedia.org/wiki/End-to-end_principle

By: Michael

Michael — Wed, 26 Jul 2006 00:36:37 +0000

There is existing software that uses essentially a checksum comparison to test for data equality across a network (rather than moving and comparing the actual data). But what about those instances, rare as they might be, when the checksums of two different pieces of data are equal (e.g., MD5 collisions)? Checksums are good for detecting accidental data corruption, but can they be a truly robust means of comparing data?

(ps: I have a practical interest in this, since I do exactly this.)
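For readers unfamiliar with the technique being asked about, a minimal sketch of checksum-based comparison follows. It assumes fixed-size blocks and SHA-256 from Python's `hashlib`; the block size and the function names are hypothetical and are not taken from any particular backup tool.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration

def block_digests(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block; only these digests cross the network."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def changed_blocks(local_data, remote_digests, block_size=BLOCK_SIZE):
    """Indices of local blocks whose digest differs from the remote copy.

    Blocks past the end of the remote list count as changed (file grew).
    """
    local = block_digests(local_data, block_size)
    return [i for i, digest in enumerate(local)
            if i >= len(remote_digests) or digest != remote_digests[i]]
```

Two blocks are treated as equal exactly when their digests match, so the question in this comment is whether that shortcut can ever report "unchanged" for data that actually differs.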

By: John Cowan

John Cowan — Sat, 22 Jul 2006 17:43:19 +0000

UDP? Just Say No!

Really, UDP is a terrible protocol for bulk transfer. You basically have three choices:

1) Send out everything as fast as you can with no waiting for acknowledgements. This is fine over a dedicated LAN or serial link, where you don’t care if you saturate it. It is unbelievably awful over an ordinary LAN, because other traffic will not be able to get through.

2) Ping-pong protocol: send a packet, wait for an ack, send a packet …. That’s way too slow. The application gets involved too often, especially at the receiving end.

3) Try to do better than either of these by being clever. You will end up reinventing TCP, badly.

All of these things were discovered almost as soon as Ethernet was invented, before there was even a way to do TCP/IP over Ethernet, using simple MAC-level protocols. TCP is a subtle and clever protocol that has been designed for this job over many iterations.

UDP is meant for one-shot notification or request/reply actions. Even DNS, the prototypical UDP protocol, uses TCP in order to do bulk transfer between primary and secondary servers. The only time you want to do bulk transfer over UDP is when you are trying to boot over the network using a crowded boot ROM that simply can’t implement the whole TCP stack, and then you use a ping-pong UDP protocol (TFTP) because you don’t care very much about performance when booting.
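The cost of the ping-pong approach in option 2 can be estimated with simple arithmetic: with one packet in flight, throughput cannot exceed one payload per round trip. The sketch below is a simplified model that ignores transmission time, loss, and header overhead; the function names and the example figures (50 ms round trip, 100 Mbit/s link, 64-packet window) are assumptions chosen for illustration. The 512-byte payload matches TFTP's default block size.

```python
def stop_and_wait_throughput(payload_bytes, rtt_seconds):
    """Upper bound in bytes/s when each packet waits for its own ack."""
    return payload_bytes / rtt_seconds

def windowed_throughput(payload_bytes, window_packets, rtt_seconds,
                        link_bytes_per_sec):
    """Upper bound in bytes/s with window_packets in flight per round trip,
    capped by the capacity of the link itself."""
    in_flight_rate = window_packets * payload_bytes / rtt_seconds
    return min(in_flight_rate, link_bytes_per_sec)

# Ping-pong with 512-byte blocks over a 50 ms round trip
ping_pong = stop_and_wait_throughput(512, 0.05)

# A sliding window of 64 packets of 1460 bytes on a 100 Mbit/s link
windowed = windowed_throughput(1460, 64, 0.05, 12.5e6)

print(f"ping-pong: {ping_pong / 1024:.0f} KiB/s")
print(f"windowed:  {windowed / 1024:.0f} KiB/s")
```

Under these assumptions the ping-pong transfer tops out at about 10 KiB/s no matter how fast the link is, while keeping many packets in flight raises the bound by more than two orders of magnitude.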