Why Does Software Assume Hardware is Reliable?

I am getting quite tired of Mac OS X locking itself into the spinning beach ball of death due to failing hardware. In the last year or so I’ve had two LaCie hard drives go out on me, and a third seems to be going now. In each case, the Mac got thrown into a completely confused funk because the drive didn’t respond quickly enough or at all. Note that these drives were used strictly as external backup devices. They didn’t hold any operating system or application files.

Haven’t we learned by now that I/O is an unreliable operation, and that you should never assume that any read call will succeed, or will return data in the format you want? That’s true whether you’re talking about files or FireWire packets. Furthermore, critical components like the Finder should not block on synchronous I/O. Any I/O needs they have should be serviced asynchronously.
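Here is a minimal Python sketch of that idea. The helper name `read_with_timeout` is hypothetical, not any real Finder or macOS API. The possibly blocking read runs on a worker thread, and the caller waits only a bounded time before giving up. One caveat: a read stuck in the kernel cannot be cancelled this way. The sketch frees the caller, not the stuck thread.

```python
import concurrent.futures
import time

# Worker threads that absorb blocking reads so the caller stays responsive.
# A hung read ties up one worker, so keep a few spare (4 is an arbitrary choice).
_io_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_with_timeout(read_fn, timeout_s):
    """Run a possibly blocking read_fn on a worker thread.

    Returns the data read, or None if the device failed (OSError) or
    did not answer within timeout_s seconds.
    """
    future = _io_pool.submit(read_fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker is still stuck inside read_fn and Python cannot interrupt it,
        # but the caller (say, a UI thread) is free to carry on.
        return None
    except OSError:
        # The device answered with an error instead of data.
        return None

if __name__ == "__main__":
    def hung_drive_read():
        time.sleep(1)  # simulate a drive that stops responding
        return b"too late"

    print(read_with_timeout(hung_drive_read, timeout_s=0.2))  # prints None
```

A real file manager would more likely use the platform's own asynchronous I/O facilities. The thread-plus-timeout pattern is just the simplest way to show "never let the caller block forever on a device."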

It’s hard to take the Mac seriously as a reliable server platform when it’s so easily brought to its knees by one misbehaving hard drive.

On the plus side, I just ordered two new 500 GB drives to replace the failing and failed 250 GB drives.* Two 500 GB drives cost almost exactly the same as one 250 GB drive did three years ago. Also, each 500 GB drive is about half the physical size of the old 250 GB drive. That’s progress. 🙂

[Photo: Larger LaCie hard drive next to smaller hard drive]

One downside: the USB drives have only one port, so I can’t daisy-chain them as I did my FireWire drives.

Second downside: the data transfer rate on these drives seems to be roughly half what it was on the FireWire drives. The FireWire drives could back up about half a gigabyte a minute; this one manages only about a quarter gigabyte per minute.
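For a sense of scale, the two rates above convert to megabytes per second as shown in this quick sketch. The 100 GB backup size is a hypothetical round number, not a figure from this post. The steady-rate assumption also ignores per-file overhead.

```python
MB_PER_GB = 1000  # decimal units; use 1024 for binary units

def gb_per_min_to_mb_per_s(gb_per_min):
    """Convert a backup rate in GB per minute to MB per second."""
    return gb_per_min * MB_PER_GB / 60

def minutes_to_copy(gigabytes, gb_per_min):
    """Minutes needed to copy `gigabytes` at a steady rate."""
    return gigabytes / gb_per_min

# Rates quoted above; 100 GB is a hypothetical backup size.
for name, rate in (("FireWire", 0.5), ("USB 2.0", 0.25)):
    print(f"{name}: {gb_per_min_to_mb_per_s(rate):.1f} MB/s, "
          f"{minutes_to_copy(100, rate):.0f} min per 100 GB")
```

That works out to roughly 8.3 MB/s versus 4.2 MB/s. In other words, the same backup takes twice as long.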

The new drives are USB 2.0 instead of FireWire. Maybe the Mac is a little less trustful of data coming in over the USB port. We shall see.

* Actually, the 500 GB drives are really 465.8 GB and the 250 GB drives are really 232.9 GB. LaCie lies about the capacity of both, but it’s a systematic error, so the relative improvement is the same.
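Most of that gap is a matter of units. Drive makers count a gigabyte as 10^9 bytes. The figures above are consistent with the OS reporting capacity in units of 2^30 bytes. A quick check:

```python
DECIMAL_GB = 10**9   # a "gigabyte" as printed on the box
BINARY_GB = 2**30    # 1,073,741,824 bytes (a gibibyte, GiB)

def advertised_to_binary(advertised_gb):
    """Express an advertised (decimal) capacity in binary gigabytes."""
    return advertised_gb * DECIMAL_GB / BINARY_GB

for size in (250, 500):
    print(f"{size} GB advertised = {advertised_to_binary(size):.2f} binary GB")
```

This gives about 232.83 and 465.66. The 232.9 and 465.8 figures reported for the actual drives are slightly higher. That suggests the drives hold a little more than their nominal capacity rather than less.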

13 Responses to “Why Does Software Assume Hardware is Reliable?”

  1. Martin Probst Says:

    The interesting thing is that this happens on every operating system I’ve ever used. The main file explorer application always hangs on some operation. The top candidates are not only failing drives but also network shares with intermittent or broken connectivity. This applies to the Finder, to Windows Explorer, and to the similar Linux tools. It sucks big time; one would assume that, at least for network shares, people figured this out a long time ago.

  2. Where the progress is happening in hardware Says:

    […] While everyone is getting excited about multicore systems, something else is happening: I just ordered two new 500 GB drives to replace the failing and failed 250 GB drives. Both 500 GB drives cost almost exactly the same as one 250 GB drive did three years ago. That’s progress. (Source: Harold) […]

  3. nes Says:

    Preach it, brother! 🙂 Seriously, asynchronous programming is…

  4. Josh Peters Says:

    There is likely an issue with a lack of dedicated controllers too. If all of the I/O buses are on-board and are the Intel equivalent of a WinModem (i.e., the CPU does all of the “real” work for these devices), then it makes sense for a single process to hang the rest of the system.

    If you want a reliable server system, run it on server hardware. The Xserve uses dedicated controllers for its I/O. I would be surprised if it had the same blocking issues that USB does.

  5. nes Says:

    …asynchronous programming is obviously more difficult than synchronous. I guess OS programmers follow the “worse is better” philosophy.

  6. nes Says:

    …and don’t get me started on device drivers. Either hardware and operating systems are messed up, or device driver programmers should be shot. How else can you explain constantly messing up a program that consists of nothing but a simple API layer? Granted, the worst culprits are bloated video and sound card drivers, but I’ve had issues with network cards, modems, printers, and so on too.

  7. bob Says:

    Trust me, USB 2.0 won’t help. If the drive flakes out, it’ll still wedge your Mac. It’s happened to me more than once with flaky USB flash drives, as well as one hard drive.

    I think the problem is the bus, which has to be driven with the right signalling protocol or it will get into weird states and not be able to recover.

    The least intrusive external storage I’ve seen, at least when it starts to die, is NAS on Ethernet. It’s still possible to wedge or hang, but I’ve seen it happen much less often.

  8. Farialima Says:

    In each case, the Mac got thrown into a completely confused funk because the drive didn’t respond quickly enough or at all.

    I think I have experienced the same issue with a LaCie external HD. I was able to recover the disks by reading them with a Windows machine. That would basically “fix” them; then, after using them on the Mac again, I would start having the same issue again.

    So, just a hint if those drives have data you’d like to recover: try reading them with a Windows machine 😮

  9. dgurba Says:

    How is the drive size wrong?

    Most drives say they are 500 GB or 150 GB and they really are… but when you format a drive, a portion of the hard drive becomes occupied by system allocation tables and formatting overhead (journals, etc.). So I guess I’m asking: did LaCie really give you a 465.8 GB drive (in which case I’d be pissed!! :P), and was that drive like 430 GB after formatting it?…

    I’ve never owned a LaCie… but it might be good to know about these small typos 🙂

  10. mind Says:

    When you take apart the enclosures, there’s just a normal EIDE or SATA drive in there. You could reuse your old enclosures if you still want FireWire.

    Storage is expected to be fast, which is why software assumes it has no errors. I really think there’s a market for a media filesystem that makes different assumptions about storage: that it doesn’t need every last drop of performance, can tolerate errors, does RAID-style redundancy across multiple drives, can have drives added or removed, etc.

  11. eric Says:

    Drive manufacturers use “giga” to mean 1,000,000,000 rather than 2^30. This works out to about 465.7 GB for a 500 “Gigabyte” drive. There are probably some extra sectors due to drive geometry, which is why you get 465.8. They aren’t really lying, just using different units than the rest of us, because it makes their drives sound bigger.

  12. bob Says:

    The best solution is RAID 5: it detects when one of your drives fails and rebuilds that drive’s data onto a spare drive that is kept free for the purpose. The more drives you add, the better the ratio of usable space to total space. USB spends 30-some percent of its bandwidth constantly asking things like “is there anything new plugged in yet?” or keeping information about the files it is transferring, so a lot of the bandwidth is lost just to it being USB. eSATA is where it’s at.

  13. bob Says:

    FireWire vs. USB 2.0 – Architecture

    FireWire uses a “Peer-to-Peer” architecture in which the peripherals are intelligent and can negotiate bus conflicts to determine which device can best control a data transfer.

    Hi-Speed USB 2.0 uses a “Master-Slave” architecture in which the computer handles all arbitration functions and dictates data flow to, from and between the attached peripherals (adding additional system overhead and resulting in slower data flow control)

    FireWire vs. USB 2.0 Hard Drive Performance Comparison
    Read and write tests to the same IDE hard drive connected using FireWire and then Hi-Speed USB 2.0 show:

    Read test:

    5000 files (300 MB total): FireWire was 33% faster than USB 2.0
    160 files (650 MB total): FireWire was 70% faster than USB 2.0

    Write test:

    5000 files (300 MB total): FireWire was 16% faster than USB 2.0
    160 files (650 MB total): FireWire was 48% faster than USB 2.0
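Comment 12 mentions RAID 5. The core trick behind its redundancy is XOR parity. The parity block is the XOR of the data blocks in a stripe, so any single missing block can be recomputed from the surviving blocks plus parity. Below is a minimal single-stripe sketch with made-up block contents. Real RAID 5 also rotates which drive holds parity from stripe to stripe, and this sketch ignores that.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: one block on each of three data drives.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)  # stored on a fourth drive

# Suppose drive 1 dies: XOR the survivors with the parity block to rebuild it.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # prints b'EFGH'
```

This also explains the “better ratio” remark. With n drives, one drive’s worth of space goes to parity, so the usable fraction is (n - 1)/n.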