Miles Nordin wrote:
>>>>>> "jcm" == James C McPherson <[EMAIL PROTECTED]> writes:
>>>>>> "thp" == Todd H Poole <[EMAIL PROTECTED]> writes:
>>>>>> "mh" == Matt Harrison <[EMAIL PROTECTED]> writes:
>>>>>> "js" == John Sonnenschein <[EMAIL PROTECTED]> writes:
>>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>>> "cg" == Carson Gaspar <[EMAIL PROTECTED]> writes:
>>>>>>             
>
>    jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
>    jcm> above, IDE is not designed to be able to cope with [unplugging
>    jcm> a cable]
>
> It shouldn't have to be designed for it, if there's controller
> redundancy.  On Linux, one drive per IDE bus (not using any ``slave''
> drives) seems like it should be enough for any electrical issue, but
> is not quite good enough in my experience, when there are two PATA
> busses per chip.  but one hard drive per chip seems to be mostly okay.
> In this SATA-based case, not even that much separation was necessary
> for Linux to survive on the same hardware, but I agree with you and
> haven't found that level with PATA either.
>
> OTOH, if the IDE drivers are written such that a confusing interaction
> with one controller chip brings down the whole machine, then I expect
> the IDE drivers to do better.  If they don't, why advise people to buy
> twice as much hardware ``because, you know, controllers can also fail,
> so you should have some controller redundancy''---the advice is worse
> than a waste of money, it's snake oil---a false sense of security.
>   

No snake oil.  Pulling cables only simulates pulling cables.  If you
are having difficulty with cables falling out, then this problem cannot
be solved with software.  It *must* be solved with hardware.

But the main problem with "simulating disk failures by pulling cables"
is that the code paths executed during that test differ from those
executed when the disk fails in other ways.  The issue is not simply
whether the test passes or fails, but what the test actually exercises.
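To make the distinction concrete, here is a toy read path (an illustrative sketch, not ZFS code; all names are invented): a pulled cable surfaces as an outright transport error, while other failure modes return "success" with bad data, and each takes a different branch of the error-handling code.

```python
import hashlib

class PulledCable(Exception):
    """The transport layer reports an outright I/O error."""

def read_block(data, checksum, failure=None):
    """Toy read path: different failure modes exercise different branches."""
    if failure == "pulled_cable":
        # Cable pull: the driver sees a transport error and raises.
        raise PulledCable("no response from device")
    if failure == "silent_corruption":
        # Bit rot: the disk reports success, but the data is wrong.
        data = b"\x00" + data[1:]
    # Only an end-to-end checksum comparison catches the second case;
    # a cable-pull test never reaches this branch at all.
    if hashlib.sha256(data).hexdigest() != checksum:
        return ("checksum_mismatch", None)
    return ("ok", data)
```

A cable-pull test only ever exercises the exception path; the checksum branch stays untested.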

Studies have shown that pulled cables are not the dominant failure
mode in disk populations.  Bairavasundaram et al. [1] showed that
data checksum errors are much more common.  In some internal Sun
studies, we also see unrecoverable reads as the dominant disk failure
mode.  ZFS handles these errors well, regardless of the underlying
OS.  AFAIK, none of the traditional software logical volume managers
nor the popular open source file systems (other than ZFS :-) address
this problem.
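The general idea behind handling these errors can be sketched as a self-healing mirrored read: verify each replica against a stored checksum, return a good copy, and rewrite any replica that fails verification.  This is a simplified illustration of the concept, not ZFS's actual implementation; the function names are invented.

```python
import hashlib

def sha(data):
    """Block checksum; ZFS supports several algorithms, SHA-256 among them."""
    return hashlib.sha256(data).hexdigest()

def mirrored_read(replicas, checksum):
    """Toy self-healing mirror read: return the first replica whose
    checksum matches the stored value, and repair any replica that
    doesn't.  Raises only if *no* replica verifies."""
    good = None
    bad = []
    for i, data in enumerate(replicas):
        if sha(data) == checksum:
            good = data
        else:
            bad.append(i)
    if good is None:
        raise IOError("unrecoverable: no replica matches checksum")
    for i in bad:
        replicas[i] = good  # rewrite the corrupted copy in place
    return good
```

Note that a disk that silently returned bad data is detected here without the drive ever reporting an error, which is exactly the case a cable-pull test cannot reproduce.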

[1] 
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss