Most discussions I have seen about RAID 5/6 and why it stops "working" seem to base their conclusions solely on single-drive characteristics and statistics. It seems to me there is a missing component in these discussions: drive failures happen in the real-world context of a system that lives in an environment shared by all of its components. For instance, the video of disks slowing down when they are yelled at is a good visual example of the negative effect of vibration on drives: http://www.youtube.com/watch?v=tDacjrSCeq4

I thought the Google and CMU papers talked about a surprisingly high (higher than expected) rate of multiple failures of drives "nearby" each other, but I couldn't find it when I re-skimmed the papers just now.

What are people's experiences with multiple drive failures? Given that we often use same-brand/model/batch drives (even though we are not supposed to), the same enclosure, the same rack, etc. for a given RAID 5/6/Z1/Z2/Z3 system, should we be paying more attention to harmonics, vibration isolation, and non-intuitive system-level statistics that might be inducing close-proximity drive failures, rather than just throwing more parity drives at the problem?

What if our enclosure and environmental factors push the system-level probability of multiple drive failures so far beyond what the (universally used) single-drive failure statistics predict that they essentially negate the positive effect of adding parity drives?
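To make the question concrete, here is a toy Monte Carlo sketch (Python, with entirely made-up numbers: an 8-drive raidz2, a 3% annual per-drive failure rate, and a 10x failure multiplier when the whole enclosure is in a "bad" shared environment). It is only meant to illustrate how a modest amount of shared-environment correlation could swamp the independent-failure math that the extra parity drives were sized against, not to model any real hardware.

    import random

    def array_loss_prob(n_drives=8, parity=2, p_annual=0.03,
                        stress_prob=0.0, stress_mult=10.0, trials=200_000):
        """Estimate the yearly probability that more drives fail than parity covers.

        Correlation is modelled crudely as a shared "bad environment" event
        (vibration, heat, bad batch) that multiplies every drive's failure
        probability at once. All parameter values are illustrative guesses.
        """
        losses = 0
        for _ in range(trials):
            if random.random() < stress_prob:
                p_eff = min(1.0, p_annual * stress_mult)  # whole enclosure stressed
            else:
                p_eff = p_annual                           # drives fail independently
            failed = sum(random.random() < p_eff for _ in range(n_drives))
            if failed > parity:
                losses += 1
        return losses / trials

    print(array_loss_prob(stress_prob=0.0))   # baseline: independent drives
    print(array_loss_prob(stress_prob=0.05))  # 5% chance of a shared bad environment

Even with only a 5% chance that the enclosure is a "bad" environment, the correlated case dominates the data-loss probability, which is the effect I am asking whether people actually see in practice.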

I realize this issue is not usually addressed because there is too much variability in the environments, etc., but I thought it would be interesting to see if anyone has experienced much in the way of multiple drive failures in close time proximity.