Most discussions I have seen about RAID 5/6 and why it stops "working" seem to
base their conclusions solely on single drive characteristics and statistics.
It seems to me there is a missing component in these discussions: in the real
world, the drives live in an environment shared by all the other system
components. For instance, the video of disks slowing down when they are yelled
at is a good visual example of the negative effect of vibration on drives:
http://www.youtube.com/watch?v=tDacjrSCeq4
I thought the Google and CMU papers mentioned a surprisingly high (higher than
expected) rate of multiple failures among drives "nearby" each other, but I
couldn't find it when I re-skimmed the papers just now.
What are people's experiences with multiple drive failures? Given that we
often use same brand/model/batch drives (even though we are not supposed to),
the same enclosure, the same rack, etc. for a given RAID 5/6/z1/z2/z3 system,
should we be paying more attention to harmonics, vibration/isolation, and
non-intuitive system-level statistics that might be inducing close-proximity
drive failures, rather than just throwing more parity drives at the problem?
What if our enclosure and environmental factors push the system-level
probability of multiple drive failures far enough beyond what the single-drive
failure statistics everyone uses would predict that it essentially negates the
benefit of adding parity drives?
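
To make that worry concrete, here is a rough back-of-the-envelope sketch in
Python. All of the numbers (annual failure rate, rebuild window, and the
"correlation" multiplier applied to the surviving drives once the first one
dies) are invented assumptions, and the binomial model is crude, but it shows
how quickly a shared-environment effect can eat the margin that extra parity
buys:

#!/usr/bin/env python
# Crude sketch: chance of losing an 8-drive raidz1/raidz2 group during a
# rebuild, comparing independent drive failures against failures amplified
# by a shared environment (vibration, heat, same batch, ...).
# AFR, rebuild window, and the correlation multiplier are invented numbers.

from math import comb

def p_fail(afr, hours):
    # Probability that one drive fails within 'hours', assuming a constant
    # failure rate derived from the annual failure rate (AFR).
    rate_per_hour = afr / (365.0 * 24.0)
    return 1.0 - (1.0 - rate_per_hour) ** hours

def p_array_loss(n_drives, parity, afr, rebuild_hours, correlation=1.0):
    # One drive has already died; the array is lost if at least 'parity'
    # of the survivors also die before the rebuild finishes.  'correlation'
    # scales the survivors' failure probability (1.0 = fully independent).
    p = min(1.0, p_fail(afr, rebuild_hours) * correlation)
    survivors = n_drives - 1
    loss = 0.0
    for k in range(parity, survivors + 1):
        loss += comb(survivors, k) * p ** k * (1.0 - p) ** (survivors - k)
    return loss

if __name__ == "__main__":
    for corr in (1.0, 5.0, 20.0):
        z1 = p_array_loss(8, 1, afr=0.04, rebuild_hours=24.0, correlation=corr)
        z2 = p_array_loss(8, 2, afr=0.04, rebuild_hours=24.0, correlation=corr)
        print("corr=%5.1f  raidz1 loss=%.2e  raidz2 loss=%.2e" % (corr, z1, z2))

With these made-up numbers, raidz2 in a heavily correlated environment ends up
in the same ballpark as raidz1 with truly independent drives, which is exactly
the kind of erosion I'm asking about.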
I realize this issue is rarely addressed because there is too much variability
in the environments, etc., but I thought it would be interesting to hear
whether anyone has experienced multiple drive failures in close time proximity.