On Mar 19, 2010, at 7:07 PM, zfs ml wrote:

> Most discussions I have seen about RAID 5/6 and why it stops "working" seem
> to base their conclusions solely on single-drive characteristics and
> statistics. It seems to me there is a missing component in the discussion of
> drive failures in the real-world context of a system that lives in an
> environment shared by all the system components - for instance, the video of
> the disks slowing down when they are yelled at is a good visual example of
> the negative effect of vibration on drives.
> http://www.youtube.com/watch?v=tDacjrSCeq4
>
> I thought the Google and CMU papers talked about a surprisingly high (higher
> than expected) rate of multiple drive failures of drives "nearby" each other,
> but I couldn't find it when I re-skimmed the papers now.
>
> What are people's experiences with multiple drive failures? Given that we
> often use same brand/model/batch drives (even though we are not supposed to),
> the same enclosure, the same rack, etc. for a given RAID 5/6/z1/z2/z3 system,
> should we be paying more attention to harmonics, vibration/isolation, and
> non-intuitive system-level statistics that might be inducing close-proximity
> drive failures rather than just throwing more parity drives at the problem?

Yes :-)

Or to put this another way, when you have components in a system that are very
reliable, the system failures become dominated by failures that are not
directly attributed to the components. This is fallout from the notion of
"synergy," or the whole is greater than the sum of the parts.

    synergy (noun): the interaction or cooperation of two or more
    organizations, substances, or other agents to produce a combined effect
    greater than the sum of their separate effects.

> What if our enclosure and environmental factors increase the system-level
> statistics for multiple drive failures beyond the (used by everyone) single
> drive failure statistics to the point where it is essentially negating the
> positive effect of adding parity drives?

Statistical studies or reliability predictions for components do not take into
account causes such as factory contamination, environment, shipping/handling
events, etc. The math is a lot easier if you can forget about such things.
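For a rough feel of what that omission costs, here is a back-of-the-envelope
sketch in Python. Every number in it is made up for illustration (8 drives, 3%
per-drive failure probability over some window, 0.2% chance of a shared event
such as vibration, a power supply, or firmware taking out more drives than any
parity level covers), and it ignores rebuild windows entirely - it only
compares "more than k drives fail in the same window" with and without a
simple shared-cause term:

# Back-of-the-envelope comparison: independent drive failures vs. a simple
# shared-cause model.  Every number here is made up for illustration, and
# rebuild windows, latent sector errors, etc. are ignored.
from math import comb

def p_more_than_k_fail(n, k, p):
    # P(more than k of n drives fail in the window), assuming independence
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 8             # drives in the vdev (assumed)
p = 0.03          # assumed per-drive failure probability over the window
p_shared = 0.002  # assumed probability of a shared event (vibration, PSU,
                  # firmware, ...) that defeats any parity level outright

print("parity   independent   with shared cause")
for k in (1, 2, 3):   # raidz1 / raidz2 / raidz3
    indep = p_more_than_k_fail(n, k, p)
    shared = p_shared + (1 - p_shared) * indep
    print(f"{k:>6}   {indep:11.2e}   {shared:17.2e}")

With numbers like these, raidz3 beats raidz2 by more than an order of
magnitude when failures are independent, but the combined figure barely moves,
because the shared-cause term has become the floor - which is exactly the
concern raised above.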
> I realize this issue is not addressed because there is too much variability
> in the environments, etc., but I thought it would be interesting to see if
> anyone has experienced much in terms of close time proximity, multiple drive
> failures.

I see this on occasion. However, the cause is rarely attributed to a bad batch
of drives. More common is power supplies, HBA firmware, cables, Pepsi
syndrome, or similar.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010  http://nexenta-vegas.eventbrite.com