On Mar 19, 2010, at 7:07 PM, zfs ml wrote:

> Most discussions I have seen about RAID 5/6 and why it stops "working" seem
> to base their conclusions solely on single-drive characteristics and
> statistics. It seems to me there is a missing component in the discussion of
> drive failures in the real-world context of a system that lives in an
> environment shared by all the system components - for instance, the video of
> the disks slowing down when they are yelled at is a good visual example of
> the negative effect of vibration on drives.
> http://www.youtube.com/watch?v=tDacjrSCeq4
>
> I thought the Google and CMU papers talked about a surprisingly high (higher
> than expected) rate of multiple drive failures of drives "nearby" each other,
> but I couldn't find it when I re-skimmed the papers now.
>
> What are people's experiences with multiple drive failures? Given that we
> often use same brand/model/batch drives (even though we are not supposed to),
> the same enclosure, the same rack, etc. for a given RAID 5/6/z1/z2/z3 system,
> should we be paying more attention to harmonics, vibration/isolation, and
> non-intuitive system-level statistics that might be inducing close-proximity
> drive failures rather than just throwing more parity drives at the problem?

Yes :-)

Or to put this another way, when you have components in a system that are very
reliable, the system failures become dominated by failures that are not
directly attributed to the components. This is fallout from the notion of
"synergy," or the whole is greater than the sum of the parts.

    synergy (noun): the interaction or cooperation of two or more
    organizations, substances, or other agents to produce a combined effect
    greater than the sum of their separate effects.

> What if our enclosure and environmental factors increase the system-level
> statistics for multiple drive failures beyond the (used by everyone) single
> drive failure statistics to the point where it is essentially negating the
> positive effect of adding parity drives?

Statistical studies or reliability predictions for components do not take into
account causes such as factory contamination, environment, shipping/handling
events, etc. The math is a lot easier if you can forget about such things.
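For a rough feel of what that omission costs, here is a back-of-the-envelope
sketch in Python. Every number in it is made up for illustration (8 drives, 3%
per-drive failure probability over some window, 0.2% chance of a shared event
such as vibration, a power supply, or firmware taking out more drives than any
parity level covers), and it ignores rebuild windows entirely - it only
compares "more than k drives fail in the same window" with and without a
simple shared-cause term:

# Back-of-the-envelope comparison: independent drive failures vs. a simple
# shared-cause model.  Every number here is made up for illustration, and
# rebuild windows, latent sector errors, etc. are ignored.
from math import comb

def p_more_than_k_fail(n, k, p):
    # P(more than k of n drives fail in the window), assuming independence
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 8             # drives in the vdev (assumed)
p = 0.03          # assumed per-drive failure probability over the window
p_shared = 0.002  # assumed probability of a shared event (vibration, PSU,
                  # firmware, ...) that defeats any parity level outright

print("parity   independent   with shared cause")
for k in (1, 2, 3):   # raidz1 / raidz2 / raidz3
    indep = p_more_than_k_fail(n, k, p)
    shared = p_shared + (1 - p_shared) * indep
    print(f"{k:>6}   {indep:11.2e}   {shared:17.2e}")

With numbers like these, raidz3 beats raidz2 by more than an order of
magnitude when failures are independent, but the combined figure barely moves,
because the shared-cause term has become the floor - which is exactly the
concern raised above.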
> I realize this issue is not addressed because there is too much variability
> in the environments, etc., but I thought it would be interesting to see if
> anyone has experienced much in terms of close time proximity, multiple drive
> failures.

I see this on occasion. However, the cause is rarely attributed to a bad batch
of drives. More common is power supplies, HBA firmware, cables, Pepsi
syndrome, or similar.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010  http://nexenta-vegas.eventbrite.com