On 4/13/2012 8:12 AM, Jim Lawson wrote:
> On 04/13/2012 08:33 AM, Stan Hoeppner wrote:
>>> What I meant wasn't the drive throwing uncorrectable read errors but
>>> the drives are returning different data that each think is correct or
>>> both may have sent the correct data but one of the set got corrupted
>>> on the fly. After reading the articles posted, maybe the correct term
>>> would be the controller receiving silently corrupted data, say due to
>>> bad cable on one.
>> This simply can't happen. What articles are you referring to? If the
>> author is stating what you say above, he simply doesn't know what he's
>> talking about.
>
>
> ?! Stan, are you really saying that silent data corruption "simply
> can't happen"?

Yes, I did. Did you read the context in which I made that statement?

> People who have been studying this have been talking
> about it for years now.

Yes, they have. Did you miss the paragraph where I stated exactly that?
Did you also miss the part about the probability of such corruption
being dictated by total storage system size and access rate?

> It can happen in the same way that Emmanuel
> describes.

No, it can't. Not in the way Emmanuel described. I already stated the
reason, and all of this research backs my statement. You won't see this
with a 2 drive mirror, or a 20 drive RAID10. Not until each drive has a
capacity in the 15TB+ range, if not more, and again, depending on the
total system size.

This doesn't address the "RAID5" write hole, better described as the
"parity RAID" write hole, which is a separate issue. It is also one of
the reasons I don't use parity RAID.

Barring an actual controller firmware bug, or an mdraid or lvm bug,
you'll never see this on small scale systems.

> USENIX FAST08:
>
> http://static.usenix.org/event/fast08/tech/bairavasundaram.html
>
> CERN:
>
> http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
>
> http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
>
> LANL:
>
> http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michalak.pdf
>
> There are others if you search for it. This problem has been well-known
> in large (petabyte+) data storage systems for some time.

And again, this is the crux of it. One doesn't see this problem until
one hits extreme scale, which I spent at least a paragraph or two
explaining, referencing the same research. Please re-read my post at
least twice, critically. Then tell me if I've stated anything
substantively different from what any of these researchers have.

The statements "shouldn't", "wouldn't" and "can't" are based on
probabilities. "Can't" or "won't" does not need to mean probability 0.
The probability of this type of silent data corruption occurring on a
2 disk or 20 disk array of today's drives is not zero over 10 years, but
it is so low that the effective statement is that you "can't" or "won't"
see this corruption. As I said, when we reach 15-30TB+ disk drives, this
may change for small count arrays.

--
Stan
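
To put a rough shape on the scaling argument above, here is a minimal sketch of how the chance of seeing at least one silent corruption event grows with the amount of data a system moves. Everything numeric in it is an assumption made for illustration: the per-byte rate is a placeholder and not a figure from any of the cited studies, the data volumes are invented, and modelling events as an independent Poisson process is a simplification.

```python
import math


def p_at_least_one(events_per_byte: float, bytes_accessed: float) -> float:
    """Chance of seeing one or more events, treating them as a Poisson process."""
    expected_events = events_per_byte * bytes_accessed
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_events)


# ASSUMPTION: placeholder rate, not a measured or published figure.
HYPOTHETICAL_EVENTS_PER_BYTE = 1e-18

TB = 1e12  # bytes

# ASSUMPTION: illustrative amounts of data moved over a system's lifetime.
examples = [
    ("small mirror", 100 * TB),
    ("20 drive array", 10_000 * TB),
    ("petabyte-scale system", 10_000_000 * TB),
]

for name, bytes_accessed in examples:
    p = p_at_least_one(HYPOTHETICAL_EVENTS_PER_BYTE, bytes_accessed)
    print(f"{name:>22}: P(at least one event) = {p:.3g}")
```

With any fixed per-byte rate the probability is never exactly zero, but it stays negligible until the volume of data accessed gets very large. That is the sense in which "can't" is used above.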