On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:
> >
> > What I meant wasn't the drive throwing uncorrectable read errors but
> > the drives are returning different data that each think is correct or
> > both may have sent the correct data but one of the set got corrupted
> > on the fly. After reading the articles posted, maybe the correct term
> > would be the controller receiving silently corrupted data, say due to
> > bad cable on one.
>
> This simply can't happen. What articles are you referring to? If the
> author is stating what you say above, he simply doesn't know what he's
> talking about.

It has happened to me, with RAID5 not RAID1. It was a firmware bug in
the RAID controller that caused the RAID array to go silently corrupted.
The HW reported everything green -- but the filesystem was reporting
lots of strange errors. This LUN was part of a larger filesystem striped
over multiple LUNs, so parts of the fs were OK, while other parts were
corrupt.

It was this bug:

    http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_anyos_anycpu.chg
    - Fix 432525 - CR139339 Data corruption found on drive after
      reconstruct from GHSP (Global Hot Spare)

<snip>

> In closing, I'll simply say this: If hardware, whether a mobo-down SATA
> chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
> or transmission to occur, there would be no storage industry, and we'll
> all still be using pen and paper. The questions you're asking were
> solved by hardware and software engineers decades ago. You're fretting
> and asking about things that were solved decades ago.

Look at what the plans are for your favorite fs:

    http://www.youtube.com/watch?v=FegjLbCnoBw

They're planning on doing metadata checksumming to be sure they don't
receive corrupted metadata from the backend storage, and say that data
validation is a storage subsystem *or* application problem.

Hardly a solved problem.

  -jf
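
PS: A minimal sketch of the general idea behind checksumming a block
before it goes to storage and verifying it when it comes back. This is
not how any particular filesystem or controller does it; the function
names (seal, unseal, flip_bit) and the 4-byte CRC32 header layout are
assumptions made up for illustration.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Prepend a 4-byte big-endian CRC32 of the payload (hypothetical layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unseal(block: bytes) -> bytes:
    """Return the payload, or raise ValueError if the stored CRC does not match."""
    stored = int.from_bytes(block[:4], "big")
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: block changed after it was written")
    return payload


def flip_bit(block: bytes, index: int) -> bytes:
    """Simulate silent corruption by flipping the lowest bit of one byte."""
    damaged = bytearray(block)
    damaged[index] ^= 0x01
    return bytes(damaged)


if __name__ == "__main__":
    block = seal(b"inode 42: size=4096")
    print(unseal(block))              # intact block round-trips
    try:
        unseal(flip_bit(block, 10))   # storage hands back a damaged block
    except ValueError as err:
        print("detected:", err)
```

The point is only that the reader, not the storage layer, decides
whether the data is good: a block that the hardware reports as fine
still fails verification if a single bit has changed underneath.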