> From: zfs-discuss-boun...@opensolaris.org
> [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> 2012-01-15 19:38, Edward Ned Harvey wrote:
> >> 1) How does raidzN protect against bit-rot without known full
> >> death of a component disk, if it at all does?
> > zfs can read disks 1,2,3,4... Then read disks 1,2,3,5...
> > Then read disks 1,2,4,5... ZFS can figure out which disk
> > returned the faulty data, UNLESS the disk actually returns
> > correct data upon subsequent retries.
>
> Makes sense, if ZFS does actually do that ;)
>
> Counter-examples:
> 1) For several scrubs in a row, my pool consistently found two
> vdev errors and one pool error with zero per-disk errors
> (further leading to error in some object <metadata>:<0x0>).
> If the disk-read errors were transient, sometimes returning
> correct data (i.e. bad sector relocation was successful in
> the background), ZFS would receive good blocks on further
> scrubs - shouldn't it?

I can't say this is the explanation for your situation, but I can offer it as one possible explanation: Suppose your system is in operation, and you get corruption in your CPU or RAM, so it calculates the wrong cksum for the data that is about to be written. The data gets written, along with the wrong cksum. Later, you come along and read that data. You discover the cksum error, and it's unrecoverable, but there are no disk errors: every disk faithfully returns exactly what it was given, so no combination of data and parity disks reconstructs a block that matches the stored cksum.

I have certainly experienced CPUs that perform incorrect calculations before, and I have certainly encountered errant memory before. Usually when a component starts failing like that, it gets progressively worse (or at least you can usually run some diag utils), and you can identify the failing component. But not always.

Such failures can happen undetected with or without ECC memory. It's simply less likely with ECC. The whole thing about ECC memory... It's a small error-correcting code, typically just a few check bits per memory word, enough to correct one flipped bit and detect two. As a checksum, that's very weak. If corruption in memory hits more bits than that, it's FAR more likely to go undetected by ECC than by the Fletcher or SHA checksum that ZFS uses. And ECC on the DIMMs does nothing about a CPU that computes the wrong answer.

Even when you get down to the actual disk... All disks store error-correction information alongside the data, and use FEC to detect and correct the errors they encounter (this is even stronger than ECC memory). But nothing's perfect, not even SHA... But the error-detection strength of Fletcher or SHA is far, far greater than that of the ECC or FEC being used by your memory and disks.
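
To make both points concrete, here is a toy sketch in Python. Treat it as an illustration under assumptions, not as what ZFS does internally: the fletcher4 function only follows the general shape of a Fletcher-4 checksum (four running 64-bit sums over 32-bit little-endian words) and is not taken from the ZFS source; the single parity bit is a deliberately crude stand-in for "a weak check", not a model of real ECC DIMMs; and helper names such as flip_bits are made up for this example.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data: bytes) -> tuple:
    """Fletcher-4 style checksum: four running 64-bit sums over 32-bit LE words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)


def parity_bit(data: bytes) -> int:
    """One even-parity bit over the whole buffer (hypothetical weak check)."""
    return bin(int.from_bytes(data, "little")).count("1") & 1


def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the given bit positions inverted."""
    buf = bytearray(data)
    for pos in positions:
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


block = bytes(range(64))

# Scenario 1: a bit flips in RAM while the cksum is being computed.
# The intact data and the wrong cksum both reach the disk.
stored_data = block
stored_cksum = fletcher4(flip_bits(block, [13]))

# Every later read returns exactly what was written, so verification
# fails the same way each time, with no disk-level error to blame.
reads_ok = [fletcher4(stored_data) == stored_cksum for _ in range(3)]

# Scenario 2: two bits flip. The parity bit is unchanged and sees nothing;
# the Fletcher-style sums change and catch it.
damaged = flip_bits(block, [5, 300])
parity_detects = parity_bit(damaged) != parity_bit(block)
fletcher_detects = fletcher4(damaged) != fletcher4(block)

print("reads verify:", reads_ok)
print("parity detects 2-bit flip:", parity_detects)
print("fletcher detects 2-bit flip:", fletcher_detects)
```

The first part is the "bad cksum written to disk" case: retrying the read or scrubbing again cannot help, because the disks are not the component that was wrong. The second part only shows the difference in kind between a one-bit check and a wide checksum; it says nothing about the actual undetected-error rates of real hardware.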