On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:

> 2012-01-21 0:33, Jim Klimov wrote:
>> 2012-01-13 4:12, Jim Klimov wrote:
>>> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>>
>> Well, after some crawling over my data with zdb, od and dd, I guess ZFS was right about finding checksum errors - the metadata's checksum matched that of a block on the original system, and the data block was indeed in error.
>
> Well, as I'm moving to close my quest with broken data, I'd like to draw up some conclusions and RFEs. I am still not sure whether they are factually true, as I'm still learning the ZFS internals. So "it currently seems to me that":
>
> 1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the block pointers for some of my data were written, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.

Untrue. If a block pointer is corrupted, then on read it will be logged and ignored. I'm not sure you have grasped the concept of checksums in the parent object.
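(For context, a minimal sketch - the pool name "tank" is a placeholder, not from this thread - of how such mismatches surface to the administrator: a scrub walks the block-pointer tree and re-verifies every checksum, and affected files are listed afterwards.)

    # re-read and verify every checksum in the pool
    zpool scrub tank

    # after the scrub completes, show per-vdev error counters and
    # any files with permanent (unrepairable) errors
    zpool status -v tank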
> 2) In the case where the on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.

I think your analysis is incomplete. Have you determined the root cause?

> 3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") in fact failed, probably because the matching checksum was already in the DDT.

Works as designed.

> Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.
>
> Failure/success were tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)

No, I think you've missed the root cause. By default, data that does not match its checksum is not used.

> My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(
>
> 3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts.
>
> Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.

I have lobbied for this, but so far people prefer performance to dependability.
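(A minimal sketch of the repair procedure discussed above - the pool, dataset, host and file names are placeholders, not from this thread:)

    # make new writes verify byte-for-byte against any DDT match
    # before sharing a block, rather than trusting the checksum alone
    zfs set dedup=verify tank/data

    # rewrite the damaged file from a known-good copy
    rsync -a backuphost:/export/data/file.bin /tank/data/file.bin

    # flush the pending TXG, then re-check well after the write,
    # so the result is not served from cache
    sync
    md5sum /tank/data/file.bin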
> 3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and to add ditto copies every "N" references for frequently-referenced data (every 64-128).
>
> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.

All checksum mismatches are handled the same way.

> I think such an alarm should exist and do as much as a scrub, read or other means of error detection and recovery would.

Checksum mismatches are logged. What was your root cause?

> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.

There are highly-reliable systems that do this in the fault-tolerant market.

> If the verification fails, currently cached data from recent TXGs can be recovered from on-disk redundancy and/or may still exist in the RAM cache, and can be rewritten again (and tested again).
>
> More importantly, a failed test *may* mean that the write landed on disk randomly, and the pool should be scrubbed ASAP. It may be guessed that the yet-unknown error lies within "epsilon" tracks (sector numbers) of the currently found non-written data, so if it is possible to scrub just a portion of the pool based on DVAs - that's a preferred start. It is possible that some data can be recovered if it is tended to ASAP (i.e. on mirror, raidz, copies>1)...
>
> Finally, I should say I'm sorry for lame questions arising from not reading the format spec and zdb blogs carefully ;)
>
> In particular, it was my understanding for a long time that block pointers each have a sector of their own, leading to the overheads that I've seen. Now I know (and checked) that most of the block-pointer tree is made of larger groupings (128 blkptr_t's in a single 16KB indirect block - each blkptr_t is 128 bytes, so 16384 / 128 = 128 of them fit), reducing the impact of BPs on fragmentation and/or the slack waste of large sectors that I predicted and expected for the past year.
>
> Sad that nobody ever contradicted that (mis)understanding of mine.

Perhaps some day you can become a ZFS guru, but the journey is long...
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422