Nathan: yes. Flipping each bit and recomputing the checksum is not only possible, we actually did it in early versions of the code. The problem is that it's really expensive. For a 128K block, that's a million bits, so you have to re-run the checksum a million times, on 128K of data. That's 128GB of data to churn through.
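For the curious, here's roughly what that brute-force pass looks like. This is a minimal sketch only, assuming a generic 64-bit checksum; the names (try_correct_single_bit, compute_checksum) are hypothetical and not the actual early ZFS code:

    /*
     * Brute-force single-bit correction sketch (hypothetical names):
     * flip each bit in turn, recompute the checksum over the whole
     * buffer, and keep the flip if it matches.  For a 128K block
     * that's ~1,000,000 full checksum passes -- the 128GB of churn
     * described above.
     */
    #include <stdint.h>
    #include <stddef.h>

    /* stand-in for whatever checksum the block was written with */
    extern uint64_t compute_checksum(const uint8_t *buf, size_t len);

    static int
    try_correct_single_bit(uint8_t *buf, size_t len, uint64_t expected)
    {
            for (size_t off = 0; off < len; off++) {
                    for (int bit = 0; bit < 8; bit++) {
                            buf[off] ^= (uint8_t)(1 << bit);        /* flip */
                            if (compute_checksum(buf, len) == expected)
                                    return (1);                     /* fixed */
                            buf[off] ^= (uint8_t)(1 << bit);        /* undo */
                    }
            }
            return (0);     /* not a correctable single-bit error */
    }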
So Bob: you're right too. It's generally much cheaper to retry the I/O, try another disk, try a ditto block, etc. That said, when all else fails, a 128GB computation is a lot cheaper than a restore from tape.

At some point it becomes a bit philosophical. Suppose the block in question is a single user data block. How much of the machine should you be willing to dedicate to getting that block back? I mean, suppose you knew that it was theoretically possible, but would consume 500 hours of CPU time during which everything else would be slower -- and the affected app's read() system call would hang for 500 hours. What is the right policy? There's no one right answer. If we were to introduce a feature like this, we'd need some admin-settable limit on how much time to dedicate to it.

For some checksum functions like fletcher2 and fletcher4, it is possible to do much better than brute force because you can compute an incremental update -- that is, you can compute the effect of changing the nth bit without rerunning the entire checksum. This is, however, not possible with SHA-256 or any other secure hash.
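To illustrate the incremental-update idea, here's a sketch for a fletcher4-style checksum. It is simplified, and the struct and function names are made up -- this is not the actual ZFS fletcher_4 code -- but it shows why the shortcut exists: each of the four running sums is a linear function of the input words, so changing one word moves the checksum by a closed-form amount.

    /*
     * Incremental update for a fletcher4-style checksum (sketch).
     * Over 32-bit words w[0..n-1], the four sums are, modulo 2^64:
     *      A = sum w[j]
     *      B = sum (n-j) * w[j]
     *      C = sum (n-j)(n-j+1)/2 * w[j]
     *      D = sum (n-j)(n-j+1)(n-j+2)/6 * w[j]
     * so changing one word by "delta" (for a single-bit flip, delta is
     * plus or minus a power of two) changes each sum by delta times a
     * fixed weight -- no need to re-read the rest of the block.
     */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t a, b, c, d; } cksum4_t;   /* hypothetical */

    static void
    fletcher4_update_word(cksum4_t *ck, size_t n_words, size_t j,
        uint32_t old_word, uint32_t new_word)
    {
            uint64_t delta = (uint64_t)new_word - (uint64_t)old_word; /* mod 2^64 */
            uint64_t m = n_words - j;   /* w[j] is folded into B on m iterations */

            /* the weight products fit in 64 bits for blocks up to 128K */
            ck->a += delta;
            ck->b += delta * m;
            ck->c += delta * (m * (m + 1) / 2);
            ck->d += delta * (m * (m + 1) * (m + 2) / 6);
    }

With something like this, testing all ~1,000,000 candidate single-bit flips becomes a million O(1) updates instead of a million passes over 128K. A secure hash like SHA-256 is deliberately built so that no such shortcut exists, which is why the trick stops there.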
We ended up taking that code out because single-bit errors didn't seem to arise in practice, and in testing, the error correction had a rather surprising unintended side effect: it masked bugs in the code!

The nastiest kind of bug in ZFS is something we call a future leak, which is when some change from txg (transaction group) 37 ends up going out as part of txg 36. It normally wouldn't matter, except if you lost power before txg 37 was committed to disk. On reboot you'd have inconsistent on-disk state (all of 36 plus random bits of 37). We developed coding practices and stress tests to catch future leaks, and as far as I know we've never actually shipped one. But they are scary.

If you *do* have a future leak, it's not uncommon for it to be a very small change -- perhaps incrementing a counter in some on-disk structure. The thing is, if the counter is going from even to odd, that's exactly a one-bit change. The single-bit error correction logic would happily detect these and fix them up -- not at all what you want when testing! (Of course, we could turn it off during testing -- but then we wouldn't be testing it.)

All that said, I'm still occasionally tempted to bring it back. It may become more relevant with flash memory as a storage medium.

Jeff

On Sun, Mar 02, 2008 at 05:28:48PM -0600, Bob Friesenhahn wrote:
> On Mon, 3 Mar 2008, Nathan Kroenert wrote:
> > Speaking of expensive, but interesting things we could do -
> >
> > From the little I know of ZFS's checksum, it's NOT like the ECC
> > checksum we use in memory in that it's not something we can use to
> > determine which bit flipped in the event that there was a single bit
> > flip in the data. (I could be completely wrong here... but...)
>
> It seems that the emphasis on single-bit errors may be misplaced. Is
> there evidence which suggests that single-bit errors are much more
> common than multiple bit errors?
>
> > What is the chance we could put a little more resilience into ZFS such
> > that if we do get a checksum error, we systematically flip each bit in
> > sequence and check the checksum to see if we could in fact proceed
> > (including writing the data back correctly.).
>
> It is easier to retry the disk read another 100 times or store the
> data in multiple places.
>
> > Or build into the checksum something analogous to ECC so we can choose
> > to use NON-ZFS protected disks and paths, but still have single bit flip
> > protection...
>
> Disk drives commonly use an algorithm like Reed Solomon
> (http://en.wikipedia.org/wiki/Reed-Solomon_error_correction) which
> provides forward-error correction. This is done in hardware. Doing
> the same in software is likely to be very slow.
>
> > What do others on the list think? Do we have enough folks using ZFS on
> > HDS / EMC / other hardware RAID(X) environments that might find this useful?
>
> It seems that since ZFS is intended to support extremely large storage
> pools, available energy should be spent ensuring that the storage pool
> remains healthy or can be repaired. Loss of individual file blocks is
> annoying, but loss of entire storage pools is devastating.
>
> Since raw disk is cheap (and backups are expensive), it makes sense to
> write more redundant data rather than to minimize loss through exotic
> algorithms. Even if RAID is not used, redundant copies may be used on
> the same disk to help protect against block read errors.
>
> Bob
> ======================================
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/