Most of the zfs recovery problems seem to stem from zfs's own strict insistence that data be consistent with its corresponding checksum. That is of course good when consistent data can be recovered from somewhere else, but catastrophic otherwise. It seems clear that zfs must support an inherent worst-case recovery mechanism that brings as much of the file system back online as possible, with speculatively recovered files/blocks marked as potentially compromised so that they can be scrutinized further as desired.
When inconsistent data is returned from storage without any other error, it seems likely that the data was subject to a soft error somewhere along its journey. Therefore it seems that, in order:

- First, both the presumed checksum/indexes and the data should be re-read, in case the error occurred during or after retrieval from storage.
- If that doesn't work, the corruption most likely occurred before the data was stored (the error detection/correction schemes used by disk drives are fairly good at not passing corrupted data off as good). That implies a good bet: either the parent block holding the checksum or the child block holding the data contains a single-bit error. It may then be recoverable by iterating through all possible 1-bit differences in the checksum and data, or in the block pointers and corresponding child blocks, to see whether any candidate satisfies the newly computed checksum. The affected nodes should be marked so that more comprehensive file system consistency checks can be performed later.
- Errors may also have caused the wrong blocks to be written, and multi-bit errors may have occurred during transmission, but it seems unlikely to be worthwhile to keep searching exhaustively for candidates. It is likely best to simply mark the terminal block and its parent file as likely corrupt, and let some other tool attempt user-piloted file fragment recovery.

ZFS's stringent consistency requirements are very nice, but as data is subject to soft errors throughout its transport/storage/use, a file system must be capable of at least attempting to recover what is reasonably recoverable, as sh*t will always happen, and catastrophic failure should be avoided at any reasonable cost.
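To make the single-bit search in the second step concrete, here is a rough sketch in Python. It is not how zfs is implemented: SHA-256 merely stands in for whatever checksum the pool uses, and the names checksum, flip_bit and recover_single_bit are made up for illustration. It first tests whether the stored checksum is one bit away from the computed one, then tries every 1-bit variation of the data block.

```python
import hashlib
from typing import Optional, Tuple


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for the pool's real checksum function.
    return hashlib.sha256(data).digest()


def flip_bit(buf: bytes, bit: int) -> bytes:
    # Return a copy of buf with a single bit inverted.
    out = bytearray(buf)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def recover_single_bit(
    data: bytes, stored_sum: bytes
) -> Optional[Tuple[str, int, bytes]]:
    """Return (where, bit_index, data_to_use), or None if no 1-bit fix exists.

    where is "clean", "checksum" (the stored checksum was off by one bit)
    or "data" (the data block was off by one bit). Anything other than
    "clean" should be treated as suspect and marked for later checking.
    """
    actual = checksum(data)
    if actual == stored_sum:
        return ("clean", -1, data)

    # Case 1: the checksum held in the parent block has one flipped bit.
    for bit in range(len(stored_sum) * 8):
        if flip_bit(stored_sum, bit) == actual:
            return ("checksum", bit, data)

    # Case 2: the data block has one flipped bit. One hash per candidate.
    for bit in range(len(data) * 8):
        candidate = flip_bit(data, bit)
        if checksum(candidate) == stored_sum:
            return ("data", bit, candidate)

    # Multi-bit damage or a misdirected write: give up and mark as corrupt.
    return None
```

The data search costs one checksum computation per bit, so a 128 KiB block needs about a million of them (128 * 1024 * 8 = 1,048,576). That is slow but bounded, which seems acceptable for a last-resort recovery pass.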
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss