>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:
    re> The probability of the garbage having both a valid fletcher2
    re> checksum at the proper offset and having the proper sequence
    re> number and having the right log chain link and having the
    re> right block size is considerably lower than the weakness of
    re> fletcher2.

I'm having trouble parsing this.  I think you're confusing a few
different failure modes:

 * ZIL entry is written, but corrupted by the storage, so that, for
   example, an entry should be read from the mirrored ZIL instead.

   + broken fletcher2 detects the storage corruption

     CASE A: Good!

   + broken fletcher2 misses the error, so that corrupted data is
     replayed from ZIL into the proper pool, possibly adding a
     stronger checksum to the corrupt data while writing it.

     CASE B: Bad!

   + broken fletcher2 misinterprets storage corruption as signalling
     the end of the ZIL, and any data in the ZIL after the corrupt
     entry is truncated without even attempting to read the mirror.
     (does this happen?)

     CASE C: Bad!

 * ZIL entry is intentional garbage, either a partially-written entry
   or an old entry, and should be treated as the end of the ZIL

   + broken fletcher2 identifies the partially written entry by a
     checksum mismatch, or the sequence number identifies it as old

     CASE D: Good!

   + broken fletcher2 misidentifies a partially-written entry as
     complete because of a hash collision

     CASE E: Bad!

   + (hypothetical, only applies to non-existent fixed system)
     working fletcher2 or broken-good-enough fletcher4 misidentifies
     a partially-written entry as complete because of a hash
     collision

     CASE F: Bad!

If I read your sentence carefully and try to match it with this chart,
it seems like you're saying P(CASE F) << P(CASE E), which seems like
an argument for fixing the checksum.  While you don't say so, I
presume from your other posts you're trying to make a case for doing
nothing, so I'm confused.

I was mostly thinking about CASE B though.  It seems like the special
way the ZIL works has nothing to do with CASE B: if you send data
through the ZIL to a sha256 pool, it can be written to ZIL under
broken-fletcher2, corrupted by the storage, and then read in and
played back corrupt but covered with a sha256 checksum to the pool
proper.  AFAICT your relative-probability sentence has nothing to do
with CASE B.

    re> Unfortunately, the ZIL is also latency sensitive, so the
    re> performance case gets stronger

The performance case advocating what?  not fixing the broken checksum?

    re> while the additional error checking already boosts the
    re> dependability case.

what additional error checking?

Isn't the whole specialness of the ZIL that the checksum is needed in
normal operation, absent storage subsystem corruption, as I originally
said?  It seems like the checksum's strength is more important here,
not less.
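
As a rough illustration only (this is not ZFS code), the failure-mode chart above can be written as a small decision table. The names `Origin`, `classify`, and `retries_mirror` are made up for this sketch, and `retries_mirror` encodes the open question behind CASE C: whether a detected mismatch leads to reading the mirror or to truncating the log. The sketch leaves out the sequence-number check mentioned under CASE D.

```python
from enum import Enum


class Origin(Enum):
    """Why the bytes read from the ZIL are not a clean entry (hypothetical names)."""
    STORAGE_CORRUPTION = "valid entry damaged by the storage"
    INTENTIONAL_GARBAGE = "partial or stale entry marking the end of the log"


def classify(origin, checksum_matches, retries_mirror=True):
    """Map one read of a ZIL block onto cases A-E from the chart.

    retries_mirror is an assumption of this sketch: True means a detected
    mismatch leads to reading the mirrored copy (CASE A), False means the
    log is truncated at that point instead (CASE C).
    """
    if origin is Origin.STORAGE_CORRUPTION:
        if checksum_matches:
            return "B"  # corruption missed, bad data replayed into the pool
        return "A" if retries_mirror else "C"
    # Intentional garbage: a mismatch is the expected end-of-log signal.
    # CASE F is the same branch as E, only with a stronger checksum.
    return "E" if checksum_matches else "D"


GOOD = {"A", "D"}

for origin in Origin:
    for matches in (False, True):
        case = classify(origin, matches)
        verdict = "Good" if case in GOOD else "Bad"
        print(f"{origin.name:20} checksum_matches={matches!s:5} -> CASE {case}: {verdict}")
```

Laid out this way, CASE B depends only on the checksum matching corrupted data, and none of the ZIL-specific checks (sequence number, chain link, block size) appear on that branch.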
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss