>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:

    re> The probability of the garbage having both a valid fletcher2
    re> checksum at the proper offset and having the proper sequence
    re> number and having the right log chain link and having the
    re> right block size is considerably lower than the weakness of
    re> fletcher2.

I'm having trouble parsing this.  I think you're confusing a few 
different failure modes:

 * ZIL entry is written, but corrupted by the storage, so that, for
   example, an entry should be read from the mirrored ZIL instead.

   + broken fletcher2 detects the storage corruption
     CASE A: Good!

   + broken fletcher2 misses the error, so that corrupted data is
     replayed from ZIL into the proper pool, possibly adding a
     stronger checksum to the corrupt data while writing it.
     CASE B: Bad!

   + broken fletcher2 misinterprets storage corruption as signalling
     the end of the ZIL, and any data in the ZIL after the corrupt
     entry is truncated without even attempting to read the mirror.
     (does this happen?)
     CASE C: Bad!

 * ZIL entry is intentional garbage, either a partially-written entry
   or an old entry, and should be treated as the end of the ZIL

   + broken fletcher2 identifies the partially written entry by a
     checksum mismatch, or the sequence number identifies it as old
     CASE D: Good!

   + broken fletcher2 misidentifies a partially-written entry as
     complete because of a hash collision
     CASE E: Bad!

   + (hypothetical, only applies to non-existent fixed system) working
     fletcher2 or broken-good-enough fletcher4 misidentifies a
     partially-written entry as complete because of a hash collision
     CASE F: Bad!

If I read your sentence carefully and try to match it with this chart,
it seems like you're saying P(CASE F) << P(CASE E), which seems like
an argument for fixing the checksum.  While you don't say so, I
presume from your other posts you're trying to make a case for doing
nothing, so I'm confused.

I was mostly thinking about CASE B though.  It seems like the special
way the ZIL works has nothing to do with CASE B: if you send data
through the ZIL to a sha256 pool, it can be written to ZIL under
broken-fletcher2, corrupted by the storage, and then read in and
played back corrupt but covered with a sha256 checksum to the pool
proper.  AFAICT your relative-probability sentence has nothing to do
with CASE B.
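
To make CASE B concrete, here is a rough Python sketch. It is my own model, not the ZFS source: fletcher2_sketch and flip_top_bit are hypothetical names, and the checksum is assumed to be two interleaved lanes of 64-bit running sums that wrap modulo 2**64. Under that assumption, flipping bit 63 in two suitably placed words of the same lane cancels out in every accumulator, so the "ZIL" checksum still matches and the corrupt buffer then gets a perfectly valid sha256.

```python
import hashlib
import struct

MASK64 = (1 << 64) - 1


def fletcher2_sketch(data: bytes) -> tuple:
    """Assumed model: two interleaved lanes of 64-bit sums, wrapping mod 2**64."""
    assert len(data) % 16 == 0, "length must be a multiple of two 64-bit words"
    words = struct.unpack("<%dQ" % (len(data) // 8), data)
    a0 = a1 = b0 = b1 = 0
    for i in range(0, len(words), 2):
        a0 = (a0 + words[i]) & MASK64
        a1 = (a1 + words[i + 1]) & MASK64
        b0 = (b0 + a0) & MASK64
        b1 = (b1 + a1) & MASK64
    return (a0, a1, b0, b1)


def flip_top_bit(data: bytes, word_index: int) -> bytes:
    """Flip bit 63 of one little-endian 64-bit word (hypothetical helper)."""
    buf = bytearray(data)
    buf[word_index * 8 + 7] ^= 0x80  # byte 7 of the word holds bit 63
    return bytes(buf)


# What was written to the log, and the weak checksum stored with it.
original = bytes(range(64))  # eight 64-bit words
log_checksum = fletcher2_sketch(original)

# "Storage corruption": words 0 and 4 sit in the same lane, and their
# weights in the second-order sum add up to an even number, so both
# flips vanish modulo 2**64.
corrupted = flip_top_bit(flip_top_bit(original, 0), 4)

# A single flipped bit, by contrast, changes the first-order sum.
one_flip = flip_top_bit(original, 0)

missed = fletcher2_sketch(corrupted) == log_checksum
caught = fletcher2_sketch(one_flip) != log_checksum

# "Replay": the corrupt data is now blessed with a strong checksum.
pool_checksum = hashlib.sha256(corrupted).hexdigest()

print("double flip missed by weak checksum:", missed)
print("single flip caught by weak checksum:", caught)
print("sha256 stored in the pool for corrupt data:", pool_checksum)
```

The point of the sketch is only that the sha256 on the pool says nothing about what happened to the data while it sat in the log under the weaker checksum.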

    re> Unfortunately, the ZIL is also latency sensitive, so the
    re> performance case gets stronger 

The performance case for what?  For not fixing the broken checksum?

    re> while the additional error checking already boosts the
    re> dependability case.

What additional error checking?

Isn't the whole specialness of the ZIL that the checksum is needed in
normal operation, absent storage subsystem corruption, as I originally
said?  It seems like the checksum's strength is more important here,
not less.


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
