>>>>> "gm" == Gary Mills <mi...@cc.umanitoba.ca> writes:

    gm> There are many different components that could contribute to
    gm> such errors.

yes of course.

    gm> Since only the lower ZFS has data redundancy, only it can
    gm> correct the error.

um, no?

An example already pointed out: kerberized NFS will detect network
errors that sneak past the weak TCP checksum, and resend the data.
This will work even on an unredundant, unchecksummed UFS filesystem to
correct network-induced errors.  There is no need for NFS to
``inform'' UFS so that UFS can use ``redundancy'' to ``correct the
error''.  UFS never hears anything, and doesn't have any redundancy.
NFS resends the data.  done.
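
To make ``weak TCP checksum'' concrete, here is a toy python sketch
(mine, purely illustrative, not any real stack's code) of the 16-bit
ones'-complement Internet checksum.  One of its blind spots:
transposing two 16-bit words leaves the sum unchanged, which is
exactly the sort of damage a stronger end-to-end check catches:

    def inet_checksum(data: bytes) -> int:
        """16-bit ones'-complement sum over 16-bit words (RFC 1071 style)."""
        if len(data) % 2:
            data += b"\x00"
        s = 0
        for i in range(0, len(data), 2):
            s += (data[i] << 8) | data[i + 1]
        while s >> 16:                       # fold the carries back in
            s = (s & 0xFFFF) + (s >> 16)
        return ~s & 0xFFFF

    good    = b"\x12\x34\x56\x78\x9a\xbc"
    swapped = b"\x56\x78\x12\x34\x9a\xbc"    # first two 16-bit words transposed
    assert good != swapped
    assert inet_checksum(good) == inet_checksum(swapped)   # corrupted, same sum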

iSCSI also has application-level CRCs, separately enableable for
headers and data.  not sure what FC has.
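
For what it's worth, the iSCSI digests are CRC32C (RFC 3720),
negotiated separately as HeaderDigest and DataDigest.  A bit-at-a-time
python sketch of that CRC, illustrative only (real initiators and
targets do this with tables or hardware):

    def crc32c(data: bytes) -> int:
        """CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32c(b"123456789") == 0xE3069283   # standard CRC-32C check value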

It doesn't make any sense to me that some higher layer would call back
to the ZFS stack on the bottom, and tell it to twiddle with disks
because there's a problem with the network.

An idea Richard brought up months ago was ``protection domains'':
that it might be good to expose ZFS checksums to higher levels, to
stretch a single protection domain as far upward in the stack as
possible.

Application-level checksums also form a single protection domain, for
_reading_.  Suppose corruption happens in RAM or on the network (where
the ZFS backing store cannot detect it), while reading a gzip file on
an NFS client.  gunzip will warn you, because the gzip trailer
carries a CRC-32 of the uncompressed data.  This is end-to-end, and
will warn you just as surely as a hypothetical end-to-end
networkified-ZFS.  The problem: there's no way for gzip to ``retry''.
You can run gunzip again, but it will just fail again and again
because the file with network-induced errors is cached on the NFS
client.  It's the ``cached badness'' problem Richard alluded to.  You
would have to reboot the NFS client to clear its read cache, then try
gunzip again.  This is probably good enough in practice, but it sounds
like there's room for improvement.
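
A small python illustration of the read-side point.  `read_through_nfs'
is an invented name, and the NFS caching exists only in the comments;
the real content is that gzip's trailer CRC detects the damage but
offers no way to re-fetch:

    import gzip

    def read_through_nfs(path):
        # invented helper; pretend `path' lives on an NFS mount
        with open(path, "rb") as f:
            raw = f.read()                 # may come straight from the client cache
        try:
            return gzip.decompress(raw)    # verifies the CRC-32 in the gzip trailer
        except gzip.BadGzipFile:           # python 3.8+; an OSError subclass
            # detection is end-to-end, recovery is not: running this again
            # just re-reads the same cached-bad bytes from the NFS client.
            raise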

It's irrelevant in this scenario that the lower ZFS has
``redundancy''.  All you have to do to fix the problem is resend the
read over the network.  What would be nice to have, that we don't
have, is a way of keeping ZFS block checksums attached to the data as
it travels over the network until it reaches the
something-like-an-NFS-client.  Each part of the stack that caches data
could be trained either to (1) validate ZFS block checksums, or (2)
obey ``read no-cache'' commands passed down from the layer above.  In
the application-level gzip example, gzip has no way of doing (2), so
extending the protection domain upward rather than pushing
cache-flushing obedience downward seems more practical.
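
Here is a purely hypothetical sketch of option (1), the checksum
travelling with the data all the way into the client's cache.  Every
name in it (BlockReply, client_read, resend, the choice of SHA-256) is
invented for illustration; it is not any real NFS or ZFS interface:

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class BlockReply:                  # invented wire format
        offset: int
        data: bytes
        cksum: bytes                   # e.g. a SHA-256 ZFS already keeps per block

    _client_cache = {}                 # stand-in for the NFS client's read cache

    def client_read(reply, resend, tries=3):
        for _ in range(tries):
            if hashlib.sha256(reply.data).digest() == reply.cksum:
                _client_cache[reply.offset] = reply.data   # cache only verified data
                return reply.data
            reply = resend(reply.offset)   # damaged in RAM or on the wire: re-fetch
        raise IOError("no block ever verified; nothing bad was cached")

The point of the sketch is only that nothing enters the cache until it
verifies, so the ``cached badness'' problem above cannot arise.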

For writing, application-level checksums do NOT work at all, because
you would write corrupt data to the disk and notice it only later,
when you read it back and there's nothing you can do to fix it.  ZFS
redundancy will not help you here either, because you write corrupt
data redundantly!  With a single protection domain for writing, the
write would arrive at ZFS along with a never-regenerated checksum
wrapper-seal attached to it by the something-like-an-NFS-client.  Just
before ZFS sends the write to the disk driver, ZFS would crack the
protection domain open, validate the checksum, reblock the write, and
send it to disk with a new checksum.  (So ``single protection
domain'' is really a single domain for reads, and two protection
domains for writes.)  If the checksum does not match, ZFS must
convince the writing client to resend.  In the write direction I
think cached bad data will be less of a problem.
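
Again purely hypothetical, a sketch of the write side, with the
client's seal cracked open just before ZFS reblocks and re-checksums
for the platters.  All names are invented:

    import hashlib

    def client_write(data: bytes):
        # the something-like-an-NFS-client seals the write; the seal is never
        # regenerated between here and ZFS.
        return data, hashlib.sha256(data).digest()

    def zfs_accept_write(data, seal, reblock, write_block):
        # end of protection domain 1: crack the seal open and validate it.
        if hashlib.sha256(data).digest() != seal:
            raise IOError("seal mismatch; the client must resend")
        # start of protection domain 2: ZFS's own per-block checksums.
        for block in reblock(data):
            write_block(block, hashlib.sha256(block).digest())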

I think a single protection domain, rather than the current best
obtainable (sliced domains where the slices butt up against each
other as closely as possible), is an idea with merit.  but it doesn't
have anything whatsoever to do with the fact that ZFS stores things
redundantly on the platters.  The whole thing would have just as much
merit, and would fix the new problem classes it addresses just as
often, for single-disk vdevs as for redundant vdevs.

    gm> Of course, if something in the data path consistently corrupts
    gm> the data regardless of its origin, it won't be able to correct
    gm> the error.

TCP does this all the time, right?  see, watch this:  +++ATH0

:)

that aside, your idea of ``the error'' seems too general, like the
annoying marketing slicks with the ``healing'' and ``correcting''
stuff.  stored, transmitted, and cached errors are relevantly
different, which also means corruption in the read and write
directions is different.
