>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:

    re> I really don't know how to please you.

dd from the raw device instead of through ZFS would be better.  If you
could show that you can write data to a sector, and read back
different data, without getting an error, over and over, I'd be
totally stunned.
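
Something like the following is what I have in mind (the device name
and block offset are placeholders, and the write clobbers whatever is
at that offset, so point it at a scratch disk):

  # write a known 512-byte pattern to one block of the raw device,
  # then read it back through the raw device and compare
  dd if=/dev/urandom of=/tmp/pattern bs=512 count=1
  dd if=/tmp/pattern of=/dev/rdsk/c0t1d0s2 bs=512 seek=1000000 count=1
  dd if=/dev/rdsk/c0t1d0s2 of=/tmp/readback bs=512 skip=1000000 count=1
  # a mismatch here, with both dd's exiting cleanly, is silent corruption
  cmp /tmp/pattern /tmp/readback

Run that in a loop: a drive doing what you describe would make cmp
fail while both dd's exit cleanly.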

The NetApp paper was different from your test in many ways that make
their claim that ``all drives silently corrupt data sometimes'' more
convincing than your claim that you have ``one drive which silently
corrupts data always and never returns UNC'':

  * not a desktop.  The circumstances were more tightly controlled,
    and their drive population was installed in a consistent,
    repeatable way

  * their checksum measurement was better than ZFS's, breaking the
    errors up into three buckets instead of one; their filesystem is
    more mature; and their filesystem is not already known to count
    CKSUM errors for circumstances other than silent corruption, all
    of which argues the checksums are less likely to come from
    software bugs

  * they make statistical arguments that at least some of the errors
    are really coming from the drives by showing they have spatial
    locality w.r.t. the LBA on the drive, and are correlated with
    drive age and impending drive failure.

The paper was less convincing in one way:

  * their drives are using nonstandard firmware

    re> Anyone who has been around for a while will have similar
    re> anecdotes.

yeah, you'd think, but my similar anecdote is that (a) I can get UNC's
repeatably on a specific bad sector, persisting either forever or
until I write new data to that sector with dd, and I do get them on at
least 10% of my drives per year, and (b) I get CKSUM errors from ZFS
all the time with my iSCSI ghetto-SAN and with an IDE/FireWire mirror,
often from things I can specifically trace back to
not-a-drive-failure, but so far never from something I can trace back
with certainty to silent corruption by the disk drive.
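
For contrast, the UNC case in (a) looks like this (again, device and
LBA are made up for illustration):

  # reading the bad sector returns a hard I/O error, every single time
  dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=512 skip=123456 count=1
  # rewriting it is what makes the UNC's stop, presumably because the
  # drive rewrites the sector in place or remaps it
  dd if=/dev/zero of=/dev/rdsk/c0t1d0s2 bs=512 seek=123456 count=1

which is the opposite of silent: the drive says loudly that it
couldn't give the data back.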

I don't doubt that it happens, but CKSUM isn't a way to spot it.  ZFS
may give me a way to stop it, but it doesn't give me an accurate way
to measure/notice it.

    re> Indeed.  Intuitively, the AFR and population is more easily
    re> grokked by the masses.

It has nothing to do with the masses.  There's an error in your math:
it's not right under any circumstance.

Your point that a 100-drive population has high odds of seeing silent
corruption within a year isn't diminished by the correction, but it
would be nice if you would own up to the statistics mistake, since
we're taking you at your word on a lot of other statistics.
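
To spell out the shape of the calculation (not the paper's numbers;
p here is pulled out of the air): the quantity in question is the
probability that at least one drive out of N silently corrupts
something in a year.  With a per-drive probability p, that's
1 - (1-p)^N; multiplying N*p only approximates it when N*p is small.

  awk 'BEGIN { p = 0.01; n = 100;
       printf "n*p = %.2f   1-(1-p)^n = %.2f\n", n*p, 1 - (1-p)^n }'

With p at 1% the naive product already claims certainty while the
real figure is about 0.63.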

    >> so, it ends up being a freeze.

    re> Untrue.  There are disks which will retry forever.

I don't understand.  ZFS freezes until the disk stops retrying and
returns an error.  Because some disks never stop retrying, never
return an error, and just lock up until they're power-cycled, it's
untrue that ZFS freezes?  I think either you or I have lost the thread
of the argument in our reply-chain banter.

    re> please file bugs.

OK, I filed the NFS bug, but unfortunately I don't have output to cut
and paste into it.  Glad to see the 'zpool status' bug is there
already and includes the point that lots of other things are probably
hanging which shouldn't be.
