>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
    >> If you really mean there are devices out there which never
    >> return error codes, and always silently return bad data, please
    >> tell us which one and the story of when you encountered it,

    re> I blogged about one such case.
    re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

    re> However, I'm not inclined to publically chastise the vendor or
    re> device model.  It is a major vendor and a popular
    re> device. 'nuff said.

It's not really enough for me, and in any case the example doesn't
match what we were looking for: a device which ``never returns error
codes, always silently returns bad data.''  I asked for this because
you said ``However, not all devices return error codes which indicate
unrecoverable reads,'' which I think is wrong.  Rather, most devices
sometimes don't, not some devices always don't.

Your experience doesn't say anything about this drive's inability to
return UNC errors.  It says you suspect it of silently returning bad
data, once, but it doesn't even clearly implicate the device that
once: the corruption could have come from cabling, the driver, the
power supply, or a ZFS bug when the block was written.  I was hoping
for a device in your ``bad stack'' which does it over and over.

Remember, I'm not arguing ZFS checksums are worthless---I think
they're great.  I'm arguing with your original statement that ZFS is
the only software RAID which deals with the dominant error you find in
your testing, unrecoverable reads.  This is untrue!

    re> This number should scare the *%^ out of you.  It basically
    re> means that no data redundancy is a recipe for disaster.

Yeah, but that 9.5% number alone isn't an argument for ZFS over other
software LVMs.

    re> 0.466%/yr is a per-disk rate.  If you have 10 disks, your
    re> exposure is 4.6% per year.  For 100 disks, 46% per year, etc.

No, you're doing the statistics wrong, and in a really elementary
way: multiplying the per-disk rate by the disk count double-counts
the years in which more than one of the hundred disks fails.  If what
you care about for 100 disks is that no disk experiences an error
within one year, then you need to calculate

  (1 - 0.00466) ^ 100 = 62.7%

so that's a 37% probability that at least one disk silently corrupts
data.  For 10 disks the mistake doesn't make much difference, and
4.6% is about right.
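If you want to check the arithmetic yourself, here's a two-line
sanity check in Python (the 0.00466 rate is just the per-disk number
above; nothing else is assumed):

  # Chance that at least one of n disks silently corrupts data in a
  # year, given the 0.466%/yr per-disk rate quoted above.
  p = 0.00466
  for n in (10, 100):
      print(n, "disks:", round(100 * (1 - (1 - p) ** n), 1), "%")
  # 10 disks: 4.6 %
  # 100 disks: 37.3 %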

I don't dispute ZFS checksums have value, but the point stands that
the reported-error failure mode is 20x more common in NetApp's study
than silent corruption, and other software LVMs do take care of that
more common failure mode.

    re> UNCs don't cause ZFS to freeze as long as failmode != wait or
    re> ZFS manages the data redundancy.

The time between issuing the read and getting the UNC back can be up
to 30 seconds, there are often several unrecoverable sectors in a
row, and lower-level retries multiply that 30-second value further.
So it ends up being a freeze.
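To put illustrative (not measured) numbers on it: a run of, say, 8
bad sectors with 2 lower-level retries each, at 30 seconds apiece, is

  8 sectors x 2 retries x 30 s = 480 s = 8 minutes

of a stalled application, which any user will experience as a freeze.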

To fix it, ZFS needs to dispatch read requests for redundant data if
the driver doesn't reply quickly.  ``Quickly'' can be ambiguous, but
the whole point of FMD was supposed to be that complicated statistics
could be collected at various levels to identify things even more
subtle than READ and CKSUM errors, like drives working at 1/10th the
speed they should be.  Yet right now we can't even flag a drive
taking 30 seconds to read a sector: ZFS is still ``patiently
waiting''.  And now that FMD is supposedly integrated, instead of a
discussion of what knobs and responses there are, you're passing the
buck to the drivers and their haphazard, nonuniform exception state
machines.  The best answer isn't changing drivers to make the drive
time out in 15 seconds instead.  It's to send the read to other disks
quickly using a very simple state machine, and to start actually
using FMD and a complicated state machine to generate
suspicion-events for slow disks that aren't returning errors.
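To make the ``very simple state machine'' concrete, here's a minimal
sketch in Python of the policy I mean.  This is not ZFS code; the
device model, the 1-second deadline, and all the names are made up
for illustration:

  # Sketch only: read from the preferred mirror side, and if it
  # hasn't answered within a short deadline, race the other sides
  # instead of waiting out the drive's 30-second UNC timeout.  A real
  # version would also emit an FMD-style "slow device" event at the
  # deadline.
  import concurrent.futures as cf
  import time

  DEADLINE = 1.0  # seconds before we get suspicious and fan out

  def read_block(dev, lba):
      time.sleep(dev["latency"])      # simulated device latency
      return dev["name"], dev["data"][lba]

  def redundant_read(mirrors, lba):
      pool = cf.ThreadPoolExecutor(max_workers=len(mirrors))
      futures = [pool.submit(read_block, mirrors[0], lba)]
      done, _ = cf.wait(futures, timeout=DEADLINE)
      if not done:                    # preferred side is too slow
          futures += [pool.submit(read_block, m, lba)
                      for m in mirrors[1:]]
      result = next(cf.as_completed(futures)).result()
      pool.shutdown(wait=False)       # stalled read finishes in background
      return result

  mirrors = [
      {"name": "slow-disk", "latency": 30.0, "data": {0: b"A"}},
      {"name": "good-disk", "latency": 0.01, "data": {0: b"A"}},
  ]
  print(redundant_read(mirrors, 0))   # answers in ~1 s, not ~30 s

The point of the sketch is that the fast path needs no statistics at
all, just a deadline and a second dispatch; the clever FMD analysis
can happen out of band.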

Also, the driver and mid-layer need to cooperate with these
hypothetical ZFS-layer timeouts, doing their best not to stall the
SATA chip, or the channel if there's a port multiplier, or to freeze
the whole SATA stack including other chips, just because one disk has
an outstanding READ command waiting to get a UNC back.

In some sense the disk drivers and ZFS have different goals.  The goal
of drivers should be to keep marginal disk/cabling/... subsystems
online as aggressively as possible, while the goal of ZFS should be to
notice and work around slightly-failing devices as soon as possible.
I thought the point of putting off reasonable exception handling for
two years while waiting for FMD was to be able to pursue both goals
simultaneously without pressure to compromise one in favor of the
other.

In addition (I'm repeating myself like crazy at this point), ZFS
tools used for all pools, like 'zpool status', need to not freeze
when a single pool, or a single device within a pool, is unavailable
or slow; this expectation has nothing to do with failmode on the
failing pool.  And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted,
again regardless of failmode.

Neither is the case now, and neither is a driver fix.  But even
beyond fixing these basic problems there's vast room for improvement,
to deliver something better than LVM2 and closer to NetApp, rather
than just catching up.
