>>>>> "tt" == Toby Thain <[EMAIL PROTECTED]> writes:

    tt> Why would it be assumed to be a bug in Solaris? Seems more
    tt> likely on balance to be a problem in the error reporting path
    tt> or a controller/ firmware weakness.

It's not really an assumption.  It's been discussed in here a lot, and
we know why it's happening.  It's just a case of ``it's a feature not
a bug'' combined with ``somebody else's problem.''

The error-reporting path you mention is inside Solaris, so I have a
little trouble decoding your statement.

I wish drives had a failure-aware QoS with a split queue for
aggressive-retry cdbs and deadline cdbs.  This would make the
B_FAILFAST primitive the Solaris developers seem to believe in
actually mean something.

Solaris is supposed to have a B_FAILFAST option for block I/O that ZFS
could start using to capture vdev-level knowledge like ``don't try too
hard to read this block from one device, because we can get it faster
by asking another device.''  In the real world B_FAILFAST is IMO quite
silly: not exactly useless, but at best deceptive to the higher-layer
developer, because even IF the drive could be told to
fail faster than 30 seconds by some future fancier sd driver, there
would still be some fail-slow cdbs hitting the drive, and the
two can't be parallelized.  Sending a fail-slow cdb to a drive
freezes the drive for up to 30 seconds * <n>, where <n> is the
multiplier of some cargo-cult state machine built into the host
adapter driver involving ``bus resets'' and other such stuff.  All the
B_FAILFAST cdbs queued behind the fail-slow may as well forget
the flag because the drive's busy with the slow cdb.  If you
have a very few of these retryable cdbs peppered into your
transaction stream, which are expected to take 10 - 100ms each but
actually take one or two MINUTES each, the drive will be so slow it'd
be more expressive to mark it dead.  What will probably happen in
$REALITY is that the sysadmin will declare his machine ``frozen
without a panic message'' and reboot it, losing any write-cached data
that, were it not for this idiocy, could have been committed to other
drives in a redundant vdev, and taking down the rest of the system,
which has nothing to do with this stuck zpool.
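
For reference, here's roughly what using the flag looks like at the
block layer.  A minimal sketch assuming the documented buf(9S)/LDI
kernel interfaces; the surrounding function and its (lack of) error
handling are my own invention, not ZFS source:

    #include <sys/types.h>
    #include <sys/kmem.h>
    #include <sys/buf.h>
    #include <sys/sunldi.h>

    /*
     * Minimal sketch: issue one read with B_FAILFAST set.  The flag is a
     * real buf(9S) flag; everything else here is illustrative only.
     */
    static int
    failfast_read(ldi_handle_t lh, void *data, size_t len, daddr_t blkno)
    {
        struct buf *bp = getrbuf(KM_SLEEP);
        int err;

        bp->b_flags = B_READ | B_BUSY | B_FAILFAST; /* ask sd to give up early */
        bp->b_un.b_addr = data;
        bp->b_bcount = len;
        bp->b_lblkno = blkno;

        (void) ldi_strategy(lh, bp);
        err = biowait(bp);   /* still serialized behind any fail-slow cdb */
        freerbuf(bp);
        return (err);
    }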

However, it's inappropriate for a driver to actually report ``drive
dead'' in this scenario, because the drive is NOT dead.  The
drive-failure-statistic papers posted in here say that drives usually
fail with a bunch of contiguous or clumped-together unreadable
sectors.  You can still get most of the data off them with dd_rescue
or 'dd if=baddrive of=gooddrive bs=512 conv=noerror,sync', if you wait
about a week.  About four hours of that week is spent copying data and
the rest spent aggressively ``retrying''.

An instantaneous manual command, ``I insist this drive is failed.
Mark it failed instantly, without leaving me stuck in bogus state
machines for two minutes or two hours,'' would be a huge improvement,
but I think graceful automatic behavior is not too much to wish for
because this isn't some strange circumstance.  This is *the way drives
usually fail*.

SCSI drives have all kinds of retry-tuning in the ``mode pages'' in a
standardized format.  Even 5.25" 10MB SCSI drives had these pages.
One of NetApp's papers said they don't even let their SCSI/FC drives
do their own bad-block reallocation.  They do all that in host
software.  So there are a lot of secret tuning knobs, and they're AIUI
largely standardized across manufacturers and through the years.  ATA
drives, AIUI, don't have the pages, but some WD gamer drives have some
goofy DOS RAID-tuner tool.
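
(The pages I mean include the Read-Write Error Recovery mode page,
page code 0x01, from the SCSI block commands spec.  Sketched below as
a C struct purely for illustration; the field layout comes from the
standard, the struct itself is mine, not from any driver header.)

    #include <stdint.h>

    /* SBC Read-Write Error Recovery mode page (page code 0x01). */
    struct rw_error_recovery_page {
        uint8_t  page_code;                /* 0x01, plus the PS bit */
        uint8_t  page_length;              /* 0x0a */
        uint8_t  flags;                    /* AWRE, ARRE, TB, RC, EER, PER, DTE, DCR */
        uint8_t  read_retry_count;         /* how hard to retry a read */
        uint8_t  correction_span;
        uint8_t  head_offset_count;
        uint8_t  data_strobe_offset_count;
        uint8_t  reserved1;
        uint8_t  write_retry_count;        /* how hard to retry a write */
        uint8_t  reserved2;
        uint16_t recovery_time_limit;      /* max recovery time, ms (big-endian on the wire) */
    };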

But even what SCSI drives offer isn't enough to provide the
architecture ZFS seems to dream of.  What's really needed to provide
ZFS developer's expectations of B_FAILFAST is QoS inside the drive
firmware.  Drives need to have split queues, with an aggressive-retry
queue and a deadline-service queue.  While retrying a stuck
cdb in the aggressive queue, they must keep servicing the
deadline queue.  I've never heard of anything like this existing in a
real drive.  I think it's beyond the programming skill of an
electrical engineer, and it may be too constraining for them because
drives seem to do spastic head-seeks and sometimes partway spin
themselves down and back up during a retry cycle.
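
To be concrete about the wish (and only the wish), here's a purely
hypothetical firmware sketch of that split-queue QoS.  No real drive
does anything like this, and every name below is invented:

    /*
     * Hypothetical drive-firmware sketch of split-queue QoS.  Nothing
     * here exists in a real drive; all types and helpers are invented.
     */
    struct cdbq {
        struct cdb *head;                  /* simple queue of pending cdbs */
    };

    struct split_queues {
        struct cdbq aggressive;            /* retry until error recovery gives up */
        struct cdbq deadline;              /* B_FAILFAST-style: one attempt, then fail */
    };

    static void
    firmware_service_loop(struct split_queues *q)
    {
        struct cdb *c;

        for (;;) {
            /* The deadline queue is always drained first, one attempt each. */
            while ((c = cdbq_take(&q->deadline)) != NULL)
                cdb_complete(c, media_attempt(c));

            /*
             * A stuck cdb on the aggressive queue gets one bounded retry
             * slice, then we check the deadline queue again, instead of
             * freezing the whole drive for 30 seconds times <n>.
             */
            if ((c = cdbq_peek(&q->aggressive)) != NULL &&
                media_attempt(c) == 0)
                cdb_complete(cdbq_take(&q->aggressive), 0);
        }
    }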

ZFS still seems to have this taxonomic-arcania view of drives, under
which either the drive is ``failing operations'' or the drive itself
is ``failed''.  It belongs to the driver's realm to decide whether
it's the whole drive or just the ``operation'' that is failing,
because that's how the square peg fits snugly into its square hole.

One of the NetApp papers mentions they have proprietary statistical
heuristics for when to ignore a drive for a little while and use
redundant drives instead, and when to fail a drive and call
autosupport.  And they log drive behavior really explicitly and
unambiguously separate from ``controller'' failure, which is why
they're able to write the paper at all.  I'm in favour of heuristics,
but most of the ZFS developers seem to think the issue is simply that
every driver in Solaris fails to live up to its promised standards.

I still think the ZFS approach is wrong and the Netapp approach right.

 * I think if SATA is to be supported, then the fantasy that drives
   can be configured to return failure early should be cast off
   forever.

 * I don't think ZFS will match the availability behavior of Netapp or
   even of Areca/PERC/RAID-on-a-card until it includes vdev-level
   handling of slow devices.  This means vdev-level timers inside ZFS,
   above the block driver level, driving error-recovery decisions.

 * I think a pool read/write that takes longer than it does on other
   drives in a redundant vdev, or longer than other cdbs took on the
   same drive, should be re-dispatched to fetch redundant data.  I
   think this should happen with really tight tolerance and should be
   stateful, such that a mirror could have a remote iSCSI component
   and a local component, and only the local component would be used
   for reads.  (There's a rough sketch of this after the list.)

 * If a drive is taking 30 seconds to perform every cdb, but is still
   present and the driver refuses to mark it bad, ZFS needs to be able
   to mark it bad on its own, so that it no longer blocks synchronous
   writes, and so hot-spare replacement can start to get the pool back
   up to policy's redundancy expectation.  If we're designing systems
   with multiple controllers to avoid a ``single point of failure''
   then it's not okay to punt and say, well this isn't our problem
   because we're waiting patiently on the controller to do something
   sane.  The short-term decisions require vdev-level knowledge which
   doesn't exist inside the driver, but arguably marking drives failed
   does not require vdev-level knowledge and could be done in the
   driver rather than ZFS.  I still think this is wrong.  Based on our
   experience so far with controller drivers, they aren't very good,
   and controller chips are rather short-lived so they're never going
   to be very good, and the drivers are often proprietary so the work
   has to be redone inside ZFS just to have a bit of software freedom
   again.  A practical modern storage design is robust against bugs in
   the controller driver, bugs exercised by combinations of drive
   firmware and controllers or by doing strange things with cables.

   If this won't go inside ZFS, then people will reasonably want some
   pseudodevice like an SVM soft partition or a multipath layer to
   protect them from failing controller drivers.  They might want a
   way to manually, and instantly, without waiting on stupid state
   machines, mark the device failed and crack ZFS and the controller
   driver apart so they're not locked in some deadly embrace of
   failure that requires rebooting.  If we agree there is a need for
   multipath to a single device, why can we not agree that we expect
   protection from failures of a controller or its driver even when we
   don't have multipath but have laid out our vdevs with enough
   redundancy to tolerate controller failure?  

   In practice, I think drives that become really slow instead of
   failing outright are the real problem, but bringing in multipath and
   controller redundancy shows what is, to my view, the taxonomic
   hypocrisy of wanting to keep this out of ZFS.

 * Management commands like export, status, detach, offline, replace,
   must either (a) never block waiting for I/O: use kernel state
   only, do disk writes asynchronously, and report failure through
   inspection commands that the user polls, like 'zpool status'.  This
   world is possible---we don't expect the mirror to be in sync before
   'zpool attach' returns, though we could.  Or (b) sleep
   interruptibly, and include a more drastic version that doesn't
   block, so normally you type 'zpool offline' and when the prompt
   returns without error, you know that all your labels are updated.
   But if you don't get a prompt back, you can ^C and 
   'zpool offline -f'.

   Not being able to get rid of a drive without access to the drive
   you want to get rid of is as ridiculous as ``keyboard not found.
   Press F1 to continue.''  Even square-peg square-hole taxonomists
   ought to agree on this one.  And I don't like getting ``no valid
   replicas'' errors in situations that ZFS will tolerate if you force
   it by rebooting or by hot-unplugging the device---there should be a
   clear delineation of which pool states are acceptable and which are
   not, and I should be able to explore all the acceptable states by
   moving the pool through them manually.  If I can't 'zpool offline'
   a device, and _instantly_ if I insist on it, then the pool should
   not mount at boot without that device.  I shouldn't have to involve
   rebooting in my testing, or else it feels like Fisher-Price
   wallpapered crap.  I sometimes run my dishwasher with the door open
   for a half second when I become suspicious of it.  The sky doesn't
   fall.  But these days it seems like people believe any interlock
   anywhere, even a preposterous invented one, is as sacred as the one
   on a microwave or a UV oven.  Oh, and when possible, ZFS should not
   forget its knowledge of inconsistencies across a reboot, and should
   for example continue interrupted resilvers like SVM did.
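
To make the vdev-level timer idea from the list above concrete (the
third bullet), here's roughly what the re-dispatch could look like.
All names are hypothetical; nothing here is from the actual ZFS
pipeline:

    /*
     * Hypothetical vdev-level deadline redirect, above the block drivers.
     * mirror_child_t, blkreq_t, issue_read(), wait_for(), wait_any() and
     * pick_preferred() are all invented for illustration.
     */
    #define SLOW_FACTOR 4   /* "tight tolerance": 4x the child's recent latency */

    static int
    mirror_read(mirror_child_t *kid, int nkids, blkreq_t *req)
    {
        int i = pick_preferred(kid, nkids);        /* e.g. local before iSCSI */
        uint64_t deadline_ns = kid[i].recent_lat_ns * SLOW_FACTOR;

        issue_read(&kid[i], req);
        if (wait_for(req, deadline_ns) == 0)
            return (0);                            /* came back in time */

        /*
         * The first child is slow.  Don't wait on the driver's state
         * machine; ask a redundant child for the same data and take
         * whichever copy arrives first.  The slow child stays outstanding
         * and just feeds the latency stats behind pick_preferred().
         */
        issue_read(&kid[(i + 1) % nkids], req);
        return (wait_any(req));
    }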
