>>>>> "jcm" == James C McPherson <[EMAIL PROTECTED]> writes:
>>>>> "thp" == Todd H Poole <[EMAIL PROTECTED]> writes:
>>>>> "mh" == Matt Harrison <[EMAIL PROTECTED]> writes:
>>>>> "js" == John Sonnenschein <[EMAIL PROTECTED]> writes:
>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>> "cg" == Carson Gaspar <[EMAIL PROTECTED]> writes:

   jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
   jcm> above, IDE is not designed to be able to cope with [unplugging
   jcm> a cable]

It shouldn't have to be designed for it, if there's controller
redundancy.  On Linux, one drive per IDE bus (not using any ``slave''
drives) seems like it should be enough for any electrical issue, but
in my experience it's not quite good enough when there are two PATA
buses per chip.  One hard drive per chip seems to be mostly okay.
In this SATA-based case, not even that much separation was necessary
for Linux to survive on the same hardware, but I agree with you that
I haven't found that level of robustness with PATA either.

OTOH, if the IDE drivers are written such that a confusing interaction
with one controller chip brings down the whole machine, then I expect
the IDE drivers to do better.  If they don't, why advise people to buy
twice as much hardware ``because, you know, controllers can also fail,
so you should have some controller redundancy''?  That advice is worse
than a waste of money: it's snake oil, a false sense of security.

   jcm> You could start by taking us seriously when we tell you that
   jcm> what you've been doing is not a good idea, and find other ways
   jcm> to simulate drive failures.

Well, you could suggest a method.

Except that the whole point of the story is that Linux, without any
blather about ``green-line'' and ``self-healing,'' without any
concerted platform-wide effort toward availability at all, simply
works more reliably.
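
(For completeness, the kind of software-only ``failure'' I assume they
mean, with made-up pool and device names, looks like this:

   # zpool offline -t tank c2t3d0
   # zpool status tank
   # zpool online tank c2t3d0

but that only exercises ZFS's bookkeeping, not the driver or the
controller, which is exactly the layer the unplug test was probing.)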

   thp> So aside from telling me to "[never] try this sort of thing
   thp> with IDE" does anyone else have any other ideas on how to
   thp> prevent OpenSolaris from locking up whenever an IDE drive is
   thp> abruptly disconnected from a ZFS RAID-Z array?

Yeah, get a Sil3124 card, which will run in native SATA mode and be
more likely to work.  Then redo your test and let us know what
happens.

The not-fully-voiced suggestion to run your ATI SB600 in native/AHCI
mode instead of pci-ide/compatibility mode is probably a bad one
because of bug 6665032: the chip is only reliable in compatibility
mode.  You could trade your ATI board for an nVidia board for about
the same price as the Sil3124 add-on card.  AIUI from the Linux wiki:

 http://ata.wiki.kernel.org/index.php/SATA_hardware_features

the old nVidia chips use the nv_sata driver and the new ones use the
ahci driver, so both of these are different from pci-ide and more
likely to work.  Get an old one (MCP61 or older) and a new one (MCP65
or newer), repeat your test, and let us know what happens.
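
(If you want to confirm which driver actually ended up bound to the
controller before re-running the test, something like

   # prtconf -D | grep -i -e ide -e sata -e ahci
   # cfgadm -al

should show it, I think: in compatibility mode the disks hang off
pci-ide/cmdk, while ahci, nv_sata, and si3124 ports show up as sata
attachment points in cfgadm's listing.)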

If the Sil3124 doesn't work, and nv_sata doesn't work, and AHCI on
newer nVidia doesn't work, then hook the drives up to Linux running
IET (the iSCSI Enterprise Target) on basically any old chip, and mount
them from Solaris using the built-in iSCSI initiator.
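
(Roughly, and with made-up names and addresses, that means on the
Linux box, in /etc/ietd.conf:

   Target iqn.2008-01.org.example:disk1
       Lun 0 Path=/dev/sdb,Type=blockio

and on the Solaris box:

   # iscsiadm add discovery-address 192.168.1.10
   # iscsiadm modify discovery --sendtargets enable
   # devfsadm -i iscsi
   # iscsiadm list target

then build the pool out of the new c*t*d* devices that show up.)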

If you use iSCSI, here is what you will find:

You will get a pause like with NT.  Also, if one of the iSCSI targets
is down, 'zpool status' might hang _every time_ you run it, not just
the first time when the failure is detected; the pool itself will
only hang the first time.  Also, you cannot boot unless all iSCSI
targets are available, but you can continue running if some go away
after booting.

Overall IMHO it's not as good as LVM2, but it's more robust than
plugging the drives into Solaris.  It also gives you the ability to
run smartctl on the drives (by running it natively on Linux) with full
support for all commands, while someone here whom I told to run
smartctl reported that on Solaris 'smartctl -a' worked but 'smartctl
-t' did not.  I still have performance problems with iSCSI.  I'm not
sure yet whether they're unresolvable: there are a lot of tweakables
with iSCSI, like disabling Nagle's algorithm and enabling RED on the
initiator switchport, but first I need to buy faster CPUs for the
targets.
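
(Concretely, from the Linux side the whole command set works, e.g.
with a made-up device name:

   # smartctl -a /dev/sdb
   # smartctl -t short /dev/sdb
   # smartctl -l selftest /dev/sdb

the last one reads back the self-test log once the test finishes.)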

    mh> Dying or dead disks will still normally be able to
    mh> communicate with the driver to some extent, so they are still
    mh> "there".

The dead disks I have which don't spin also don't respond to
IDENTIFY(0), so they don't really communicate with the driver at all.
Now, possibly, *possibly*, they are still responsive right after they
fail and only become unresponsive after the first time they're
rebooted, because I think they load part of their firmware off the
platters.  Also, the ATAPI standard says that drives which are ``still
communicating'' are allowed to take up to 30 seconds to answer each
command, which is probably still too long to freeze a whole system
for.  And still, just because of that ``possibly,'' it doesn't make
sense to replace a tested-working system with a tested-broken system,
not even after someone tells a complicated story trying to convince
you that the broken system is actually secretly working, just
completely impossible to test, so you have to accept it based on
stardust and fantasy.

    js> yanking the drives like that can seriously damage the
    js> drives or your motherboard.

No, it can't.

And if I want a software developer's opinion on what will electrically
damage my machine, I'll be sure to let you know first.

   jcm> If you absolutely must do something like this, then please use
   jcm> what's known as "coordinated hotswap" using the cfgadm(1m)
   jcm> command.

   jcm> Viz:

   jcm> (detect fault in disk c2t3d0, in some way)

   jcm> # cfgadm -c unconfigure c2::dsk/c2t3d0
   jcm> # cfgadm -c disconnect c2::dsk/c2t3d0

So... don't don't DON'T do it, because it's STUPID and it might FRY
YOUR DISK AND MOTHERBOARD, but if you must do it, please warn our
software first?

I shouldn't have to say it, but aside from being absurd, this
warning-command completely defeats the purpose of the test.

   jcm> Yes, but you're running a new operating system, new
   jcm> filesystem...  that's a mountain of difference right in front
   jcm> of you.

So we do agree that Linux's not freezing in the same scenario
indicates the difference is inside that mountain, which, however
large, is composed entirely of SOFTWARE.

    re> The behavior of ZFS to an error reported by an underlying
    re> device driver is tunable by the zpool failmode property.  By
    re> default, it is set to "wait."

I think you like speculation well enough, so long as it's optimistic.

Which is the tunable setting that causes other pools, ones not even
including failed devices, to freeze?

Why is the failmode property involved at all in a pool that still has
enough replicas to keep functioning?
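
(For reference, the knob being discussed is per-pool, something like
this with a made-up pool name:

   # zpool get failmode tank
   # zpool set failmode=continue tank

and the documented values are wait, continue, and panic.  None of them
obviously explains why pools that do NOT contain the failed device
should freeze.)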

    cg> We really need to fix (B). It seems the "easy" fixes are:

    cg> - Configure faster timeouts and fewer retries on redundant
    cg> devices, similar to drive manufacturers' RAID edition
    cg> firmware. This could be via driver config file, or (better)
    cg> automatically via ZFS, similar to write cache behaviour.

    cg> - Propagate timeouts quickly between layers (immediate soft
    cg> fail without retry) or perhaps just to the fault management
    cg> system

It's also important that things unrelated to the failure aren't
frozen.  This was how I heard the ``green line'' marketing campaign
when it was pitched to me, and I found it really compelling because I
felt Linux had too little of this virtue.  However compelling it was,
I just don't find it even slightly acquainted with reality.

I can understand that ``unrelated'' is a tricky concept when the boot
pool is involved, but here is an example where it isn't: I've had
problems where one exported data pool's becoming FAULTED stops NFS
service from all other pools.  The pool that FAULTED contained no
Solaris binaries.

And then there are the 'zpool status' hangs that people keep
discovering.

I think this is a good test in general: configure two
almost-completely independent stacks through the same kernel:


    NFS export           NFS export

    filesystem           filesystem
    pool                 pool

               ZFS/NFS

    driver               driver

    controller           controller

    disks                disks
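
(A concrete way to build those two stacks, with made-up pool and
device names:

   # zpool create left  mirror c1t0d0 c1t1d0
   # zpool create right mirror c2t0d0 c2t1d0
   # zfs set sharenfs=on left
   # zfs set sharenfs=on right

and the question is whether 'zpool status right' and the NFS clients
of ``right'' keep answering promptly while ``left'' is being broken.)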


Simulate whatever you regard as a ``catastrophic'' or ``unplanned'' or
``really stupid'' failure, and see how big the shared region in the
middle can be without affecting the other stack.  Right now, my
experience is that even the stack above does not work.  Maybe mountd
gets
blocked or something, I don't know.  Optimistically, we would of
course like this stack below to remain failure-separate:


    NFS export           NFS export

    filesystem           filesystem
    pool                 pool

               ZFS/NFS

               driver

               controller

    disks                disks


The OP is implying that, on Linux, that stack DOES keep failures
separate.  However, even if ``hot plug'' (or ``hot unplug,'' for
demanding Linux users) is not supported, at least this stack below
should still be failure-independent:


    NFS export           NFS export

    filesystem           filesystem
    pool                 pool

               ZFS/NFS

               driver

    controller           controller

    disks                disks


I suspect it isn't, because the less-demanding stack I started with
isn't failure-independent.  There is probably more than one problem
making these failures spread more widely than they should, but so far
we can't even agree on what we wish were working.

I do think the failures need to be isolated better first, independent
of time.  It's not ``a failure of a drive on the left should propagate
up the stack faster so that the stack on the right unfreezes before
anyone gets too upset.''  The stack on the right shouldn't freeze at
all.
