>>>>> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:

    es> The main problem with exposing tunables like this is that they
    es> have a direct correlation to service actions, and
    es> mis-diagnosing failures costs everybody (admin, companies,
    es> Sun, etc) lots of time and money.  Once you expose such a
    es> tunable, it will be impossible to trust any FMA diagnosis,

Yeah, I tend to agree that the constants shouldn't be tunable, because
I hoped Sun would become a disciplined collection point for the
experience used to set the constants, discipline meaning the constants
are only adjusted in response to bad diagnoses, not ``preference,'' and
in a direction that improves diagnosis for everyone, not just for ``the
site.''

I'm not yet won over to the idea that statistical FMA diagnosis
constants shouldn't exist.  I think drives can't diagnose themselves
for shit, and I think drivers these days are diagnosees, not
diagnosers.  But clearly a confusingly-bad diagnosis is much worse
than a diagnosis that's bad in a simple way.

    es> If I issue a write to both halves of a mirror, should
    es> I return when the first one completes, or when both complete?

Well, if it's not a synchronous write, you return before you've
written either half of the mirror, so it's only an issue for
O_SYNC/ZIL writes, true?

BTW what does ZFS do right now for synchronous writes to mirrors, wait
for all, wait for two, or wait for one?

    es> any such "best effort RAS" is a little dicey because you have
    es> very little visibility into the state of the pool in this
    es> scenario - "is my data protected?" becomes a very difficult
    es> question to answer.

I think it's already difficult.  For example, a pool will say ONLINE
while it's resilvering, won't it?  I might be wrong.  

Take a pool that can only tolerate one failure.  Is the difference
between replacing an ONLINE device (still redundant) and replacing an
OFFLINE device (not redundant until resilvered) captured?  Likewise,
should a pool with a spare in use really be marked DEGRADED both
before the spare resilvers and after?

The answers to these questions matter less than the fact that you have
to think about the answers---what should they be, what are they
now---which means ``is my data protected?'' is already a difficult
question to answer.

Also, there were recently-fixed bugs in the DTL (dirty time log) code.
The status of each device's DTL, even the existence and purpose of the
DTL, isn't well exposed to the admin, yet it is relevant to answering
the ``is my data protected?'' question---indirect means of inspecting
it, like tracking the status of resilvering, seem too wallpapered given
that the bug escaped notice for so long.

I agree with the problem 100% and don't wish to worsen it, just
disagree that it's a new one.

    re> 3 orders of magnitude range for magnetic disk I/Os, 4 orders
    re> of magnitude for power managed disks.

For power management I would argue for a fixed timeout.  The time to
spin up doesn't have anything to do with the io/s you were getting
before the disk spun down.  There's no reason to disguise the constant
we secretly wish for inside some fancy math for deriving it just
because writing down constants feels bad.

Unless, that is, you _know_ the disk is spinning up through some
in-band means and want to compare its spin-up time to recorded
measurements of past spin-ups.
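
A rough sketch in C of the shape of rule I mean, with the constant
written down in plain sight.  Everything here (SPINUP_DEADLINE_SEC,
disk_state_t, and so on) is made up for illustration; it isn't how ZFS
or FMA actually structure this:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Made-up constant: how long a spun-down disk may take to answer
     * its first I/O after waking up.  It has nothing to do with the
     * io/s the disk was doing before it spun down, so just write it
     * down.
     */
    #define SPINUP_DEADLINE_SEC     30.0

    /* hypothetical per-disk state, not a real ZFS structure */
    typedef struct disk_state {
        bool   maybe_spun_down;        /* idle past its power-mgmt timer */
        double oldest_outstanding_sec; /* age of oldest unanswered I/O   */
        double derived_latency_limit;  /* whatever the fancy math says   */
    } disk_state_t;

    /* fixed deadline while (probably) spinning up, statistics otherwise */
    static bool
    disk_io_overdue(const disk_state_t *d)
    {
        double limit = d->maybe_spun_down ?
            SPINUP_DEADLINE_SEC : d->derived_latency_limit;
        return (d->oldest_outstanding_sec > limit);
    }

    int
    main(void)
    {
        disk_state_t d = { true, 45.0, 2.5 };
        printf("overdue: %d\n", disk_io_overdue(&d));
        return (0);
    }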


This is a good case for pointing out there are two sets of rules:

 * 'metaparam -r' rules

   + not invoked at all if there's no redundancy.

   + very complicated

     - involve sets of disks, not one disk: comparison of statistics
       among disks within a vdev (definitely), and comparison of
       individual disks to themselves over time (possibly).

     - complicated output: rules return a set of disks per vdev, not a
       yay-or-nay diagnosis per disk.  And there are two kinds of
       output decision:

       o for n-way mirrors, select anywhere from 1 to n disks.  For
         example, a three-way mirror with two fast local disks and one
         slow remote iSCSI disk should split reads among the two local
         disks (see the first sketch after this list).

         For raidz and raidz2 they can eliminate 0, 1 (or, for raidz2,
         2) disks from the read set.  It's possible to issue all the
         reads and take the first sufficient set to return, as Anton
         suggested, but I imagine 4-device raidz2 vdevs will be common,
         and those could some day perform as well as a 2-device mirror.

       o also, decide when to stop waiting on an existing read and
         re-issue it.  So the decision is not only about future reads;
         it has to cancel already-issued reads, possibly replacing the
         B_FAILFAST mechanism so that there is a second, uncancellable
         round of reads once the first round exhausts all redundancy.

       o that second decision needs to be made thousands of times per
         second without a lot of CPU overhead (see the second sketch
         after this list)

   + small consequence if the rules deliver false positives, just
     reduced performance (the same as with the TCP fast-retransmit
     rules Bill mentioned)

   + large consequence for false negatives (system freeze), so one
     can't really say, ``we won't bother doing it for raidz2 because
     it's too complicated.''  The rules are NOT just about optimizing
     performance.

   + at least partly in kernel


 * diagnosis rules

   + should diagnosis be invoked for single-device vdevs?  Does ZFS
     diagnosis already consider that a device in an unredundant vdev
     should be FAULTED less aggressively (for example, never for CKSUM
     errors)?  This is arguable.

   + diagnosis is strictly per-disk and should compare disks only to
     themselves, or to cultural memory of The Typical Disk in the form
     of untunable constants, never to other disks in the same vdev

   + three possible verdicts per disk:

     - all's good

     - warn the sysadmin about this disk but keep writing to it

     - fault this disk in ZFS.  no further I/O, not even writes, and
       start rebuilding it onto a spare

      Eric points out that false positives are expensive for BOTH the
      warning and the fault verdicts, not just the fault, because even
      a warning can initiate expensive repair procedures and reduce
      trust in FMA diagnoses.

      So there should probably be only two verdicts, good and fault.

     If the statistics are extractable, more aggressive sysadmins can
     devise their own warning rules and competitively try to predict
     the future.  The owners of large clusters might be better at
     crafting warning rules than Sun, but their results won't be
     general.

   + potentially complicated, but might be really simple, like ``an
     I/O takes more than three minutes to complete.''

   + A more complicated but still somewhat simple hypothetical rule:
     ``one I/O hasn't returned completion or failure after 10 minutes,
     OR at least one I/O originally issued to the driver within each of
     three separate four-minute-long buckets in the last 40 minutes
     took 1000 times longer than usual or more than 120 seconds,
     whichever is larger (three slow I/Os in the recent past)'' (see
     the third sketch after this list)

      These might be really bad rules.  My point is that variance, or
      some statistic more complicated than addition and buckets, might
      be good for diagnosing bad disks but isn't necessarily required,
      while for the 'metaparam -r' rules it IS required.

      For diagnosing bad disks, a big bag of traditional-AI rules might
      be better than statistical/machine-learning rules, and will be
      easier for less-sophisticated developers to modify in response to
      experience and future hardware.

      For example: a power-managed disk spinning up should take less
      than x seconds and should not spin down more often than every y
      minutes.  A disconnected SAN fabric should reconnect within z
      seconds, and unannounced outages don't need to be tolerated
      silently, without intervention, more than once per day.  And so
      on.

      It may even be possible to generate negative fault events, like
      ``the disk IS replying, not silent, and it says
      Not-ready-coming-ready, so don't fault it for 1 minute.''  The
      option of creating this kind of hairy mess of special-case,
      layer-violating, codified-tradition rules is the advantage I
      perceived in tolerating the otherwise disgusting bolt-on
      shared-lib-spaghetti mess that is FMA.

     But for the 'metaparam -r' rules OTOH, variance/machine-learning
     is probably the only approach.

   + rules are in userland, can be more expensive CPU-wise, and return
     feedback to the kernel only a couple times a minute, not per-I/O
     like the 'metaparam -r' reissue rules.
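
Since the list above hand-waves a lot, here are three sketches.  All of
them are hypothetical: the structure names, constants, and functions
are invented to show the shape of a rule, not to describe existing ZFS
or FMA code, and they're written as ordinary user-level C so they
compile and run on their own.

The first sketch is the mirror read-set decision: split reads among the
children whose smoothed latency is close to the fastest child's, and
leave the slow remote iSCSI child out of the read set.  READ_SET_MULT
is an invented constant:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_CHILDREN    8
    #define READ_SET_MULT   4.0     /* stay within 4x of the fastest child */

    typedef struct mirror_vdev {
        int    nchildren;
        double srtt[MAX_CHILDREN];  /* smoothed read latency per child, sec */
    } mirror_vdev_t;

    /*
     * Mark every child whose smoothed latency is within READ_SET_MULT
     * of the fastest child; the others only get reads when the fast
     * ones fail.  Returns the number of children selected.
     */
    static int
    mirror_pick_read_set(const mirror_vdev_t *mv, bool in_read_set[])
    {
        double fastest = mv->srtt[0];
        int    i, n = 0;

        for (i = 1; i < mv->nchildren; i++)
            if (mv->srtt[i] < fastest)
                fastest = mv->srtt[i];

        for (i = 0; i < mv->nchildren; i++) {
            in_read_set[i] = (mv->srtt[i] <= READ_SET_MULT * fastest);
            if (in_read_set[i])
                n++;
        }
        return (n);
    }

    int
    main(void)
    {
        /* two fast local disks (3 ms, 4 ms), one slow iSCSI disk (40 ms) */
        mirror_vdev_t mv = { 3, { 0.003, 0.004, 0.040 } };
        bool set[MAX_CHILDREN];
        int  i, n = mirror_pick_read_set(&mv, set);

        printf("%d of %d children in the read set:", n, mv.nchildren);
        for (i = 0; i < mv.nchildren; i++)
            printf(" %d", (int)set[i]);
        printf("\n");
        return (0);
    }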
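
The second sketch is the per-I/O re-issue check, the decision that has
to run thousands of times per second.  The point is only that the fast
path can be a couple of comparisons against a smoothed latency, much
like TCP's retransmit timer; the constants (EWMA_WEIGHT, REISSUE_MULT,
REISSUE_FLOOR) are invented:

    #include <stdbool.h>
    #include <stdio.h>

    #define EWMA_WEIGHT     0.125   /* like TCP's SRTT smoothing factor    */
    #define REISSUE_MULT    8.0     /* re-issue past 8x the smoothed value */
    #define REISSUE_FLOOR   0.050   /* ...but never sooner than 50 ms      */

    typedef struct disk_lat {
        double srtt;                /* smoothed read latency, seconds */
    } disk_lat_t;

    /* O(1) update on every completed read */
    static void
    disk_lat_update(disk_lat_t *dl, double sample_sec)
    {
        dl->srtt += EWMA_WEIGHT * (sample_sec - dl->srtt);
    }

    /*
     * O(1) check on an outstanding read: stop waiting on this disk and
     * issue the read to another mirror half (or down the parity
     * reconstruction path) once it has waited too long.
     */
    static bool
    read_should_reissue(const disk_lat_t *dl, double waited_sec)
    {
        double limit = REISSUE_MULT * dl->srtt;

        if (limit < REISSUE_FLOOR)
            limit = REISSUE_FLOOR;
        return (waited_sec > limit);
    }

    int
    main(void)
    {
        disk_lat_t dl = { 0.008 };              /* 8 ms smoothed latency */

        disk_lat_update(&dl, 0.010);
        printf("re-issue after 30 ms?  %d\n", read_should_reissue(&dl, 0.030));
        printf("re-issue after 200 ms? %d\n", read_should_reissue(&dl, 0.200));
        return (0);
    }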
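
The third sketch is the bucketed diagnosis rule quoted above, written
down as code so it's clear how dumb it can afford to be.  The numbers
(10 minutes, 4-minute buckets, 40 minutes, 1000x, 120 seconds) are just
the ones from the example, and the data structure is invented:

    #include <stdbool.h>
    #include <stdio.h>

    #define NBUCKETS        10      /* ten 4-minute buckets = last 40 min */
    #define HANG_SEC        600.0   /* one I/O unanswered for 10 minutes  */
    #define SLOW_MULT       1000.0  /* "1000 times longer than usual"...  */
    #define SLOW_FLOOR_SEC  120.0   /* ...or 120 s, whichever is larger   */
    #define SLOW_BUCKETS    3       /* in at least three separate buckets */

    typedef struct disk_history {
        double usual_latency_sec;           /* "usual" completion latency  */
        double worst_by_bucket[NBUCKETS];   /* slowest I/O issued in each  */
                                            /* 4-minute bucket, seconds    */
        double oldest_outstanding_sec;      /* age of oldest unanswered I/O */
    } disk_history_t;

    static bool
    disk_should_fault(const disk_history_t *h)
    {
        double slow = SLOW_MULT * h->usual_latency_sec;
        int    i, nslow = 0;

        if (slow < SLOW_FLOOR_SEC)
            slow = SLOW_FLOOR_SEC;

        if (h->oldest_outstanding_sec > HANG_SEC)
            return (true);

        for (i = 0; i < NBUCKETS; i++)
            if (h->worst_by_bucket[i] > slow)
                nslow++;

        return (nslow >= SLOW_BUCKETS);
    }

    int
    main(void)
    {
        disk_history_t h = {
            .usual_latency_sec = 0.010,
            .worst_by_bucket = { 130, 0.01, 125, 0.02, 140 },
            .oldest_outstanding_sec = 5.0,
        };

        printf("fault? %d\n", disk_should_fault(&h));   /* prints 1 */
        return (0);
    }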

I guess I'm changing my story slightly.  I *would* want ZFS to collect
drive performance statistics and report them to FMA, but I wouldn't
suggest reporting the _decision_ outputs of the 
'metaparam -r'-replacement engine to FMA, only the raw stats.

And, of course, ``reporting'' is tricky for the diagnosis case because
of the bolted-on separation of FMA.  You can't usefully report ``the
I/O took 3 hours to complete,'' because you've now waited three hours
to get the report, and the completed I/O has a normal driver error
attached to it, so no fancy statistical decisions are needed any
longer.  Instead, you have to make polled reports to userland a couple
of times a minute, containing the list of incomplete outstanding I/Os
along with averages and variances and whatever else.
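
Something like this hypothetical per-disk snapshot, polled every 20-30
seconds, is what I mean; the still-outstanding I/Os have to be in it so
the diagnosis engine can notice a hang before the I/O ever completes.
Again the names and layout are invented, not an existing ZFS or FMA
interface:

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_OUTSTANDING 16              /* truncate a longer list */

    /* one still-unanswered I/O at snapshot time */
    typedef struct pending_io {
        uint64_t offset;
        uint64_t size;
        double   age_sec;                   /* how long it has waited */
    } pending_io_t;

    /* per-disk snapshot handed up to the userland diagnosis engine */
    typedef struct disk_report {
        uint64_t     reads, writes;         /* completed since last poll */
        double       lat_mean_sec;          /* mean completion latency   */
        double       lat_var_sec2;          /* variance of that latency  */
        uint32_t     npending;
        pending_io_t pending[MAX_OUTSTANDING];
    } disk_report_t;

    static void
    disk_report_print(const disk_report_t *r)
    {
        uint32_t i;

        printf("%llu reads, %llu writes, mean %.4fs, var %.6f, %u pending\n",
            (unsigned long long)r->reads, (unsigned long long)r->writes,
            r->lat_mean_sec, r->lat_var_sec2, r->npending);
        for (i = 0; i < r->npending && i < MAX_OUTSTANDING; i++)
            printf("  pending: offset %llu, %llu bytes, %.1fs old\n",
                (unsigned long long)r->pending[i].offset,
                (unsigned long long)r->pending[i].size,
                r->pending[i].age_sec);
    }

    int
    main(void)
    {
        disk_report_t r = {
            .reads = 1200, .writes = 300,
            .lat_mean_sec = 0.012, .lat_var_sec2 = 0.000004,
            .npending = 1,
            .pending = { { 4096, 131072, 95.0 } },
        };

        disk_report_print(&r);
        return (0);
    }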
