>>>>> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes:

    bf> If the system or device is simply overwhelmed with work, then
    bf> you would not want the system to go haywire and make the
    bf> problems much worse.

None of the decisions I described it making based on performance
statistics are ``haywire''---I said it should funnel reads to the
faster side of the mirror, and do so quickly and unconservatively.
What's your issue with that?

    bf> You are saying that I can't split my mirrors between a local
    bf> disk in Dallas and a remote disk in New York accessed via
    bf> iSCSI?

Nope, you've misread.  I'm saying reads should go to the local disk
only, and writes should go to both.  See SVM's 'metaparam -r'.  I
suggested that, unlike the SVM feature, it should be automatic,
because being automatic is what makes it useful as an availability
tool rather than just a performance optimisation.
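
To make that concrete, here is a minimal sketch of the policy I mean---read
from the local half, write to both halves.  The structs and the
device_write() helper are invented for illustration; this is not ZFS or SVM
code, just the shape of the idea in C.

#include <stddef.h>

struct child {
    int is_local;   /* e.g. direct-attached Dallas disk vs. iSCSI to New York */
    int healthy;
};

struct mirror {
    struct child *child;
    size_t        nchildren;
};

/* hypothetical lower-level write entry point */
extern void device_write(struct child *c, const void *buf, size_t len);

/* Reads: prefer a healthy local child; fall back to any healthy child. */
static struct child *
pick_read_child(struct mirror *m)
{
    struct child *fallback = NULL;
    for (size_t i = 0; i < m->nchildren; i++) {
        struct child *c = &m->child[i];
        if (!c->healthy)
            continue;
        if (c->is_local)
            return (c);             /* never send a read across the WAN */
        if (fallback == NULL)
            fallback = c;
    }
    return (fallback);
}

/* Writes: fan out to every healthy child, so redundancy is unchanged. */
static void
mirror_write(struct mirror *m, const void *buf, size_t len)
{
    for (size_t i = 0; i < m->nchildren; i++)
        if (m->child[i].healthy)
            device_write(&m->child[i], buf, len);
}

The only difference from 'metaparam -r' is that the is_local judgement would
be made automatically, from the latency statistics described below, rather
than set by hand.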

The performance-statistic logic should influence read scheduling
immediately, and also generate events which are fed to FMA; FMA can
then mark devices faulty.  There's no need for both layers to make the
same decision at the same time.  If the events turn out not to be
useful for diagnosis, ZFS could skip generating them, or fmd could
ignore them in its diagnosis.  I suspect they *would* be useful, though.
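
Concretely, the decoupling looks something like this.  Every name here is
invented for illustration---none of it is the real ZFS or FMA interface:

/* Fast path and slow path, kept separate.  looks_slow() is the
 * per-device statistics test sketched further below. */
extern int  looks_slow(int devid);                     /* hypothetical */
extern void steer_reads_away(int devid);               /* immediate, reversible */
extern void post_event(const char *eclass, int devid); /* telemetry only */

void
on_io_complete(int devid)
{
    if (looks_slow(devid)) {
        /* Fast path: local, cheap, easily undone if the device recovers. */
        steer_reads_away(devid);

        /* Slow path: just report.  fmd may diagnose a fault from a
         * stream of these, or ignore them entirely---either way the
         * read path above has already protected itself. */
        post_event("hypothetical.io.slow", devid);
    }
}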

I'm imagining the read rescheduling would happen very quickly---quicker
than one would want to wait for a round-trip through FMA, in much less
than a second.  That's why it would have to compare devices to their
peers in the same vdev, and to themselves over time, rather than use
fixed timeouts or punt to haphazard driver and firmware logic.
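
Here is the comparison criterion I have in mind, again as an illustrative
sketch with made-up names and constants (the 8x thresholds are assumptions,
not anything measured): a child counts as slow only relative to its vdev
siblings and to its own recent history, so no absolute timeout appears
anywhere.

#include <stddef.h>

struct child_lat {
    double ewma_us;      /* smoothed recent service time */
    double baseline_us;  /* longer-horizon "normal" for this device */
};

static void
record_latency(struct child_lat *c, double latency_us)
{
    c->ewma_us     = 0.8  * c->ewma_us     + 0.2  * latency_us; /* fast average */
    c->baseline_us = 0.99 * c->baseline_us + 0.01 * latency_us; /* slow average */
}

static int
child_is_slow(const struct child_lat *c,
              const struct child_lat *peers, size_t npeers)
{
    double best = c->ewma_us;
    for (size_t i = 0; i < npeers; i++)
        if (peers[i].ewma_us < best)
            best = peers[i].ewma_us;

    return (c->ewma_us > 8.0 * best &&           /* much worse than its siblings */
            c->ewma_us > 8.0 * c->baseline_us);  /* and than its own past self */
}

A drive in the 1000x-2000x failure mode discussed below blows past both
tests within a handful of I/Os, which is what lets the read path react in
well under a second without waiting on an FMA round-trip.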

    bf>    o System waits substantial time for devices to (possibly)
    bf> recover in order to ensure that subsequently written data has
    bf> the least chance of being lost.

There's no need for the filesystem to *wait* for data to be written
unless you are calling fsync---and maybe not even then, if there's a
slog.

I said clearly that you read from only one half of the mirror, but
write to both.  But you're right that the trick probably won't work
perfectly---eventually dead devices need to be faulted.  The idea is
that normal write caching buys you orders of magnitude more time in
which to make a better decision before anyone notices.

Experience here is that ``waits substantial time'' usually means
``freezes for hours and gets rebooted''.  There's no need to be
abstract: we know what happens when a drive starts taking 1000x -
2000x longer than usual to respond to commands, and we know that this
is THE common online failure mode for drives.  That's what started the
thread.  So, think about this: hanging for an hour trying to write to
a broken device may block other writes to devices which are still
working, until the patiently-waiting data is eventually lost in the
reboot.

    bf>    o System immediately ignores slow devices and switches to
    bf> non-redundant non-fail-safe non-fault-tolerant
    bf> may-lose-your-data mode.  When system is under intense load,
    bf> it automatically switches to the may-lose-your-data mode.

Nobody's proposing a system which silently rocks back and forth
between faulted and online.  That's not what we have now, and no such
system would naturally arise.  If FMA marked a drive faulty based on
performance statistics, that drive would be retired permanently and
hot-spare-replaced.  Obviously false positives are bad; just as
obviously, freezes and reboots are bad.

It's not my idea to use FMA in this way.  This is how FMA was pitched,
and it's the excuse that was given for leaving good exception handling
out of ZFS for two years.  So, where's the beef?
