On Jan 1, 2010, at 8:11 AM, R.G. Keen wrote:

On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
Some nits:
disks aren't marked as semi-bad, but if ZFS has trouble with a
block, it will try not to use the block again. So there are two
levels of recovery at work: whole device and block.
Ah. I hadn't found that yet.

The "one more and you're dead" is really N errors in T time.
I'm interpreting this as "OS/S/ZFS/drivers will not mark a disk
as failed until it returns N errors within T time," which means -
check me on this - that T is also the window in which a second
real-or-fake disk failure can occur while the system is still
balled up worrying about the first disk not responding.

Perhaps I am not being clear.  If a disk is really dead, then
there are several different failure modes that can be responsible.
For example, if a disk does not respond to selection, then it
is diagnosed as failed very quickly. But that is not the TLER
case.  The TLER case is when the disk cannot read from
media without error, so it will continue to retry... perhaps
forever or until reset. If a disk does not complete an I/O operation
in (default) 60 seconds (for the sd driver), then it will be reset and
the I/O operation retried.

If a disk returns bogus data (failed ZFS checksum), then the
N in T algorithm may kick in. I have seen this failure mode many
times.
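
(To make the N-in-T idea concrete, here is a toy sketch of that
kind of threshold counter. It is not the actual FMA/zfs-diagnosis
code, which uses more general SERD engines; every name and
constant below is invented for illustration.)

/* Toy sketch of an "N errors in T seconds" threshold, in the spirit
 * of the SERD-style counting FMA does. Not the real diagnosis code;
 * the names and constants are invented for illustration. */
#include <stdio.h>
#include <time.h>

#define N_THRESHOLD 10    /* declare a fault after N errors...     */
#define T_WINDOW    600   /* ...seen within a window of T seconds  */

typedef struct {
        time_t  stamps[N_THRESHOLD];    /* ring of error timestamps */
        int     count;
} serd_sketch_t;

/* Record one error; return 1 if the last N errors fit inside T. */
static int
serd_record(serd_sketch_t *s, time_t now)
{
        s->stamps[s->count % N_THRESHOLD] = now;
        s->count++;
        if (s->count < N_THRESHOLD)
                return (0);
        /* after the increment, count % N indexes the oldest of the last N */
        return ((now - s->stamps[s->count % N_THRESHOLD]) <= T_WINDOW);
}

int
main(void)
{
        serd_sketch_t s = { { 0 }, 0 };
        time_t t0 = time(NULL);
        int i;

        /* simulate checksum errors arriving one second apart */
        for (i = 0; i < 12; i++) {
                printf("error %2d: %s\n", i + 1,
                    serd_record(&s, t0 + i) ?
                    "N-in-T tripped, diagnose a fault" : "below threshold");
        }
        return (0);
}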

This is based on a paper I read online about the increasing
need for raidz3 (or similar) over raidz2, because disk
throughput has not increased in step with disk size, which
leads to ever longer times to recover from a first failure
using the redundancy data stored in the array. The
notice-an-error time plus the rebuild-the-array time is the
window in which losing another disk, soft or hard, will lead
to the inability to resilver the array.

A similar observation is that the error rate (errors/bit) has not
changed, but the number of bits continues to increase.

For disks which don't return when there is an error, you can
reasonably expect that T will be a long time (multiples of 60
seconds) and therefore the N in T threshold will not be triggered.
The scenario I had in mind was two disks ready to fail, either
soft (long time to return data) or hard (bang! That sector/block
or disk is not coming back, period). The first fails and starts
trying to recover in desktop-disk fashion, maybe taking hours.

Yes, this is the case for TLER. The only way around this is to
use disks that return failures when they occur.

This leaves the system with no error report (i.e. the N-count is
zero) and the T-timer ticking. Meanwhile the array is spinning.
The second fragile disk is going to hit its own personal pothole
at some point soon in this scenario.

What happens next is not clear to me. Is OS/S/ZFS going to
suspend disk operations until it finally does hear back from
failing disk 1, since N is still at 0 because the disk hasn't
reported back yet? Or will the array continue with other
operations, noting that the operation involving failing disk 1
has not completed, and either stack another request on
failing disk 1, or access failing disk 2 and get its error too
at some point? Or both?

ZFS issues I/O in parallel. However, that does not prevent an
application or ZFS metadata transactions from waiting on a
sequence of I/O.

If the timeout is truly N errors in T time, and N is never
reported back because the disk spends some hours retrying,
then it looks like this is a ZFS hang, if not a system hang.

The drivers will retry and fail the I/O. By default, for SATA
disks using the sd driver, there are 5 retries of 60 seconds.
After 5 minutes, the I/O will be declared failed and that info
is passed back up the stack to ZFS, which will start its
recovery.  This is why the T part of N in T doesn't work so
well for the TLER case.
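
(For the impatient, the sd timeout and retry count are tunable.
On systems I've tuned, it looked roughly like the /etc/system
fragment below; the names and values are from memory, so check
them against the sd.c linked further down rather than taking
my word for it.)

* Example only: shrink sd's per-command timeout and retry count.
* Stock defaults are 60 seconds and 5 retries.
set sd:sd_io_time = 10
set sd:sd_retry_count = 2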

If there is a timeout of some kind which fires even if N never
gets above 0, that would at least unhang the file system (or the
whole system), but it leaves you exposed to the second failing
disk's fault having occurred in the meantime, and then you're in
for either another hang-forever or a failed array in the raidz
case.

I don't think the second disk scenario adds value to this
analysis.

The term "degraded" does not have a consistent
definition across the industry.
Of course not! 8-)  Maybe we should use "depraved" 8-)

See the zpool man page for the definition
used for ZFS.  In particular, DEGRADED != FAULTED

Issues are logged, for sure. If you want to monitor them
proactively, you need to configure SNMP traps for FMA.
Ok, can deal with that.

It already does this, as long as there are N errors
in T time.
OK, I can work that one out. I'm still puzzled about what
happens in the "N=0 forever" case. The net result there seems
to be that you need RAID-specific disks to get any kind of
timeout to happen at the disk level at all (depending on the
disk firmware, which, as you note later, is likely to have been
written by a junior EE as his first assignment 8-) )

As above, there is no forever case.  But some folks get impatient
after a few minutes :-)

There is room for improvement here, but I'm not sure how
one can set a rule that would explicitly take care of the I/O never
returning from a disk while a different I/O to the same disk
returns.  More research required here...
Yep. I'm thinking that it might be possible to do a policy-based
setup section for an array where you could select one of a number
of rule-sets for what to do, based on your experience and/or
paranoia about the disks in your array. I had good luck with that
in a primitive whole-machine hardware diagnosis system I worked
with at one point in the dim past. Kind of "if you can't do the
right/perfect thing, then ensure that *something* happens."

One of the rule scenarios might be "if one seek to a disk never
returns but other actions to that disk do work, then halt the
pending action(s) to the disk and/or array, increment N, restart
that disk or the entire array as needed, and retry that action
in a diagnostic loop which decides whether it's a soft fail, a
hard block fail, or a hard disk fail," and then take the proper
action based on the diagnosis. Or it could be "map that disk out
and run diagnostics on it while the hot spare is swapped in,"
depending on whether there's a hot spare.
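
(Purely as a strawman, the kind of selectable rule table I'm
imagining might look something like the C sketch below. Nothing
of the sort exists in ZFS or FMA today, and every name in it is
invented.)

/* Strawman only: a policy table mapping an observed symptom plus
 * hot-spare availability to an action the admin has chosen. */
typedef enum {
        SYMPTOM_IO_STUCK,       /* one I/O never returns, others do   */
        SYMPTOM_IO_FAILED,      /* the driver reported the I/O failed */
        SYMPTOM_BAD_CHECKSUM    /* data came back but failed checksum */
} symptom_t;

typedef enum {
        ACT_RESET_AND_RETRY,    /* reset the disk, retry in a diag loop */
        ACT_OFFLINE_AND_SPARE,  /* map the disk out, swap in the spare  */
        ACT_COUNT_AND_WAIT      /* just bump N and keep going           */
} action_t;

typedef struct {
        symptom_t       symptom;
        int             have_hot_spare;
        action_t        action;
} policy_rule_t;

/* One possible "paranoid, spares on hand" rule set. */
static const policy_rule_t paranoid_rules[] = {
        { SYMPTOM_IO_STUCK,     1, ACT_OFFLINE_AND_SPARE },
        { SYMPTOM_IO_STUCK,     0, ACT_RESET_AND_RETRY },
        { SYMPTOM_IO_FAILED,    1, ACT_OFFLINE_AND_SPARE },
        { SYMPTOM_BAD_CHECKSUM, 0, ACT_COUNT_AND_WAIT },
};

/* Return the first matching rule's action, or a safe default. */
static action_t
policy_lookup(symptom_t sym, int have_spare)
{
        unsigned int i;

        for (i = 0; i < sizeof (paranoid_rules) / sizeof (paranoid_rules[0]); i++) {
                if (paranoid_rules[i].symptom == sym &&
                    paranoid_rules[i].have_hot_spare == have_spare)
                        return (paranoid_rules[i].action);
        }
        return (ACT_COUNT_AND_WAIT);    /* default: behave as today */
}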

The diagnosis engines and sd driver are open source :-)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/scsi/targets/sd.c

But yes, some thought is needed. I always tend to pick the side
of "let the user/admin pick the way they want to fail" which
may not be needed or wanted.

Interesting.  If you have thoughts along this line, fm-discuss or
driver-discuss can be a better forum than zfs-discuss (ZFS is
a consumer of time-related failure notifications).
 -- richard

Once the state changes to DEGRADED, the admin must
zpool clear the errors to return the state to normal. Make sure
your definition of degraded matches.
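
(Concretely, that is just something like the one-liner below; the
pool and disk names here are made up.)

# zpool clear tank c0t2d0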
I still like "depraved"... 8-)

In my experience, disk drive firmware quality and feature sets
vary widely. I've got a bunch of scars from shaky firmware and I
even got a new one a few months ago. So perhaps one day the disk
vendors will perfect their firmware? :-)
Yep - see "junior EE as disk firmware programmer" above.

So you want some scars too? :-)
Probably. It's nice to find someone else who uses the scars
analogy. I was just this Christmas pointing out the assortment
of thin, straight scars on my hands to a nephew to whom I gave
a new knife for the holiday.

Another way to put it is that experience is what you have
left after you've forgotten their name.

R.G.
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
