> On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
> Some nits:
> disks aren't marked as semi-bad, but if ZFS has trouble with a
> block, it will try to not use the block again.  So there are two levels
> of recovery at work: whole device and block.
Ah. I hadn't found that yet.

> The "one more and you're dead" is really N errors in T time.
I'm interpreting this as "OS/S/zfs/drivers will not mark a disk
as failed until it returns N errors in T time," which means -
check me on this - that the window for a second real-or-fake
disk failure is at least T: a second soft-failing disk can hit
its own trouble while the system is still balled up worrying
about the first disk not responding within T.
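
To check my own arithmetic on that threshold, here's the mental
model I have of an "N errors in T time" trip-wire, sketched in
Python. This is just an illustration with invented N and T values,
not the actual FMA/ZFS diagnosis code:

    import time
    from collections import deque

    # Hypothetical "N errors in T seconds" counter; thresholds invented.
    N_ERRORS  = 10      # assumed error-count threshold
    T_SECONDS = 600.0   # assumed time window

    class ErrorWindow:
        def __init__(self, n=N_ERRORS, t=T_SECONDS):
            self.n, self.t = n, t
            self.stamps = deque()

        def record_error(self, now=None):
            """Note one reported error; True means 'fault the disk'."""
            now = time.time() if now is None else now
            self.stamps.append(now)
            # Forget errors that have aged out of the T-second window.
            while self.stamps and now - self.stamps[0] > self.t:
                self.stamps.popleft()
            return len(self.stamps) >= self.n

The part that worries me is that if the disk never actually reports
an error - it just sits there retrying internally - record_error()
is never called at all, the count stays at zero, and nothing ever
trips. That's the hang scenario I go on about below.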

This is based on a paper I read online about the increasing
need for raidz3 or similar over raidz2 or similar, because
throughput from disks has not increased concomitantly with
their size; this leads to increasing times to recover from
first failures using the stored checking data in the array
to rebuild. The notice-an-error time plus the rebuild-the-array
time is the window in which losing another disk, soft or hard,
will lead to the inability to resilver the array.
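
Just to get a feel for the size of that window, a back-of-the-
envelope number (disk size and throughput here are assumptions of
mine, not figures from the paper):

    # Rough lower bound on resilver time: the whole disk has to be
    # read or written at least once.  Sizes/speeds are assumed.
    disk_size_bytes = 2 * 10**12      # 2 TB drive (assumed)
    sustained_rate  = 100 * 10**6     # 100 MB/s streaming (assumed best case)

    rebuild_hours = disk_size_bytes / sustained_rate / 3600.0
    print("best-case resilver: %.1f hours" % rebuild_hours)  # ~5.6 hours

And that's the best case; a real resilver on a busy pool competes
with normal I/O, so the window for a second failure is usually a
good deal longer. That's the argument for the extra parity of raidz3.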

> For disks which don't return when there is an error, you can
> reasonably expect that T will be a long time (multiples of 60
> seconds) and therefore the N in T threshold will not be triggered.
The scenario I had in mind was two disks ready to fail, either
soft (long time to return data) or hard (bang! That sector/block
or disk is not coming back, period). The first fails and starts 
trying to recover in desktop-disk fashion, maybe taking hours.

This leaves the system with no error report (i.e. the N-count is
zero) and the T-timer ticking. Meanwhile the array is spinning.
The second fragile disk is going to hit its own personal pothole
at some point soon in this scenario. 

What happens next is not clear to me. Is OS/S/zfs going to
suspend disk operations until it finally does hear from
failing disk 1, based on N still being at 0 because the disk
hasn't reported back yet? Or will the array continue with other
operations, noting that the operation involving failing disk 1
has not completed, and either stack another request on
failing disk 1, or access failing disk 2 and get its error too
at some point? Or both?

If the timeout is truly N errors in T time, and no error is ever
reported back (so N stays at zero) because the disk spends some
hours retrying, then it looks like this is a zfs hang if not a
system hang.

If there is a timeout of some kind which takes place even
if N never gets over 0, that would at least unhang the
file system/system, but it leaves you exposed to the second
failing disk's fault having occurred in the meantime, and
you're in for another round of either hung-forever or a
failed array in the case of raidz.

> The term "degraded" does not have a consistent
> definition across the industry. 
Of course not! 8-)  Maybe we should use "depraved" 8-)

> See the zpool man page for the definition
> used for ZFS.  In particular, DEGRADED != FAULTED

> Issues are logged, for sure.  If you want to monitor
> them proactively,
> you need to configure SNMP traps for FMA.
Ok, can deal with that.

> It already does this, as long as there are N errors
> in T time.  
OK, I can work that one out. I'm still puzzled about what
happens with the "N=0 forever" case. The net result
on that one seems to be that you need RAID-specific
disks to ever get some kind of timeout to happen at the
disk level (depending on the disk firmware,
which, as you note later, is likely to have been written
by a junior EE as his first assignment 8-) )


>There is room for improvement here, but I'm not sure how
> one can set a rule that would explicitly take care of the I/O never
> returning from a disk while a different I/O to the same disk
> returns.  More research required here...
Yep. I'm thinking that it might be possible to do a policy-based
setup section for an array where you could select one of a number
of rule-sets for what to do, based on your experience and/or
paranoia about the disks in your array. I had good luck with that
in a primitive whole-machine hardware diagnosis system I worked
with at one point in the dim past. Kind of "if you can't do the 
right/perfect thing, then ensure that *something* happens."

One of the rules scenarios might be "if one seek to a disk never
returns but other actions to that disk do work, then halt the
pending action(s) to that disk and/or the array, increment N,
restart that disk or the entire array as needed, and retry that
action in a diagnostic loop, which decides whether it's a soft
fail, hard block fail, or hard disk fail" and then take the
proper action based on the diagnostic. Or it could be "map that
disk out and run diagnostics on it while the hot spare is swapped
in" based on whether there's a hot spare or not.
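
Just to make that concrete, the sort of thing I'm picturing is a
small table of admin-selectable policies, along these lines. This
is purely hypothetical - the policy names, fields, and actions are
all invented, and nothing like it exists in ZFS today:

    # Hypothetical admin-selectable failure policies; everything here
    # (names, fields, timeouts, actions) is invented for illustration.
    POLICIES = {
        "paranoid": {
            "io_timeout_s": 30,        # give up on a hung request quickly
            "on_hang": "fault_disk",   # count it as an error and fault the disk
        },
        "patient": {
            "io_timeout_s": 300,       # tolerate long desktop-disk retries
            "on_hang": "retry",        # keep the disk, just retry the request
        },
    }

    def handle_hung_io(policy_name, has_hot_spare):
        """Decide what to do when one request to a disk never returns."""
        p = POLICIES[policy_name]
        if p["on_hang"] == "fault_disk":
            if has_hot_spare:
                return "swap in hot spare, run diagnostics on suspect disk"
            return "mark disk degraded, keep running degraded"
        return "retry the request after %d seconds" % p["io_timeout_s"]

The point is less the details than having the choice made up front,
matched to how much you trust the particular disks in the box.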

But yes, some thought is needed. I always tend to pick the side
of "let the user/admin pick the way they want to fail" which 
may not be needed or wanted.

> Once the state changes to DEGRADED, the admin must
> zpool clear the errors to return the state to normal. Make sure
> your definition of degraded matches.
I still like "depraved"... 8-)

> In my experience, disk drive firmware quality and
> feature sets vary
> widely.  I've got a bunch of scars from shaky
> firmware and I even
> got a new one a few months ago. So perhaps one day
> the disk vendors will perfect their firmware? :-)
Yep - see "junior EE as disk firmware programmer" above.

> So you want some scars too? :-)
Probably. It's nice to find someone else who uses the scars
analogy. I was just this Christmas pointing out the assortment
of thin, straight scars on my hands to a nephew to whom I gave
a new knife for the holiday.

Another way to put it is that experience is what you have 
left after you've forgotten their name.

R.G.