> On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
> Some nits:
> disks aren't marked as semi-bad, but if ZFS has trouble with a
> block, it will try to not use the block again. So there is two levels
> of recovery at work: whole device and block.

Ah. I hadn't found that yet.
> The "one more and you're dead" is really N errors in T time.

I'm interpreting this as "OS/S/zfs/drivers will not mark a disk as failed
until it returns N errors in T time," which means - check me on this - that
the window for a second real-or-fake disk failure is at least T: a second
soft-failing disk can go bad while the system is still balled up worrying
about the first disk not responding within T.

This is based on a paper I read online about the increasing need for raidz3
or similar over raidz2 or similar, because throughput from disks has not
increased concomitantly with their size, which leads to increasing times to
recover from a first failure by rebuilding from the checking data stored in
the array. The notice-an-error time plus the rebuild-the-array time is the
window in which losing another disk, soft or hard, will lead to the
inability to resilver the array.

> For disks which don't return when there is an error, you can
> reasonably expect that T will be a long time (multiples of 60
> seconds) and therefore the N in T threshold will not be triggered.

The scenario I had in mind was two disks ready to fail, either soft (a long
time to return data) or hard (bang! that sector/block or disk is not coming
back, period). The first fails and starts trying to recover in desktop-disk
fashion, maybe taking hours. This leaves the system with no error report
(i.e. the N-count is zero) and the T-timer ticking. Meanwhile the array is
spinning, and the second fragile disk is going to hit its own personal
pothole at some point soon in this scenario.

What happens next is not clear to me. Is OS/S/zfs going to suspend disk
operations until it finally does hear from failing disk 1, because N is
still at 0 since the disk hasn't reported back yet? Or will the array
continue with other operations, noting that the operation involving failing
disk 1 has not completed, and either stack another request on failing disk
1, or access failing disk 2 and get its error too at some point? Or both?

If the threshold is truly N errors in T time, and N never gets reported
because the disk spends hours retrying, then it looks like a zfs hang if
not a system hang. If there is a timeout of some kind which fires even if N
never gets over 0, that would at least unhang the file system/system, but
it still leaves you open to the second failing disk's fault having
occurred, and with raidz you're in for either hung-forever or a failed
array.

> The term "degraded" does not have a consistent
> definition across the industry.

Of course not! 8-) Maybe we should use "depraved" 8-)

> See the zpool man page for the definition
> used for ZFS. In particular, DEGRADED != FAULTED
> Issues are logged, for sure. If you want to monitor them proactively,
> you need to configure SNMP traps for FMA.

OK, I can deal with that.

> It already does this, as long as there are N errors in T time.

OK, I can work that one out. I'm still puzzled about what happens in the
"N=0 forever" case. The net result there seems to be that you need
RAID-specific disks to get any kind of timeout to happen at the disk level
at all (depending on the disk firmware, which, as you note later, is likely
to have been written by a junior EE as his first assignment 8-) ). A toy
sketch of how I picture the N-in-T counter, and why it never trips in this
case, is just below.
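Just so we're talking about the same thing, here is the model I have in my
head of the N-errors-in-T threshold. It's Python, the class name, numbers,
and defaults are all made up, and it is certainly not the real FMA
diagnosis engine:

from collections import deque

# Toy model of an "N errors in T seconds" fault threshold.
# All names and defaults are invented for illustration; this is not
# the actual OpenSolaris/FMA diagnosis code.
class NInTCounter:
    def __init__(self, n=10, t=600.0):
        self.n = n              # errors required to declare a fault
        self.t = t              # window length, in seconds
        self.reports = deque()  # timestamps of errors actually reported

    def report_error(self, now):
        """Record one reported error; return True if the threshold trips."""
        self.reports.append(now)
        # Forget anything that fell out of the T-second window.
        while self.reports and now - self.reports[0] > self.t:
            self.reports.popleft()
        return len(self.reports) >= self.n

# The "N=0 forever" worry: a desktop drive that retries internally for
# hours never returns the I/O at all, so report_error() is never called,
# the count stays at zero, and the threshold can never trip - the pool
# just waits.

If that picture is wrong, the rest of my worry probably is too.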
> There is room for improvement here, but I'm not sure how
> one can set a rule that would explicitly take care of the I/O never
> returning from a disk while a different I/O to the same disk
> returns. More research required here...

Yep. I'm thinking it might be possible to have a policy-based setup section
for an array, where you select one of a number of rule-sets for what to do,
based on your experience and/or paranoia about the disks in your array. I
had good luck with that approach in a primitive whole-machine hardware
diagnosis system I worked with at one point in the dim past. Kind of "if
you can't do the right/perfect thing, then ensure that *something*
happens."

One of the rule scenarios might be: "if one seek to a disk never returns
but other actions to that disk do work, then halt the pending action(s) to
that disk and/or the array, increment N, restart that disk or the entire
array as needed, and retry the action in a diagnostic loop that decides
whether it's a soft fail, a hard block fail, or a hard disk fail," and then
take the proper action based on the diagnosis. Or it could be "map that
disk out and run diagnostics on it while the hot spare is swapped in,"
depending on whether there's a hot spare. (A very rough sketch of what I
mean is in the P.S. below.) But yes, some thought is needed. I always tend
to pick the side of "let the user/admin pick the way they want to fail,"
which may not be needed or wanted.

> Once the state changes to DEGRADED, the admin must
> zpool clear the errors to return the state to normal. Make sure
> your definition of degraded matches.

I still like "depraved"... 8-)

> In my experience, disk drive firmware quality and feature sets vary
> widely. I've got a bunch of scars from shaky firmware and I even
> got a new one a few months ago. So perhaps one day
> the disk vendors will perfect their firmware? :-)

Yep - see "junior EE as disk firmware programmer" above.

> So you want some scars too? :-)

Probably. It's nice to find someone else who uses the scars analogy. I was
just this Christmas pointing out the assortment of thin, straight scars on
my hands to a nephew to whom I gave a new knife for the holiday. Another
way to put it is that experience is what you have left after you've
forgotten their name.

R.G.
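P.S. Here is the kind of thing I mean by a policy-based rule-set, as a very
rough Python sketch. The policy names, fields, and timeouts are all
invented for illustration; none of them correspond to real zfs or FMA
tunables.

# Admin-selectable rule-sets for what to do with a disk that stops
# answering. Everything here is hypothetical.
POLICIES = {
    # Trust the disks: wait a long while before giving up on an I/O.
    "patient":  {"io_timeout_s": 300, "retries": 5, "action": "retry_then_diagnose"},
    # Desktop-class disks that may retry internally for minutes:
    # give up early and go straight to the diagnostic loop.
    "paranoid": {"io_timeout_s": 15,  "retries": 1, "action": "diagnose"},
    # A hot spare is available: fail fast, swap in the spare,
    # and test the suspect disk offline.
    "spare":    {"io_timeout_s": 15,  "retries": 0, "action": "swap_in_spare"},
}

def handle_stuck_io(policy_name, elapsed_s):
    """Decide what to do about an I/O that has not returned yet."""
    p = POLICIES[policy_name]
    if elapsed_s < p["io_timeout_s"]:
        return "keep_waiting"
    # Timed out: treat it as an error even though the disk never answered,
    # so the N-in-T counter can still advance, then apply the chosen action.
    return p["action"]

The point is just that the admin picks the rule-set up front, instead of
the system having one hard-wired answer for every kind of disk.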