On Dec 31, 2009, at 6:14 PM, R.G. Keen wrote:
On Thu, 31 Dec 2009, Bob Friesenhahn wrote:
I like the nice and short answer from this "Bob Friesen" fellow the best. :-)
It was succinct, wasn't it? 8-)
Sorry - I pulled the attribution from the ID, not the
signature which was waiting below. DOH!
When you say:
It does not really matter what Solaris or ZFS does if the drive
essentially locks up when it is trying to recover a bad sector.
I'd have to say that it depends. If Solaris/zfs/etc. is restricted
to actions which consist of marking the disk semi-permanently
bad and continuing, yes, it amounts to the same thing: it opens
a yawning chasm of "one more error and you're dead," until the
array can be serviced and un-degraded. At least I think it
does, based on what I've read, anyway.
Some nits:
Disks aren't marked as semi-bad, but if ZFS has trouble with a
block, it will try not to use that block again. So there are two
levels of recovery at work: whole device and block.
The "one more and you're dead" is really N errors in T time.
For disks which don't return when there is an error, you can
reasonably expect that T will be a long time (multiples of 60
seconds) and therefore the N in T threshold will not be triggered.
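The N-in-T rule above is just a sliding window over error timestamps. Here is a minimal sketch of the idea (illustrative only, not the actual FMA diagnosis engine; class and parameter names are invented). It also shows why a drive that hangs for a minute per error never trips the threshold:

```python
import time
from collections import deque

class SoftErrorThreshold:
    """Trip when N errors occur within a window of T seconds.
    Illustrative sketch of the N-in-T idea, not real FMA code."""

    def __init__(self, n, t_seconds):
        self.n = n
        self.t = t_seconds
        self.events = deque()  # timestamps of recent errors

    def record_error(self, now=None):
        """Record one error; return True if the threshold is tripped."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Discard errors that have fallen out of the window.
        while self.events and now - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n

# A drive that stalls ~61s per error can never accumulate
# 10 errors inside a 60-second window, so it never trips:
det = SoftErrorThreshold(n=10, t_seconds=60)
tripped = [det.record_error(now=i * 61.0) for i in range(20)]
```

With the slow-drive timing above, `tripped` stays all-False: each error ages out of the window before the next one arrives.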
The term "degraded" does not have a consistent definition
across the industry. See the zpool man page for the definition
used for ZFS. In particular, DEGRADED != FAULTED.
However, if OS/S/zfs/etc. performs an appropriate fire drill up
to and including logging the issues, quiescing the array, and
annoying the operator then it closes up the sudden-death window.
This gives the operator of the array a chance to do something
about it, such as swapping in a spare and starting
rebuilding/resilvering/etc.
Issues are logged, for sure. If you want to monitor them proactively,
you need to configure SNMP traps for FMA.
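For reference, wiring FMA into SNMP amounts to loading the Fault Manager MIB module into the Net-SNMP daemon and pointing traps at your monitoring host. The sketch below shows the shape of the configuration; the file path and module location vary by Solaris/OpenSolaris release, so treat the specifics as assumptions and check your system's documentation:

```
# /etc/sma/snmp/snmpd.conf -- illustrative sketch, paths vary by release
# Load the Sun Fault Manager (sunFM) MIB module:
dlmod sunFM /usr/lib/fm/libfmd_snmp.so.1
# Send SNMPv2c traps to the monitoring station:
trap2sink monitoring-host.example.com public
```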
Given the largish aggregate monetary value to RAIDZ builders of
sidestepping the doubled cost of RAID-specialized drives, it occurs
to me that having a special set of actions for desktop-ish drives
might be a good idea. Something like a fix-the-failed repair mode
which pulls all recoverable data off the purportedly failing drive
and onto a new spare to avoid a monster resilvering and the associated
vulnerable time to a second or third failure.
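The repair mode described above boils down to: copy every block that is still readable straight from the suspect drive to the spare, and fall back to expensive parity reconstruction only for the blocks that fail. A toy sketch of that policy (every name here is invented for illustration; this is not ZFS code):

```python
def evacuate(suspect_read, reconstruct, spare_write, nblocks):
    """Copy a failing drive to a spare, rebuilding only unreadable blocks.
    Returns (copied, rebuilt) block counts."""
    copied = rebuilt = 0
    for blk in range(nblocks):
        data = suspect_read(blk)      # None models an unreadable sector
        if data is None:
            data = reconstruct(blk)   # stands in for a RAIDZ parity rebuild
            rebuilt += 1
        else:
            copied += 1
        spare_write(blk, data)
    return copied, rebuilt

# Toy usage: block 3 is unreadable on the suspect drive.
disk = {b: f"data{b}" for b in range(5)}
spare = {}
copied, rebuilt = evacuate(
    suspect_read=lambda b: None if b == 3 else disk[b],
    reconstruct=lambda b: disk[b],
    spare_write=spare.__setitem__,
    nblocks=5,
)
```

Because most blocks take the cheap direct-copy path, the window of vulnerability is far shorter than a full resilver that reconstructs everything.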
It already does this, as long as there are N errors in T time. There
is room for improvement here, but I'm not sure how one can set a
rule that would explicitly take care of the I/O never returning from
a disk while a different I/O to the same disk returns. More research
required here...
Viewed in that light, exactly what OS/S/zfs does on a long extended
reply from a disk and exactly what can be done to minimize the
time when the array runs in a degraded mode where the next step
loses the data seems to be a really important issue.
Once the state changes to DEGRADED, the admin must zpool clear
the errors to return the state to normal. Make sure your definition of
degraded matches.
Well, OK, it does to me because my purpose here is getting to
background scrubbing of errors in the disks. Other things might
be more important to others. 8-)
And the question might be moot if the SMART SCT architecture in
desktop drives lets you do a power-on hack to shorten the reply-failed
time for better raid operation. That's actually the solution I'd like
to see in a perfect world - I get back to a redundant array of
INEXPENSIVE disks, and I can pick those disks to be big and
slow/low power instead of fast/high power.
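On drives whose firmware supports it, smartmontools can do exactly this "power-on hack" via SCT Error Recovery Control: `smartctl -l scterc,READ,WRITE` caps the recovery time, in tenths of a second. The setting is volatile on most desktop drives, so it has to be reapplied at each boot. A tiny helper that formats the command (the smartctl syntax is real; the wrapper function and device path are just illustrative):

```python
def scterc_command(device, read_ds=70, write_ds=70):
    """Build the smartctl invocation that caps a drive's error-recovery
    time via SCT ERC. Times are in tenths of a second, so 70 = 7.0s.
    Rerun at every boot: the setting resets on power cycle."""
    return f"smartctl -l scterc,{read_ds},{write_ds} {device}"

cmd = scterc_command("/dev/rdsk/c0t1d0")  # hypothetical device path
```

A 7-second cap keeps a desktop drive from stalling the pool for the 60+ seconds of heroic recovery discussed earlier, letting ZFS's own redundancy handle the bad sector instead.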
In my experience, disk drive firmware quality and feature sets vary
widely. I've got a bunch of scars from shaky firmware and I even
got a new one a few months ago. So perhaps one day the disk
vendors will perfect their firmware? :-)
I'd welcome any enlightened speculation on this. I do recognize that
I'm an idiot on these matters compared to people with actual
experience. 8-)
So you want some scars too? :-)
-- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss