Since it's not exactly clear what you did with SVM, I am assuming the
following:

You had a file system on top of the mirror and there was some I/O
occurring to the mirror. The *only* time SVM puts a device into
maintenance is when we receive an EIO from the underlying device. So
if a write occurred to the mirror, the write to the powered-off side
failed (returned an EIO) and SVM kept going. Since all buffers sent to
sd/ssd are marked with B_FAILFAST, the driver timeouts are short and
the device is put into maintenance.
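
In rough pseudo-code the write path looks something like the sketch
below. This is *not* the real md driver code; it is a simplified,
user-level illustration, and the names, structures and flag value are
made up. But it shows the control flow: every submirror gets the
write, the buffers are tagged failfast so sd/ssd gives up quickly, and
a submirror that returns EIO is placed into maintenance rather than
retried forever.

#include <errno.h>
#include <stdio.h>

#define SKETCH_B_FAILFAST  0x1   /* stand-in for B_FAILFAST */

enum sm_state { SM_OKAY, SM_MAINTENANCE };

struct submirror {
    const char    *sm_name;
    enum sm_state  sm_state;
    int            sm_powered_off;  /* the powered-off side */
};

/* Pretend sd/ssd strategy routine: a failfast buffer fails quickly. */
static int
sketch_strategy(struct submirror *sm, int bflags)
{
    (void) bflags;                  /* failfast just means short timeouts */
    if (sm->sm_powered_off)
        return (EIO);               /* what sd returns once it gives up */
    return (0);
}

/* Pretend mirror write: issue the buffer to every submirror. */
static void
mirror_write(struct submirror *sides, int nsides)
{
    int i;

    for (i = 0; i < nsides; i++) {
        int err = sketch_strategy(&sides[i], SKETCH_B_FAILFAST);

        if (err == EIO) {
            /* The only trigger: EIO from the underlying device. */
            sides[i].sm_state = SM_MAINTENANCE;
            printf("%s: EIO, placed into maintenance\n", sides[i].sm_name);
        } else {
            printf("%s: write ok\n", sides[i].sm_name);
        }
    }
}

int
main(void)
{
    struct submirror sides[2] = {
        { "side A (c1t0d0s0)", SM_OKAY, 0 },
        { "side B (c1t1d0s0)", SM_OKAY, 1 },   /* the powered-off disk */
    };

    mirror_write(sides, 2);
    return (0);
}
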
If I understand Eric correctly, ZFS attempts to see if the device is
really gone. However, I am not quite sure what Eric means by:

    We currently only detect device failure when the device "goes away".

Perhaps the issue here is that ldi_open() succeeds when it shouldn't,
and that is confusing ZFS.
Another way to check is to perform the same test without any I/O
occurring to the file system, then run metastat -i (as root). This is
similar to a scrub for the volumes.
-Sanjay
Richard Elling wrote:
On Tue, 2006-05-16 at 10:32 -0700, Eric Schrock wrote:
On Wed, May 17, 2006 at 03:22:34AM +1000, grant beattie wrote:
What I find interesting is that the SCSI errors were continuous for 10
minutes before I detached it; ZFS wasn't backing off at all. It was
flooding the VGA console faster than the console could print it all
:) From what you said above, once per minute would have been more
desirable.
The "once per minute" is related to the frequency at which ZFS tries to
reopen the device. Regardless, ZFS will try to issue I/O to the device
whenever asked. If you believe the device is completely broken, the
correct procedure (as documented in the ZFS Administration Guide) is to
'zpool offline' the device until you are able to repair it.
I wonder why, given that ZFS knew there was a problem with this disk,
it wasn't marked FAULTED and the pool DEGRADED?
This is the future enhancement that I described below. We need more
sophisticated analysis than simply 'N errors = FAULTED', and that's what
FMA provides. It will allow us to interact with the larger fault
management framework (such as correlating SCSI errors, identifying
controller failure, and more). ZFS is intentionally dumb. Each
subsystem is responsible for reporting errors, but coordinated fault
diagnosis has to happen at a higher level.
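
To make that concrete, the kind of policy a fault management diagnosis
engine applies looks roughly like FMA's SERD idea: declare a fault only
after N errors are seen within a time window T. The toy program below
only illustrates that shape; it is not the real FMA code, and the real
engine consumes ereports rather than counting timestamps:

#include <stdio.h>
#include <time.h>

#define SERD_N  10          /* fire after 10 errors ...            */
#define SERD_T  600         /* ... seen within a 600 second window */

struct serd {
    time_t events[SERD_N];
    int    count;
};

/* Record one error event; return 1 if the engine fires. */
static int
serd_record(struct serd *s, time_t now)
{
    int i;

    if (s->count < SERD_N) {
        s->events[s->count++] = now;
        if (s->count < SERD_N)
            return (0);
    } else {
        /* slide the window of remembered events */
        for (i = 1; i < SERD_N; i++)
            s->events[i - 1] = s->events[i];
        s->events[SERD_N - 1] = now;
    }

    /* N events seen: fire only if they all fall within T seconds. */
    return (now - s->events[0] <= SERD_T);
}

int
main(void)
{
    struct serd s = { { 0 }, 0 };
    time_t now = time(NULL);
    int i;

    /* Simulate a burst of SCSI errors arriving one second apart. */
    for (i = 0; i < 12; i++) {
        if (serd_record(&s, now + i))
            printf("error %d: threshold crossed, diagnose drive\n", i + 1);
        else
            printf("error %d: below threshold\n", i + 1);
    }
    return (0);
}
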
[reason #8752, why pulling disk drives doesn't simulate real failures]
There are also a number of cases where a successful (or unsuccessful
but retryable) error code carries the recommendation to replace the
drive. There really isn't a clean way to write such diagnosis
engines into the various file systems, LVMs, or databases which might
use disk drives. Putting that intelligence into an FMA DE and tying
that into file systems or LVMs is the best way to do this.
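
A concrete example of the "successful, but replace the drive" case is
a command that completes with sense key RECOVERED ERROR yet reports
ASC 0x5D, the drive's failure prediction warning. The sketch below is
illustrative only (it is not how any real DE is written, and the table
is nowhere near a complete decoding of SCSI sense data), but it shows
that the policy is a lookup plus a recommendation, which fits a DE far
better than a file system:

#include <stdio.h>

#define SK_RECOVERED   0x1    /* sense key: recovered error */
#define SK_MEDIUM_ERR  0x3    /* sense key: medium error    */
#define ASC_PFA        0x5d   /* asc: failure prediction threshold exceeded */

static const char *
drive_advice(int sense_key, int asc)
{
    if (asc == ASC_PFA)
        return ("predictive failure: schedule drive replacement");
    if (sense_key == SK_RECOVERED)
        return ("recovered error: no action, but worth counting");
    if (sense_key == SK_MEDIUM_ERR)
        return ("media error: retry, reallocate, watch the error rate");
    return ("no recommendation");
}

int
main(void)
{
    /* A successful I/O that still recommends replacing the drive. */
    printf("%s\n", drive_advice(SK_RECOVERED, ASC_PFA));
    return (0);
}
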
-- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss