Since it's not exactly clear what you did with SVM, I am assuming the
following:

You had a file system on top of the mirror and there was some I/O
occurring to the mirror. The *only* time SVM puts a device into
maintenance is when we receive an EIO from the underlying device. So
if a write occurred to the mirror, the write to the powered-off side
failed (returned an EIO) and SVM kept going. Since all buffers sent to
sd/ssd are marked with B_FAILFAST, the driver timeouts are short and
the device is put into maintenance.
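
In rough pseudo-code the write path looks something like the sketch
below. This is *not* the real md driver code; it is a simplified,
user-level illustration, and the names, structures and flag value are
made up. But it shows the control flow: every submirror gets the
write, the buffers are tagged failfast so sd/ssd gives up quickly, and
a submirror that returns EIO is placed into maintenance rather than
retried forever.

#include <errno.h>
#include <stdio.h>

#define SKETCH_B_FAILFAST  0x1   /* stand-in for B_FAILFAST */

enum sm_state { SM_OKAY, SM_MAINTENANCE };

struct submirror {
    const char    *sm_name;
    enum sm_state  sm_state;
    int            sm_powered_off;  /* the powered-off side */
};

/* Pretend sd/ssd strategy routine: a failfast buffer fails quickly. */
static int
sketch_strategy(struct submirror *sm, int bflags)
{
    (void) bflags;                  /* failfast just means short timeouts */
    if (sm->sm_powered_off)
        return (EIO);               /* what sd returns once it gives up */
    return (0);
}

/* Pretend mirror write: issue the buffer to every submirror. */
static void
mirror_write(struct submirror *sides, int nsides)
{
    int i;

    for (i = 0; i < nsides; i++) {
        int err = sketch_strategy(&sides[i], SKETCH_B_FAILFAST);

        if (err == EIO) {
            /* The only trigger: EIO from the underlying device. */
            sides[i].sm_state = SM_MAINTENANCE;
            printf("%s: EIO, placed into maintenance\n", sides[i].sm_name);
        } else {
            printf("%s: write ok\n", sides[i].sm_name);
        }
    }
}

int
main(void)
{
    struct submirror sides[2] = {
        { "side A (c1t0d0s0)", SM_OKAY, 0 },
        { "side B (c1t1d0s0)", SM_OKAY, 1 },   /* the powered-off disk */
    };

    mirror_write(sides, 2);
    return (0);
}
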
If I understand Eric correctly, ZFS attempts to see if the device is
really gone. However, I am not quite sure what Eric means by:

    We currently only detect device failure when the device "goes away".

Perhaps the issue here is that ldi_open() succeeds when it shouldn't,
and that is confusing ZFS.
Another way to check is to perform the same test without any I/O
occurring to the file system, then run metastat -i (as root). This is
similar to a scrub for the volumes.
-Sanjay
Richard Elling wrote:
On Tue, 2006-05-16 at 10:32 -0700, Eric Schrock wrote:
On Wed, May 17, 2006 at 03:22:34AM +1000, grant beattie wrote:
What I find interesting is that the SCSI errors were continuous for 10
minutes before I detached it; ZFS wasn't backing off at all. It was
flooding the VGA console faster than the console could print it all
:) From what you said above, once per minute would have been more
desirable.
The "once per minute" is related to the frequency at which ZFS tries to
reopen the device. Regardless, ZFS will try to issue I/O to the device
whenever asked. If you believe the device is completely broken, the
correct procedure (as documented in the ZFS Administration Guide) is to
'zpool offline' the device until you are able to repair it.
I wonder why, given that ZFS knew there was a problem with this disk,
it wasn't marked FAULTED and the pool DEGRADED?
This is the future enhancement that I described below. We need more
sophisticated analysis than simply 'N errors = FAULTED', and that's what
FMA provides. It will allow us to interact with the larger fault
management framework (such as correlating SCSI errors, identifying
controller failure, and more). ZFS is intentionally dumb. Each
subsystem is responsible for reporting errors, but coordinated fault
diagnosis has to happen at a higher level.
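
To make that concrete, the kind of policy a fault management diagnosis
engine applies looks roughly like FMA's SERD idea: declare a fault only
after N errors are seen within a time window T. The toy program below
only illustrates that shape; it is not the real FMA code, and the real
engine consumes ereports rather than counting timestamps:

#include <stdio.h>
#include <time.h>

#define SERD_N  10          /* fire after 10 errors ...            */
#define SERD_T  600         /* ... seen within a 600 second window */

struct serd {
    time_t events[SERD_N];
    int    count;
};

/* Record one error event; return 1 if the engine fires. */
static int
serd_record(struct serd *s, time_t now)
{
    int i;

    if (s->count < SERD_N) {
        s->events[s->count++] = now;
        if (s->count < SERD_N)
            return (0);
    } else {
        /* slide the window of remembered events */
        for (i = 1; i < SERD_N; i++)
            s->events[i - 1] = s->events[i];
        s->events[SERD_N - 1] = now;
    }

    /* N events seen: fire only if they all fall within T seconds. */
    return (now - s->events[0] <= SERD_T);
}

int
main(void)
{
    struct serd s = { { 0 }, 0 };
    time_t now = time(NULL);
    int i;

    /* Simulate a burst of SCSI errors arriving one second apart. */
    for (i = 0; i < 12; i++) {
        if (serd_record(&s, now + i))
            printf("error %d: threshold crossed, diagnose drive\n", i + 1);
        else
            printf("error %d: below threshold\n", i + 1);
    }
    return (0);
}
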
[reason #8752, why pulling disk drives doesn't simulate real failures]
There are also a number of cases where a successful (or unsuccessful
but retryable) error code carries the recommendation to replace the
drive. There really isn't a clean way to write such diagnosis
engines into the various file systems, LVMs, or databases which might
use disk drives. Putting that intelligence into an FMA DE and tying
that into file systems or LVMs is the best way to do this.
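
A concrete example of the "successful, but replace the drive" case is
a command that completes with sense key RECOVERED ERROR yet reports
ASC 0x5D, the drive's failure prediction warning. The sketch below is
illustrative only (it is not how any real DE is written, and the table
is nowhere near a complete decoding of SCSI sense data), but it shows
that the policy is a lookup plus a recommendation, which fits a DE far
better than a file system:

#include <stdio.h>

#define SK_RECOVERED   0x1    /* sense key: recovered error */
#define SK_MEDIUM_ERR  0x3    /* sense key: medium error    */
#define ASC_PFA        0x5d   /* asc: failure prediction threshold exceeded */

static const char *
drive_advice(int sense_key, int asc)
{
    if (asc == ASC_PFA)
        return ("predictive failure: schedule drive replacement");
    if (sense_key == SK_RECOVERED)
        return ("recovered error: no action, but worth counting");
    if (sense_key == SK_MEDIUM_ERR)
        return ("media error: retry, reallocate, watch the error rate");
    return ("no recommendation");
}

int
main(void)
{
    /* A successful I/O that still recommends replacing the drive. */
    printf("%s\n", drive_advice(SK_RECOVERED, ASC_PFA));
    return (0);
}
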
-- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss