On Tue, 2006-05-16 at 10:32 -0700, Eric Schrock wrote:
> On Wed, May 17, 2006 at 03:22:34AM +1000, grant beattie wrote:
> >
> > what I find interesting is that the SCSI errors were continuous for 10
> > minutes before I detached it, ZFS wasn't backing off at all. it was
> > flooding the VGA console quicker than the console could print it all
> > :) from what you said above, once per minute would have been more
> > desirable.
>
> The "once per minute" is related to the frequency at which ZFS tries to
> reopen the device. Regardless, ZFS will try to issue I/O to the device
> whenever asked. If you believe the device is completely broken, the
> correct procedure (as documented in the ZFS Administration Guide) is to
> 'zpool offline' the device until you are able to repair it.
>
> > I wonder why, given that ZFS knew there was a problem with this disk,
> > it wasn't marked FAULTED and the pool DEGRADED?
>
> This is the future enhancement that I described below. We need more
> sophisticated analysis than simply 'N errors = FAULTED', and that's what
> FMA provides. It will allow us to interact with larger fault management
> (such as correlating SCSI errors, identifying controller failure, and
> more). ZFS is intentionally dumb. Each subsystem is responsible for
> reporting errors, but coordinated fault diagnosis has to happen at a
> higher level.
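To make the 'N errors = FAULTED' point concrete, here is a minimal sketch, roughly in the spirit of FMA's SERD (Soft Error Rate Discrimination) engines: instead of faulting on a raw error count, the engine only fires when more than N events land inside a rolling time window T. The type and function names (serd_engine_t, serd_record) are invented for illustration and are not the Solaris FMA implementation.

/*
 * Illustrative sketch only -- not the actual FMA code.  A SERD-style
 * engine fires when more than SERD_N events arrive within a rolling
 * window of SERD_T_SEC seconds.
 */
#include <stdio.h>
#include <time.h>

#define SERD_N      10          /* events tolerated within the window */
#define SERD_T_SEC  600         /* window length: 10 minutes */

typedef struct serd_engine {
    time_t  events[SERD_N];     /* timestamps of recent events */
    int     count;              /* how many slots are in use */
} serd_engine_t;

/* Record one error event; return 1 if the engine fires (diagnose fault). */
static int
serd_record(serd_engine_t *sp, time_t now)
{
    int i, live = 0;

    /* keep only the events still inside the window */
    for (i = 0; i < sp->count; i++) {
        if (now - sp->events[i] <= SERD_T_SEC)
            sp->events[live++] = sp->events[i];
    }
    sp->count = live;

    if (sp->count == SERD_N)
        return (1);             /* more than N events within T: fire */

    sp->events[sp->count++] = now;
    return (0);
}

int
main(void)
{
    serd_engine_t serd = { .count = 0 };
    time_t t = time(NULL);
    int i;

    /* Simulate a burst of SCSI errors arriving one second apart. */
    for (i = 0; i < 15; i++) {
        if (serd_record(&serd, t + i)) {
            printf("engine fired after event %d: diagnose device FAULTED\n",
                i + 1);
            break;
        }
    }
    return (0);
}

A burst of errors in a short period fires the engine, while the same number of errors spread over hours does not, which is closer to the "sophisticated analysis" Eric describes than a bare threshold.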
[reason #8752 why pulling disk drives doesn't simulate real failures] There are
also a number of cases where successful or unsuccessful+retryable error codes
carry the recommendation to replace the drive. There really isn't a clean way
to write such diagnosis engines into the various file systems, LVMs, or
databases which might use disk drives. Putting that intelligence into an FMA DE
and tying it into the file systems or LVMs is the best way to do this.

-- richard
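As a rough illustration of the kind of intelligence that belongs in a DE rather than in every consumer of the disk, here is a small sketch that maps SCSI sense data to a "replace the drive" verdict. The function and table are hypothetical; the point is only that a command can complete successfully yet still carry sense data (for example ASC 0x5D, failure prediction threshold exceeded) that argues for replacement.

/*
 * Illustrative sketch only -- not actual FMA or sd driver code.  Maps
 * (sense key, additional sense code) to a coarse recommendation; the
 * names disk_verdict_t and diagnose_sense are invented for the example.
 */
#include <stdio.h>

typedef enum { DISK_OK, DISK_RETRY, DISK_REPLACE } disk_verdict_t;

static disk_verdict_t
diagnose_sense(int sense_key, int asc)
{
    if (asc == 0x5d)            /* failure prediction threshold exceeded */
        return (DISK_REPLACE);
    if (sense_key == 0x1)       /* RECOVERED ERROR: succeeded, but note it */
        return (DISK_RETRY);
    if (sense_key == 0x3)       /* MEDIUM ERROR */
        return (DISK_REPLACE);
    return (DISK_OK);
}

int
main(void)
{
    /* A command that *succeeded* but predicts imminent drive failure. */
    disk_verdict_t v = diagnose_sense(0x0, 0x5d);

    if (v == DISK_REPLACE)
        printf("DE verdict: schedule drive replacement\n");
    return (0);
}

Centralizing this mapping in one diagnosis engine means file systems and LVMs only need to report what they saw, not understand what it means.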