I was able to reproduce this in b93, but might have a different interpretation of the conditions. More below...
Ross Smith wrote:
> A little more information today.  I had a feeling that ZFS would
> continue quite some time before giving an error, and today I've shown
> that you can carry on working with the filesystem for at least half an
> hour with the disk removed.
>
> I suspect on a system with little load you could carry on working for
> several hours without any indication that there is a problem.  It
> looks to me like ZFS is caching reads & writes, and that provided
> requests can be fulfilled from the cache, it doesn't care whether the
> disk is present or not.

In my USB-flash-disk-sudden-removal-while-writing-big-file test:

1. I/O to the missing device stopped (as I expected).
2. FMA kicked in, as expected.
3. /var/adm/messages recorded "Command failed to complete... device gone."
4. After exactly 9 minutes, 17,951 e-reports had been processed and the
   diagnosis was complete.  FMA logged the following to /var/adm/messages:

Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci1458,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd1):
Jul 30 10:33:44 grond   Command failed to complete...Device is gone
Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008
Jul 30 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond
Jul 30 10:42:31 grond SOURCE: zfs-diagnosis, REV: 1.0
Jul 30 10:42:31 grond EVENT-ID: d99769aa-28e8-cf16-d181-945592130525
Jul 30 10:42:31 grond DESC: The number of I/O errors associated with a ZFS device exceeded
Jul 30 10:42:31 grond   acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Jul 30 10:42:31 grond   will be made to activate a hot spare if available.
Jul 30 10:42:31 grond IMPACT: Fault tolerance of the pool may be compromised.
Jul 30 10:42:31 grond REC-ACTION: Run 'zpool status -x' and replace the bad device.

The above URL shows what you expect, but more (and better) info is
available from 'zpool status -xv':

  pool: rmtestpool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rmtestpool    UNAVAIL      0 15.7K     0  insufficient replicas
          c2t0d0p0    FAULTED      0 15.7K     0  experienced I/O failures

errors: Permanent errors have been detected in the following files:

        /rmtestpool/random.data

If you surf to http://www.sun.com/msg/ZFS-8000-HC you'll see words to the
effect that:

    The pool has experienced I/O failures.  Since the ZFS pool property
    'failmode' is set to 'wait', all I/Os (reads and writes) are blocked.
    See the zpool(1M) manpage for more information on the 'failmode'
    property.  Manual intervention is required for I/Os to be serviced.

> I would guess that ZFS is attempting to write to the disk in the
> background, and that this is silently failing.

It is clearly not failing silently.  However, the default failmode
property is set to "wait", which will patiently wait forever.  If you
would rather have the I/O fail, then you should change the failmode to
"continue".  I would not normally recommend a failmode of "panic".

Now to figure out how to recover gracefully... zpool clear isn't happy...
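For anyone following along, this is roughly the recovery sequence I'd
expect to try once the USB disk is plugged back in.  It's a sketch based
on the zpool(1M) manpage rather than something that has worked cleanly in
this test (as noted, zpool clear isn't happy yet); 'rmtestpool' and
'c2t0d0p0' are just the pool and device from the example above:

    # reattach the device, then ask ZFS to retry and clear the error counts
    zpool clear rmtestpool

    # if the vdev stays FAULTED, try bringing it back online explicitly
    zpool online rmtestpool c2t0d0p0

    # verify the pool state and see which files still show permanent errors
    zpool status -xv rmtestpool

With a single-device pool like this one, any file flagged with permanent
errors would still need to be restored from backup even if the pool comes
back.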
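For reference, failmode is an ordinary per-pool property, so you can
check it and change it on a live pool.  A quick sketch, again using the
'rmtestpool' name from above (see zpool(1M) for the details):

    # show the current setting -- 'wait' is the default
    zpool get failmode rmtestpool

    # return an error to new I/Os instead of blocking forever
    zpool set failmode=continue rmtestpool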
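If you want to poke at the diagnosis itself, fmdump is the tool.
Something like the following should work; the UUID is the EVENT-ID from
the syslog output above, and the e-reports are what took ~9 minutes to
chew through:

    # list the error reports (e-reports) fmd received
    fmdump -e

    # show the details of the diagnosed fault by its event UUID
    fmdump -v -u d99769aa-28e8-cf16-d181-945592130525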
[sidebar] while performing this experiment, I noticed that fmd was
checkpointing the diagnosis engine to disk in the
/var/fm/fmd/ckpt/zfs-diagnosis directory.  If this had been the boot
disk, with failmode=wait, I'm not convinced that we'd get a complete
diagnosis... I'll explore that later. [/sidebar]

 -- richard