I was able to reproduce this in b93, but might have a different interpretation of the conditions. More below...
Ross Smith wrote:
> A little more information today.  I had a feeling that ZFS would
> continue quite some time before giving an error, and today I've shown
> that you can carry on working with the filesystem for at least half an
> hour with the disk removed.
>
> I suspect on a system with little load you could carry on working for
> several hours without any indication that there is a problem.  It
> looks to me like ZFS is caching reads & writes, and that provided
> requests can be fulfilled from the cache, it doesn't care whether the
> disk is present or not.

In my USB-flash-disk-sudden-removal-while-writing-big-file test:

1. I/O to the missing device stopped (as I expected).
2. FMA kicked in, as expected.
3. /var/adm/messages recorded "Command failed to complete... device gone."
4. After exactly 9 minutes, 17,951 e-reports had been processed and the
   diagnosis was complete.  FMA logged the following to /var/adm/messages:

Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci1458,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd1):
Jul 30 10:33:44 grond   Command failed to complete...Device is gone
Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008
Jul 30 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond
Jul 30 10:42:31 grond SOURCE: zfs-diagnosis, REV: 1.0
Jul 30 10:42:31 grond EVENT-ID: d99769aa-28e8-cf16-d181-945592130525
Jul 30 10:42:31 grond DESC: The number of I/O errors associated with a ZFS device exceeded
Jul 30 10:42:31 grond   acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Jul 30 10:42:31 grond   will be made to activate a hot spare if available.
Jul 30 10:42:31 grond IMPACT: Fault tolerance of the pool may be compromised.
Jul 30 10:42:31 grond REC-ACTION: Run 'zpool status -x' and replace the bad device.

The above URL shows what you expect, but more (and better) info is
available from 'zpool status -xv':

  pool: rmtestpool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rmtestpool    UNAVAIL      0 15.7K     0  insufficient replicas
          c2t0d0p0    FAULTED      0 15.7K     0  experienced I/O failures

errors: Permanent errors have been detected in the following files:

        /rmtestpool/random.data

If you surf to http://www.sun.com/msg/ZFS-8000-HC you'll see words to the
effect that:

    The pool has experienced I/O failures.  Since the ZFS pool property
    'failmode' is set to 'wait', all I/Os (reads and writes) are blocked.
    See the zpool(1M) manpage for more information on the 'failmode'
    property.  Manual intervention is required for I/Os to be serviced.

> I would guess that ZFS is attempting to write to the disk in the
> background, and that this is silently failing.

It is clearly not failing silently.  However, the default failmode
property is set to "wait", which will patiently wait forever.  If you
would rather have the I/O fail, then you should change the failmode to
"continue".  I would not normally recommend a failmode of "panic".

Now to figure out how to recover gracefully... zpool clear isn't happy...
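For anyone following along, this is roughly the recovery sequence I'd
expect to try once the USB disk is plugged back in.  It's a sketch based
on the zpool(1M) manpage rather than something that has worked cleanly in
this test (as noted, zpool clear isn't happy yet); 'rmtestpool' and
'c2t0d0p0' are just the pool and device from the example above:

    # reattach the device, then ask ZFS to retry and clear the error counts
    zpool clear rmtestpool

    # if the vdev stays FAULTED, try bringing it back online explicitly
    zpool online rmtestpool c2t0d0p0

    # verify the pool state and see which files still show permanent errors
    zpool status -xv rmtestpool

With a single-device pool like this one, any file flagged with permanent
errors would still need to be restored from backup even if the pool comes
back.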
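For reference, failmode is an ordinary per-pool property, so you can
check it and change it on a live pool.  A quick sketch, again using the
'rmtestpool' name from above (see zpool(1M) for the details):

    # show the current setting -- 'wait' is the default
    zpool get failmode rmtestpool

    # return an error to new I/Os instead of blocking forever
    zpool set failmode=continue rmtestpool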
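If you want to poke at the diagnosis itself, fmdump is the tool.
Something like the following should work; the UUID is the EVENT-ID from
the syslog output above, and the e-reports are what took ~9 minutes to
chew through:

    # list the error reports (e-reports) fmd received
    fmdump -e

    # show the details of the diagnosed fault by its event UUID
    fmdump -v -u d99769aa-28e8-cf16-d181-945592130525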
[sidebar] while performing this experiment, I noticed that fmd was
checkpointing the diagnosis engine to disk in the
/var/fm/fmd/ckpt/zfs-diagnosis directory.  If this had been the boot
disk, with failmode=wait, I'm not convinced that we'd get a complete
diagnosis... I'll explore that later. [/sidebar]

 -- richard