On Fri, Feb 6, 2009 at 10:50 AM, Ross Smith <myxi...@googlemail.com> wrote:
> I can check on Monday, but the system will probably panic... which
> doesn't really help :-)
>
> Am I right in thinking failmode=wait is still the default? If so,
> that should be how it's set as this testing was done on a clean
> install of snv_106. From what I've seen, I don't think this is a
> problem with the zfs failmode. It's more of an issue of what happens
> in the period *before* zfs realises there's a problem and applies the
> failmode.
>
> This time there was just a window of a couple of minutes while
> commands would continue. In the past I've managed to stretch it out
> to hours.
>
> To me the biggest problems are:
> - ZFS accepting writes that don't happen (from both before and after
> the drive is removed)
> - No logging or warning of this in zpool status
>
> I appreciate that if you're using cache, some data loss is pretty much
> inevitable when a pool fails, but that should be a few seconds worth
> of data at worst, not minutes or hours worth.
>
> Also, if a pool fails completely and there's data in the cache that
> hasn't been committed to disk, it would be great if Solaris could
> respond by:
>
> - immediately dumping the cache to any (all?) working storage
> - prompting the user to fix the pool, or save the cache before
> powering down the system
>
> Ross
>
>
> On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling <richard.ell...@gmail.com>
> wrote:
>> Ross, this is a pretty good description of what I would expect when
>> failmode=continue. What happens when failmode=panic?
>> -- richard
>>
>>
>> Ross wrote:
>>>
>>> Ok, it's still happening in snv_106:
>>>
>>> I plugged a USB drive into a freshly installed system, and created a
>>> single disk zpool on it:
>>> # zpool create usbtest c1t0d0
>>>
>>> I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
>>> folder to it. I then copied the /etc/apache folder to it, and at 4:05pm,
>>> disconnected the drive.
>>>
>>> At this point there are *no* warnings on screen, or any indication that
>>> there is a problem. To check that the pool was still working, I created
>>> duplicates of the two folders on that drive. That worked without any
>>> errors, although the drive was physically removed.
>>>
>>> 4:07pm
>>> I ran zpool status, the pool is actually showing as unavailable, so at
>>> least that has happened faster than my last test.
>>>
>>> The folder is still open in gnome, however any attempt to copy files to or
>>> from it just hangs the file transfer operation window.
>>>
>>> 4:09pm
>>> /usbtest is still visible in gnome
>>> Also, I can still open a console and use the folder:
>>>
>>> # cd usbtest
>>> # ls
>>> X11 X11 (copy) apache apache (copy)
>>>
>>> I also tried:
>>> # mv X11 X11-test
>>>
>>> That hung, but I saw the X11 folder disappear from the graphical file
>>> manager, so the system still believes something is working with this pool.
>>>
>>> The main GUI is actually a little messed up now. The gnome file manager
>>> window looking at the /usbtest folder has hung. Also, right-clicking the
>>> desktop to open a new terminal hangs, leaving the right-click menu on
>>> screen.
>>>
>>> The main menu still works though, and I can still open a new terminal.
>>>
>>> 4:19pm
>>> Commands such as ls are finally hanging on the pool.
>>>
>>> At this point I tried to reboot, but it appears that isn't working. I
>>> used system monitor to kill everything I had running and tried again, but
>>> that didn't help.
>>>
>>> I had to physically power off the system to reboot.
>>>
>>> After the reboot, as expected, /usbtest still exists (even though the
>>> drive is disconnected). I removed that folder and connected the drive.
>>>
>>> ZFS detects the insertion and automounts the drive, but I find that
>>> although the pool is showing as online, and the filesystem shows as mounted
>>> at /usbtest. But the /usbtest directory doesn't exist.
>>>
>>> I had to export and import the pool to get it available, but as expected,
>>> I've lost data:
>>> # cd usbtest
>>> # ls
>>> X11
>>>
>>> even worse, zfs is completely unaware of this:
>>> # zpool status -v usbtest
>>>   pool: usbtest
>>>  state: ONLINE
>>>  scrub: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         usbtest     ONLINE       0     0     0
>>>           c1t0d0    ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>>
>>> So in summary, there are a good few problems here, many of which I've
>>> already reported as bugs:
>>>
>>> 1. ZFS still accepts read and write operations for a faulted pool, causing
>>> data loss that isn't necessarily reported by zpool status.
>>> 2. Even after writes start to hang, it's still possible to continue
>>> reading data from a faulted pool.
>>> 3. A faulted pool causes unwanted side effects in the GUI, making the
>>> system hard to use, and impossible to reboot.
>>> 4. After a hard reset, ZFS does not recover cleanly. Unused mountpoints
>>> are left behind.
>>> 5. Automatic mounting of pools doesn't seem to work reliably.
>>> 6. zfs status doesn't inform of any problems mounting the pool.
>>>
>>
>>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

Could this be related to the ZFS TXG (transaction group) buffers? I.e.,
ZFS buffers writes for a while before committing them to disk. Then, when
it's time to commit to disk, it realizes the disk has failed, and only from
that point does it enter one of the failmode conditions (wait, continue,
panic, ?).

Could this be the case?

http://blogs.sun.com/roch/date/20080514

--
Brent Jones
br...@servuhome.net
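
To make the hypothesis above concrete, here is a toy Python model of "accept writes into memory, notice the missing device only at commit time, then apply failmode". It is only a sketch of the idea being asked about, not how ZFS is implemented: the ToyPool class, its fields, and its methods are all made up for illustration, and blocking under failmode=wait is stood in for by an exception.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Hypothetical toy model of buffered writes; NOT the real ZFS code path."""
    failmode: str = "wait"          # "wait", "continue" or "panic"
    device_present: bool = True     # flipped to False when the drive is pulled
    faulted: bool = False           # set only when a commit hits the missing device
    pending: list = field(default_factory=list)   # writes buffered in memory
    on_disk: list = field(default_factory=list)   # writes that reached the device

    def write(self, name):
        """Accept a write into the in-memory buffer; the device is not touched."""
        if self.faulted:
            if self.failmode == "continue":
                raise IOError("EIO: pool is faulted")
            # Assumption: a real failmode=wait would block here instead.
            raise RuntimeError("write would block until the device returns")
        self.pending.append(name)
        return True

    def sync(self):
        """Commit the buffered writes; this is where a missing device is noticed."""
        if not self.device_present:
            self.faulted = True
            if self.failmode == "panic":
                raise SystemError("toy panic: pool I/O failure")
            return False
        self.on_disk.extend(self.pending)
        self.pending.clear()
        return True


if __name__ == "__main__":
    pool = ToyPool()
    pool.write("X11")
    pool.sync()                      # first copy reaches the device
    pool.write("apache")             # still only in memory
    pool.device_present = False      # drive is unplugged
    pool.write("X11 (copy)")         # still accepted: nothing has been committed yet
    pool.sync()                      # commit fails, pool becomes faulted
    print("on disk:", pool.on_disk)
    print("never committed:", pool.pending)
```

In this toy model, every write issued between the last successful commit and the failed one is accepted and then lost, and failmode only starts to matter after the failed commit. Whether that matches what snv_106 actually does is exactly the open question.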