I can check on Monday, but the system will probably panic... which
doesn't really help :-)

Am I right in thinking failmode=wait is still the default?  If so,
that should be how it's set, since this testing was done on a clean
install of snv_106.  From what I've seen, I don't think this is a
problem with the ZFS failmode itself.  It's more an issue of what
happens in the period *before* ZFS realises there's a problem and
applies the failmode.
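
(For reference, failmode is a per-pool property, so it should be easy
to confirm what the pool in this test was actually using, and to
switch it over for the panic run when I get the chance:

# zpool get failmode usbtest
# zpool set failmode=panic usbtest

I haven't tried that yet though, hence Monday.)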

This time there was just a window of a couple of minutes during which
commands continued to work.  In the past I've managed to stretch that
out to hours.

To me the biggest problems are:
- ZFS accepting writes that don't happen (both before and after
the drive is removed)
- No logging or warning of this in zpool status

I appreciate that if you're using a write cache, some data loss is
pretty much inevitable when a pool fails, but that should be a few
seconds' worth of data at worst, not minutes' or hours' worth.
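
(My understanding is that dirty data only sits in memory until the
next transaction group commits, and the commit interval is a kernel
tunable.  I *think* on this build it can be read with something like
the line below, though I'm not certain of the variable name, so treat
this as a rough sketch:

# echo zfs_txg_timeout/D | mdb -k

Whatever that interval turns out to be on snv_106, it's what I'd
expect to bound the loss, not minutes or hours.)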

Also, if a pool fails completely and there's data in the cache that
hasn't been committed to disk, it would be great if Solaris could
respond by:

- immediately dumping the cache to any (all?) working storage
- prompting the user to fix the pool, or save the cache before
powering down the system

Ross


On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling <richard.ell...@gmail.com> wrote:
> Ross, this is a pretty good description of what I would expect when
> failmode=continue. What happens when failmode=panic?
> -- richard
>
>
> Ross wrote:
>>
>> Ok, it's still happening in snv_106:
>>
>> I plugged a USB drive into a freshly installed system, and created a
>> single disk zpool on it:
>> # zpool create usbtest c1t0d0
>>
>> I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
>> folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
>> disconnected the drive.
>>
>> At this point there are *no* warnings on screen, or any indication that
>> there is a problem.  To check that the pool was still working, I created
>> duplicates of the two folders on that drive.  That worked without any
>> errors, although the drive was physically removed.
>>
>> 4:07pm
>> I ran zpool status; the pool is actually showing as unavailable, so at
>> least that happened faster than in my last test.
>>
>> The folder is still open in gnome, however any attempt to copy files to or
>> from it just hangs the file transfer operation window.
>>
>> 4:09pm
>> /usbtest is still visible in gnome
>> Also, I can still open a console and use the folder:
>>
>> # cd usbtest
>> # ls
>> X11            X11 (copy)     apache         apache (copy)
>>
>> I also tried:
>> # mv X11 X11-test
>>
>> That hung, but I saw the X11 folder disappear from the graphical file
>> manager, so part of the system still believes the pool is working.
>>
>> The main GUI is actually a little messed up now.  The gnome file manager
>> window looking at the /usbtest folder has hung.  Also, right-clicking the
>> desktop to open a new terminal hangs, leaving the right-click menu on
>> screen.
>>
>> The main menu still works though, and I can still open a new terminal.
>>
>> 4:19pm
>> Commands such as ls are finally hanging on the pool.
>>
>> At this point I tried to reboot, but it appears that isn't working.  I
>> used system monitor to kill everything I had running and tried again, but
>> that didn't help.
>>
>> I had to physically power off the system to reboot.
>>
>> After the reboot, as expected, /usbtest still exists (even though the
>> drive is disconnected).  I removed that folder and connected the drive.
>>
>> ZFS detects the insertion and automounts the drive, but although the
>> pool is showing as online and the filesystem shows as mounted at
>> /usbtest, the /usbtest directory doesn't exist.
>>
>> I had to export and import the pool to get it available, but as expected,
>> I've lost data:
>> # cd usbtest
>> # ls
>> X11
>>
>> Even worse, ZFS is completely unaware of this:
>> # zpool status -v usbtest
>>  pool: usbtest
>>  state: ONLINE
>>  scrub: none requested
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        usbtest     ONLINE       0     0     0
>>          c1t0d0    ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>>
>> So in summary, there are a good few problems here, many of which I've
>> already reported as bugs:
>>
>> 1. ZFS still accepts read and write operations for a faulted pool, causing
>> data loss that isn't necessarily reported by zpool status.
>> 2. Even after writes start to hang, it's still possible to continue
>> reading data from a faulted pool.
>> 3. A faulted pool causes unwanted side effects in the GUI, making the
>> system hard to use, and impossible to reboot.
>> 4. After a hard reset, ZFS does not recover cleanly.  Unused mountpoints
>> are left behind.
>> 5. Automatic mounting of pools doesn't seem to work reliably.
>> 6. zpool status doesn't report any problems mounting the pool.
>>
>
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
