snv_91.  I downloaded snv_94 today so I'll be testing with that tomorrow.
> Date: Mon, 28 Jul 2008 09:58:43 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> To: [EMAIL PROTECTED]
>
> Which OS and revision?
>  -- richard
>
> Ross wrote:
> > Ok, after doing a lot more testing of this I've found it's not the
> > Supermicro controller causing problems. It's purely ZFS, and it causes
> > some major problems! I've even found one scenario that appears to cause
> > huge data loss without any warning from ZFS - up to 30,000 files and
> > 100MB of data missing after a reboot, with ZFS reporting that the pool
> > is OK.
> >
> > ***********************************************************************
> > 1. Solaris handles USB and SATA hot plug fine
> >
> > If disks are not in use by ZFS, you can unplug USB or SATA devices, and
> > cfgadm will recognise the disconnection. USB devices are recognised
> > automatically as you reconnect them; SATA devices need reconfiguring.
> > Cfgadm even recognises the SATA device as an empty bay:
> >
> > # cfgadm
> > Ap_Id      Type         Receptacle  Occupant      Condition
> > sata1/7    sata-port    empty       unconfigured  ok
> > usb1/3     unknown      empty       unconfigured  ok
> >
> > -- insert devices --
> >
> > # cfgadm
> > Ap_Id      Type         Receptacle  Occupant      Condition
> > sata1/7    disk         connected   unconfigured  unknown
> > usb1/3     usb-storage  connected   configured    ok
> >
> > To bring the SATA drive online it's just a case of running:
> > # cfgadm -c configure sata1/7
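As an aside, that reconnection step is easy to script. This is only a sketch: the bay name sata1/7 comes from your cfgadm output above and will differ per controller, and the guard lets the script degrade gracefully on a box without cfgadm rather than erroring out.

```shell
# Sketch: bring a re-inserted hot-plug SATA drive back online.
# BAY is the attachment point from the cfgadm output above - adjust
# it for your own controller.
BAY="sata1/7"

if command -v cfgadm >/dev/null 2>&1; then
    # Show the bay's current state, then configure the re-inserted disk.
    cfgadm "$BAY"
    cfgadm -c configure "$BAY"
    result="configured"
else
    # Not a Solaris box (or cfgadm missing) - nothing to do.
    result="cfgadm not available"
fi
echo "$result"
```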
> > ***********************************************************************
> > 2. If ZFS is using a hot plug device, disconnecting it will hang all
> > ZFS status tools.
> >
> > While pools remain accessible, any attempt to run "zpool status" will
> > hang. I don't know if there is any way to recover these tools once this
> > happens. While this is a pretty big problem in itself, it also makes me
> > worry whether other types of error could have the same effect. I can
> > see this leaving a server in a state where you know there are errors in
> > a pool, but have no way of finding out what those errors might be
> > without rebooting the server.
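One small mitigation while this bug exists: run status commands under a watchdog so a hung "zpool status" doesn't take your shell with it. GNU timeout(1) probably isn't on snv_91, but a background job plus kill does the same thing in plain sh. A sketch (sleep 30 stands in for the hung zpool status):

```shell
# Run a possibly-hanging command with a watchdog that kills it after
# a deadline, so an interactive shell is not lost.
status_with_watchdog() {
    secs="$1"; shift          # $1 = deadline in seconds, rest = command
    "$@" &
    cmd_pid=$!
    # Watchdog: kill the command if it is still running at the deadline.
    ( sleep "$secs"; kill "$cmd_pid" 2>/dev/null ) &
    wd_pid=$!
    wait "$cmd_pid"
    rc=$?
    kill "$wd_pid" 2>/dev/null   # tidy up the watchdog if command finished
    return "$rc"
}

# Demonstrate with a command that "hangs": killed after 2 seconds,
# and the shell gets control back with a non-zero status.
status_with_watchdog 2 sleep 30
rc=$?
echo "shell still responsive (rc=$rc)"
```

In real use you would call it as `status_with_watchdog 10 zpool status`; a non-zero rc then tells you the tools are wedged without having to find out the hard way.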
> > ***********************************************************************
> > 3. Once the ZFS status tools are hung, the computer will not shut down.
> >
> > The only way I've found to recover from this is to physically power
> > down the server. The Solaris shutdown process simply hangs.
> > ***********************************************************************
> > 4. While reading from an offline disk causes errors, writing does not!
> > *** CAUSES DATA LOSS ***
> >
> > This is a big one: ZFS can continue writing to an unavailable pool. It
> > doesn't always generate errors (I've seen it copy over 100MB before
> > erroring), and if not spotted, this *will* cause data loss after you
> > reboot.
> >
> > I discovered this while testing how ZFS coped with the removal of a
> > hot plug SATA drive. I knew that the ZFS admin tools were hanging, but
> > that redundant pools remained available. I wanted to see whether it
> > was just the ZFS admin tools that were failing, or whether ZFS was
> > also failing to send appropriate error messages back to the OS.
> >
> > These are the tests I carried out:
> >
> > Zpool: a single-drive zpool, consisting of one 250GB SATA drive in a
> > hot plug bay.
> > Test data: a folder tree containing 19,160 items, 71.1MB in total.
> >
> > TEST1: Opened File Browser and copied the test data to the pool. Half
> > way through the copy I pulled the drive. THE COPY COMPLETED WITHOUT
> > ERROR. "zpool list" reports the pool as online; however, "zpool
> > status" hung as expected.
> >
> > Not quite believing the results, I rebooted and tried again.
> >
> > TEST2: Opened File Browser and copied the data to the pool. Pulled the
> > drive half way through. The copy again finished without error.
> > Checking the properties shows 19,160 files in the copy. "zfs list"
> > again shows the filesystem as ONLINE.
> >
> > Now I decided to see how many files I could copy before it errored. I
> > started the copy again. File Browser managed a further 9,171 files
> > before it stopped. That's nearly 30,000 files before any error was
> > detected. Again, despite the copy having finally errored, "zpool list"
> > shows the pool as online, even though "zpool status" hangs.
> >
> > I rebooted the server, and found that after the reboot my first copy
> > contained just 10,952 items, and my second copy was completely
> > missing. That's a loss of almost 20,000 files. "zpool status",
> > however, reports NO ERRORS.
> >
> > For the third test I decided to see whether these files were actually
> > accessible before the reboot:
> >
> > TEST3: This time I pulled the drive *before* starting the copy. The
> > copy started much more slowly this time and only got to 2,939 files
> > before reporting an error. At this point I copied all the files that
> > had been copied to another pool, and then rebooted.
> >
> > After the reboot, the folder in the test pool had disappeared
> > completely, but the copy I took before rebooting was fine and contains
> > 2,938 items, approximately 12MB of data. Again, "zpool status" reports
> > no errors.
> >
> > Further tests revealed that reading from the pool results in an error
> > almost immediately. Writing to the pool appears very inconsistent.
> >
> > This is a huge problem. Data can be written without error, and is
> > still served to users. It is only later on that the server begins to
> > issue errors, but at that point the ZFS admin tools are useless. The
> > only possible recovery is a server reboot, but that loses recent data
> > written to the pool, and does so without any warning at all from ZFS.
> >
> > Needless to say, I have a lot less faith in ZFS's error checking after
> > having seen it lose 30,000 files without error.
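Until this is fixed, the only defence I can see against that kind of silent loss is to verify a copy yourself rather than trusting the copy dialog or "zpool list". A minimal, portable sketch of the idea - compare item counts between source and destination after the copy (the mktemp trees here are stand-ins for the real pool paths, and for the 19,160-item test folder):

```shell
# Sketch: verify a copy by comparing file counts between source and
# destination, instead of trusting that the copy "completed".
SRC=$(mktemp -d)
DST=$(mktemp -d)

# Build a tiny test tree standing in for the real folder.
mkdir -p "$SRC/a/b"
echo one   > "$SRC/a/file1"
echo two   > "$SRC/a/b/file2"
echo three > "$SRC/file3"

# The copy under test (File Browser, cp, rsync, ...).
cp -r "$SRC/." "$DST/"

# Count regular files on both sides and compare.
src_count=$(find "$SRC" -type f | wc -l)
dst_count=$(find "$DST" -type f | wc -l)

if [ "$src_count" -eq "$dst_count" ]; then
    echo "copy verified: $dst_count files"
else
    echo "MISMATCH: $src_count source vs $dst_count destination files"
fi

rm -rf "$SRC" "$DST"
```

Checksumming both trees would be stronger still, but even a bare count would have caught the 19,160-vs-10,952 discrepancy above before the data was gone.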
> > ***********************************************************************
> > 5. If you are using CIFS and pull a drive from the volume, the whole
> > server hangs!
> >
> > This appears to be the original problem I found. While ZFS doesn't
> > handle drive removal well, the combination of ZFS and CIFS is worse.
> > If you pull a drive from a ZFS pool (redundant or not) which is
> > serving CIFS data, the entire server freezes until you re-insert the
> > drive.
> >
> > Note that ZFS itself does not recover after the drive is inserted; the
> > admin tools will still hang. However, the re-insertion of the drive is
> > enough to unfreeze the server.
> >
> > Of course, you still need a physical reboot to get your ZFS admin
> > tools back, but in the meantime data is accessible again.
> >
> > This message posted from opensolaris.org
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
