Ok, I've done some more testing today and I almost don't know where to start.

I'll begin with the good news for Miles :)
- Rebooting doesn't appear to cause ZFS to lose the resilver status (but see 
1. below)
- Resilvering appears to work fine; once it completed I never saw any checksum 
errors when scrubbing the pool.
- Reconnecting iscsi drives causes ZFS to automatically online the pool and 
automatically begin resilvering.

And now the bad news:
1.  While rebooting doesn't seem to cause the resilver to lose its status, 
something's causing it problems.  I saw it restart several times.
2.  With iscsi, you can't reboot with sendtargets enabled, static discovery 
still seems to be the order of the day.
3.  There appears to be a disconnect between what iscsiadm knows and what ZFS 
knows about the status of the devices.  
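In case it helps anyone, here's roughly what I mean by static discovery in 
item 2.  This is just a sketch of the Solaris iscsiadm commands involved; the 
target IQN and portal address below are placeholders, not my real ones:

```shell
# Turn off sendtargets discovery so the initiator doesn't probe at boot
iscsiadm modify discovery --sendtargets disable

# Add each target statically (IQN and portal address are placeholders)
iscsiadm add static-config iqn.1986-03.com.sun:02:example-target,192.168.1.10:3260

# Turn on static discovery
iscsiadm modify discovery --static enable

# Check which discovery methods are now enabled
iscsiadm list discovery
```

With that in place the box reboots cleanly for me, whereas with sendtargets 
enabled it doesn't.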

And I have confirmation of some of my earlier findings too:
4.  iSCSI still has a 3 minute timeout, during which time your pool will hang, 
no matter how many redundant drives you have available.
5.  zpool status can still hang when a device goes offline, and when it 
finally recovers, it will then report out-of-date information.  This could be 
Bug 6667199, but I've not seen anybody reporting the incorrect-information part 
of this.
6.  After one drive goes offline, zpool status shows during the resilver 
process that data is being resilvered on the good drives too.  Does anybody 
know why this happens?
7.  Although ZFS will automatically online a pool when iscsi devices come 
online, CIFS shares are not automatically remounted.
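On item 7, manually re-sharing once the pool is back online should be enough. 
Assuming the filesystems have their sharesmb property set, something along 
these lines is what I have in mind — tank/cifs is just a placeholder name:

```shell
# Re-share every filesystem that has a share property set (sharesmb/sharenfs)
zfs share -a

# Or re-share a single filesystem by name (tank/cifs is a placeholder)
zfs share tank/cifs
```

It would obviously be nicer if ZFS did this itself when it onlines the pool.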

I also have a few extra notes about a couple of those:

1 - resilver losing status
===============
Regarding the resilver restarting, I've seen it reported that "zpool status" 
can cause this when run as admin, but I'm not convinced that's the cause.  Same 
for the rebooting problem.  I was able to run "zpool status" dozens of times as 
an admin, but only two or three times did I see the resilver restart.

Also, after rebooting, I could see that the resilver was showing that it was 
66% complete, but then a second later it restarted.

Now, none of this is conclusive.  I really need to test with a much larger 
dataset to get an idea of what's really going on, but there's definitely 
something weird happening here.

3 - disconnect between iscsiadm and ZFS
=========================
I repeated my test of offlining an iscsi target, this time checking iscsiadm to 
see when it disconnected. 

What I did was wait until iscsiadm reported 0 connections to the target, and 
then started a CIFS file copy and ran "zpool status".

Zpool status hung as expected, and a minute or so later, the CIFS copy failed.  
It seems that although iscsiadm was aware that the target was offline, ZFS did 
not yet know about it.  As expected, a minute or so later, zpool status 
completed (returning incorrect results), and I could then run the CIFS copy 
fine.
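For anyone wanting to reproduce this, the sequence I used was roughly the 
following.  The pool name and the "Connections" line I grep for are from my 
setup, so treat them as placeholders:

```shell
# Poll until iscsiadm reports the target has dropped its connection
while iscsiadm list target | grep -q 'Connections: 1'; do
        sleep 5
done

# iscsiadm now shows 0 connections, but ZFS hasn't noticed yet:
zpool status tank    # hangs for a minute or so, then reports stale state
# ...start the CIFS file copy from the client at this point...
```

The interesting part is that window between iscsiadm seeing the disconnect 
and ZFS reacting to it.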

5 - zpool status hanging and reporting incorrect information
===================================
When an iSCSI device goes offline, if you immediately run zpool status, it 
hangs for 3-4 minutes.  Also, when it finally completes, it gives incorrect 
information, reporting all the devices as online.

If you immediately re-run zpool status, it completes rapidly and will now 
correctly show the offline devices.
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
