Ok, I've done some more testing today and I almost don't know where to start.
I'll begin with the good news for Miles :)

- Rebooting doesn't appear to cause ZFS to lose the resilver status (but see 1. below).
- Resilvering appears to work fine. Once it completed, I never saw any checksum errors when scrubbing the pool.
- Reconnecting iSCSI drives causes ZFS to automatically online the pool and automatically begin resilvering.

And now the bad news:

1. While rebooting doesn't seem to cause the resilver to lose its status, something's causing it problems. I saw it restart several times.
2. With iSCSI, you can't reboot with sendtargets enabled; static discovery still seems to be the order of the day.
3. There appears to be a disconnect between what iscsiadm knows and what ZFS knows about the status of the devices.

And I have confirmation of some of my earlier findings too:

4. iSCSI still has a 3-minute timeout, during which time your pool will hang, no matter how many redundant drives you have available.
5. zpool status can still hang when a device goes offline, and when it finally recovers, it will then report out-of-date information. This could be Bug 6667199, but I've not seen anybody reporting the incorrect information part of this.
6. After one drive goes offline, during the resilver process, zpool status shows that information is being resilvered on the good drives. Does anybody know why this happens?
7. Although ZFS will automatically online a pool when iSCSI devices come online, CIFS shares are not automatically remounted.

I also have a few extra notes about a couple of those:

1 - resilver losing status
==========================

Regarding the resilver restarting, I've seen it reported that "zpool status" can cause this when run as admin, but I'm not convinced that's the cause. Same for the rebooting problem. I was able to run "zpool status" dozens of times as an admin, but only two or three times did I see the resilver restart.
Also, after rebooting, I could see that the resilver was showing that it was 66% complete, but then a second later it restarted.

Now, none of this is conclusive. I really need to test with a much larger dataset to get an idea of what's really going on, but there's definitely something weird happening here.

3 - disconnect between iscsiadm and ZFS
=======================================

I repeated my test of offlining an iSCSI target, this time checking iscsiadm to see when it disconnected. What I did was wait until iscsiadm reported 0 connections to the target, and then I started a CIFS file copy and ran "zpool status".

Zpool status hung as expected, and a minute or so later, the CIFS copy failed. It seems that although iscsiadm was aware that the target was offline, ZFS did not yet know about it. As expected, a minute or so later, zpool status completed (returning incorrect results), and I could then run the CIFS copy fine.

5 - zpool status hanging and reporting incorrect information
============================================================

When an iSCSI device goes offline, if you immediately run zpool status, it hangs for 3-4 minutes. Also, when it finally completes, it gives incorrect information, reporting all the devices as online.

If you immediately re-run zpool status, it completes rapidly and will now correctly show the offline devices.
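For anybody who wants to put numbers on the behaviour in note 5, here is a rough sketch of how the two back-to-back runs could be timed and compared. It is only an illustration, not something I ran for the results above: the helper names `timed_run` and `compare_runs` are made up, and the 10-minute default timeout is an arbitrary assumption chosen to be longer than the 3-4 minute hang. It needs Python 3.7 or later.

```python
import subprocess
import time


def timed_run(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, stdout), with stdout None on timeout.

    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        output = result.stdout
    except subprocess.TimeoutExpired:
        # The command did not finish within timeout_s seconds.
        output = None
    return time.monotonic() - start, output


def compare_runs(cmd, timeout_s=600):
    """Run cmd twice in a row; report both durations and whether output differed.

    The 600-second default is an assumption, not a measured value.
    """
    first_elapsed, first_out = timed_run(cmd, timeout_s)
    second_elapsed, second_out = timed_run(cmd, timeout_s)
    return {
        "first_elapsed": first_elapsed,
        "second_elapsed": second_elapsed,
        "output_changed": first_out != second_out,
    }
```

Calling `compare_runs(["zpool", "status"])` right after taking a target offline should, if the behaviour above is reproducible, show a long first run, a short second run, and `output_changed` set to true. One caveat: if the command is stuck in the kernel, the timeout may not be able to kill it, so the script itself could hang as well.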