We're currently evaluating ZFS prior to (hopefully) rolling it out across our 
server room, and we've managed to lock up a server by connecting to an iSCSI 
target and then changing the target's IP address.

Basically we have two test Solaris servers running. I followed the 
instructions in the post below to share a ZFS volume on Server1 via the iSCSI 
target, and then use that volume as the device for a new zpool on Server2:
http://blogs.sun.com/chrisg/date/20070418
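
For reference, this is roughly the sequence I used, typed from memory (the 
pool, volume, device and address names below are just placeholders, not what's 
actually on the boxes):

    # On Server1 (the target side): create a ZFS volume and share it over iSCSI
    zfs create -V 10g tank/iscsivol
    zfs set shareiscsi=on tank/iscsivol
    iscsitadm list target              # note the target's IQN

    # On Server2 (the initiator side): discover the target and build a pool on it
    iscsiadm add discovery-address 192.168.1.10:3260
    iscsiadm modify discovery --sendtargets enable
    devfsadm -i iscsi                  # make the new disk device appear
    zpool create testpool c2t<long-iscsi-name>d0   # whatever device "format" shows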

Everything appeared to work fine until I moved the servers to a new network 
(while powered on), which changed their IP addresses.  The server running the 
iSCSI target is still fine: it has its new IP address, and from another machine 
I can see that the iSCSI target is still visible.

However, Server2 was not as happy with the move.  As far as I can tell, all ZFS 
commands locked up on it.  I couldn't run "zfs list", "zpool list", "zpool 
status" or "zpool iostat".  Every single one hung, and I couldn't find any way 
to stop them.  Now, I've seen a few posts about ZFS commands locking up, but 
this is very concerning for something we're considering using in a production 
system.

Anyway, with Server2 well and truly locked up, I restarted it, hoping that 
would clear the problem (figuring ZFS would simply mark the device as offline), 
but found that the server can't even boot.  For the past hour it has simply 
spammed the following message to the screen:

"NOTICE: iscsi connection(27) unable to connecct to target 
iqn.1986-03.com.sun:02:3d882af1-91cc-6d9e-9f19-edfa095fca6d"

Now that I wasn't expecting.  This volume isn't a boot volume, the server 
doesn't need either ZFS or iSCSI to boot, and I don't think I'd even saved any 
data on that drive.  I have found a post mentioning a similar message, which 
describes a ten-minute boot delay with a working iSCSI volume, but I can't 
find anything that says what happens if the iSCSI volume is no longer 
there:
http://forum.java.sun.com/thread.jspa?threadID=5243777&messageID=10004717
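
The only thing I can think of trying next (I haven't dared yet, and I'm not 
certain I have the service name right, so treat this as a guess) is to boot 
without starting any services, disable the iSCSI initiator, and then let the 
rest of the boot continue:

    # At the boot prompt (or on the GRUB kernel line on x86), boot to the "none" milestone
    boot -m milestone=none

    # Log in as root on the console, then:
    svcadm disable network/iscsi_initiator   # or iscsi/initiator, depending on the build
    svcadm milestone all                     # carry on bringing up the remaining services

    # With the initiator down, hopefully the pool can be dealt with, e.g.
    zpool export -f testpool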

So, I have quite a few questions:

1.  Does anybody know how I can recover from this, or am I going to have to 
wipe my test server and start again?

2.  How vulnerable are the ZFS admin tools to locking up like this?

3.  How vulnerable is the iSCSI client to locking up like this during boot?

4.  Is there any way we can disconnect the iSCSI share while ZFS is locked up 
like this?  What could I have tried to regain control of my server before 
rebooting?

5.  If I can get the server booted, is there any way to redirect an iSCSI 
volume so it's pointing at the new IP address?  (I was expecting to simply do a 
"zpool replace" when ZFS reported the drive as missing).

And finally, does anybody know why "zpool status" should lock up like this?  
I'm really not happy that the ZFS admin tools seem so fragile.  At the very 
least I would have expected "zpool status" to be able to list the devices 
attached to the pools and report that they are timing out or erroring, and for 
me to be able to use the other ZFS tools to forcibly remove failed drives as 
needed.  Anything less means I'm risking my whole system should ZFS find 
something it doesn't like.

I admit I'm a Solaris newbie, but surely something designed as a robust 
filesystem also needs robust management tools?
 
 