Hey folks,

I'm currently testing a ZFS pool built on a pair of iSCSI targets, and I run 
into problems when I disconnect either target to simulate a failure.  The whole 
system hangs for almost exactly 3 minutes whenever I do this, even though I'm 
running a mirrored zpool and the other half of the mirror is fine.  

The question I have is whether this is a ZFS bug, an iSCSI initiator bug, or 
both.  One of the guys in the storage thread managed to find the code quoted 
below in the iSCSI initiator.  It seems to imply that the initiator will start 
returning errors after 20 seconds, but will only return a fatal error after 180.  
However, I would have expected ZFS to time out a device on a mirrored volume 
much more quickly than this, regardless of what the driver is doing.

The questions I have are:

- Is there any way to get ZFS to time out a non-responding iSCSI device faster 
than 180 seconds, regardless of what the iSCSI initiator does?
- If not, is this something suitable for an RFE for ZFS?  Should ZFS place this 
much faith in drivers, or should it be more proactive in managing problems on 
redundant volumes?
- Do I need to raise an RFE against the iSCSI initiator asking for these values 
to be made configurable?

The original thread where I discussed this can be found here:
http://www.opensolaris.org/jive/thread.jspa?threadID=51981

And this is the code fragment from the iSCSI initiator:
/*
 * NOP delay is used to send a iSCSI NOP (ie. ping) across the
 * wire to see if the target is still alive. NOPs are only
 * sent when the RX thread hasn't received anything for the
 * below amount of time.
 */
#define ISCSI_DEFAULT_NOP_DELAY 5 /* seconds */
extern int iscsi_nop_delay;
/*
 * If we haven't received anything in a specified period of time
 * we will stop accepting IO via tran start. This will enable
 * upper level drivers to see we might be having a problem and
 * in the case of scsi_vhci will start to route IO down a better
 * path.
 */
#define ISCSI_DEFAULT_RX_WINDOW 20 /* seconds */
extern int iscsi_rx_window;
/*
 * If we haven't received anything in a specified period of time
 * we will stop accepting IO via tran start. This the max limit
 * when encountered we will start returning a fatal error.
 */
#define ISCSI_DEFAULT_RX_MAX_WINDOW 180 /* seconds */
extern int iscsi_rx_max_window;
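
To make my reading of those three constants concrete, here is a rough sketch of the timeline I think they describe. This is only my interpretation of the comments, not the initiator's actual logic: the enum, the function names rx_classify and rx_state_name, and the exact boundary comparisons (>= rather than >) are all made up for illustration. Only the three default values come from the fragment above.

```c
#include <stdio.h>

/* Default values copied from the initiator fragment quoted above. */
#define ISCSI_DEFAULT_NOP_DELAY       5  /* seconds */
#define ISCSI_DEFAULT_RX_WINDOW      20  /* seconds */
#define ISCSI_DEFAULT_RX_MAX_WINDOW 180  /* seconds */

/* Hypothetical states, named for this illustration only. */
enum rx_state {
    RX_OK,          /* traffic seen recently, nothing to do        */
    RX_NOP_PROBE,   /* quiet long enough that NOPs are being sent  */
    RX_REJECT_IO,   /* new I/O is refused, but not yet fatally     */
    RX_FATAL        /* max window exceeded, fatal errors returned  */
};

/*
 * Classify a connection by the number of seconds since the RX thread
 * last received anything. The thresholds follow the comments in the
 * fragment; whether each boundary is inclusive is an assumption.
 */
enum rx_state rx_classify(int idle_secs)
{
    if (idle_secs >= ISCSI_DEFAULT_RX_MAX_WINDOW)
        return RX_FATAL;
    if (idle_secs >= ISCSI_DEFAULT_RX_WINDOW)
        return RX_REJECT_IO;
    if (idle_secs >= ISCSI_DEFAULT_NOP_DELAY)
        return RX_NOP_PROBE;
    return RX_OK;
}

/* Human-readable label for a state, for printing a timeline. */
const char *rx_state_name(enum rx_state s)
{
    switch (s) {
    case RX_OK:        return "ok";
    case RX_NOP_PROBE: return "sending NOPs";
    case RX_REJECT_IO: return "rejecting I/O (non-fatal)";
    case RX_FATAL:     return "fatal error";
    }
    return "unknown";
}
```

Calling rx_state_name(rx_classify(t)) for t from 0 to 180 prints the timeline. If this model is roughly right, a target that goes silent sits in the non-fatal "rejecting I/O" state from 20 seconds until 180 seconds, which would be consistent with the 3 minute hang I'm seeing.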
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
