Hey folks, I'm currently testing ZFS on a mirrored pair of iSCSI targets, and I'm having problems when I disconnect either target to simulate a failure. The whole system hangs for almost exactly 3 minutes whenever I do this, even though I'm running a mirrored zpool and the other half of the mirror is fine.
The question I have is whether this is a ZFS bug, an iSCSI initiator bug, or both. One of the guys in the storage thread managed to find the code below for me in the iSCSI initiator. It seems to imply that the initiator will start returning errors after 20 seconds, but only return a fatal error after 180. However, I would have expected ZFS to time out a device on a mirrored volume much quicker than this, regardless of what the driver is doing.

The questions I have are:

- Is there any way to get ZFS to time out a non-responding iSCSI device faster than 180 seconds, regardless of what the iSCSI initiator does?
- If not, is this something suitable for an RFE for ZFS? Should ZFS place this much faith in drivers, or should it be more proactive in managing problems on redundant volumes?
- Do I need to raise an RFE against the iSCSI initiator asking if these values can be made configurable?

The original thread where I discussed this can be found here:
http://www.opensolaris.org/jive/thread.jspa?threadID=51981

And this is the code fragment from the iSCSI initiator:

/*
 * NOP delay is used to send a iSCSI NOP (ie. ping) across the
 * wire to see if the target is still alive.  NOPs are only
 * sent when the RX thread hasn't received anything for the
 * below amount of time.
 */
#define ISCSI_DEFAULT_NOP_DELAY         5 /* seconds */
extern int iscsi_nop_delay;

/*
 * If we haven't received anything in a specified period of time
 * we will stop accepting IO via tran start.  This will enable
 * upper level drivers to see we might be having a problem and
 * in the case of scsi_vhci will start to route IO down a better
 * path.
 */
#define ISCSI_DEFAULT_RX_WINDOW         20 /* seconds */
extern int iscsi_rx_window;

/*
 * If we haven't received anything in a specified period of time
 * we will stop accepting IO via tran start.  This the max limit
 * when encountered we will start returning a fatal error.
 */
#define ISCSI_DEFAULT_RX_MAX_WINDOW     180 /* seconds */
extern int iscsi_rx_max_window;
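
To make my reading of those comments concrete, here is a rough sketch of the timeline I think they describe. This is not code from the initiator: the enum, its state names, and the function rx_state_for_idle() are made up for illustration, and whether the real driver uses >= or > at each boundary is an assumption on my part. Only the three default values come from the fragment above.

```c
/* Illustrative model only; NOT taken from the iSCSI initiator source. */
#define ISCSI_DEFAULT_NOP_DELAY       5   /* seconds */
#define ISCSI_DEFAULT_RX_WINDOW       20  /* seconds */
#define ISCSI_DEFAULT_RX_MAX_WINDOW   180 /* seconds */

/* Hypothetical state names, invented for this sketch. */
enum rx_state {
    RX_OK,         /* something was received recently */
    RX_NOP_PROBE,  /* idle long enough that a NOP (ping) would be sent */
    RX_REJECT_IO,  /* past the rx window: new IO refused via tran start */
    RX_FATAL       /* past the max window: fatal error returned */
};

/*
 * Map "seconds since the RX thread last received anything" to the
 * behaviour the comments seem to describe. The boundary comparisons
 * (>=) are an assumption.
 */
enum rx_state rx_state_for_idle(int idle_secs)
{
    if (idle_secs >= ISCSI_DEFAULT_RX_MAX_WINDOW)
        return RX_FATAL;
    if (idle_secs >= ISCSI_DEFAULT_RX_WINDOW)
        return RX_REJECT_IO;
    if (idle_secs >= ISCSI_DEFAULT_NOP_DELAY)
        return RX_NOP_PROBE;
    return RX_OK;
}
```

If that reading is right, then between 20 and 180 seconds the initiator is already refusing new IO but has not yet reported the device as failed, which would match the roughly 3 minute hang I see before ZFS carries on with the surviving half of the mirror.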