On 04/14/2011 05:12 AM, Joe Hoot wrote:
Mike,

I'm still digging into this "iSCSI Disconnect" issue that we've been dealing 
with for about two years here in SUNY.  The iscsi-initiator that I'm using is this:

   iscsi-initiator-utils-6.2.0.872-6.0.2.el5

I've been running dt tests to the equallogic in our Oracle VM environment with 
the following set:

   echo 1 > /sys/module/libiscsi2/parameters/debug_libiscsi
   echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
   echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp

When I look for /sys/module/libiscsi/parameters/debug_libiscsi_eh, I don't see that as an 
available option.  Are the debug lines above correct?  Are there other debug messages 
that I should be gathering?


debug_libiscsi_eh is only in newer kernels. debug_libiscsi will print out what debug_libiscsi_eh does plus more.
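
Since the available debug parameters depend on the kernel and on the module names (libiscsi2 rather than libiscsi on this box), a script can simply enable whichever files exist. A minimal sketch, assuming the parameter paths mentioned in this thread; the function name and the optional sysfs-root argument are illustrative and not part of open-iscsi:

```shell
#!/bin/sh
# Enable every iSCSI debug parameter that exists on this kernel.
# $1 is the sysfs module root (defaults to /sys/module); the argument is
# there only so the function can be tried against a scratch directory.
enable_iscsi_debug() {
    root="${1:-/sys/module}"
    for param in \
        libiscsi/parameters/debug_libiscsi \
        libiscsi/parameters/debug_libiscsi_eh \
        libiscsi2/parameters/debug_libiscsi \
        libiscsi_tcp/parameters/debug_libiscsi_tcp \
        iscsi_tcp/parameters/debug_iscsi_tcp
    do
        if [ -w "$root/$param" ]; then
            # Note the space before ">": "echo 1> file" writes an empty line.
            echo 1 > "$root/$param"
            echo "enabled $param"
        else
            echo "skipped $param (not present)"
        fi
    done
}
```

Calling enable_iscsi_debug as root with no argument acts on /sys/module and reports which parameters were skipped.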


[root@oim61024001 src]# ls -l /sys/module/libiscsi
libiscsi2/    libiscsi_tcp/


We have done numerous tests on different hardware.  I have eliminated OVM and 
have been testing this now on OEL5.6.

The scenario that seems to break connections quickly is the following:

- cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and 
dm-multipath out shortly)
- 5-7 iSCSI volumes connected to each of those 5 nodes
- 4 threads of dt running against each of 5-7 volumes from each host = up to 28 
threads of dt slamming the volumes per host = up to 140 threads of dt per cluster.
- test only lasts about 2-5 minutes before I start seeing ping timeouts and 
disconnects.
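
As a quick check of the thread counts in the list above: the 28 and 140 figures are the 7-volume case, and the 5-volume case is lighter. A small sketch of the arithmetic (the function name is illustrative):

```shell
#!/bin/sh
# Threads of dt per host and per cluster for a given number of volumes.
# 4 threads per volume and 5 nodes are taken from the scenario above.
dt_thread_counts() {
    volumes="$1"
    threads_per_volume=4
    nodes=5
    per_host=$((threads_per_volume * volumes))
    per_cluster=$((per_host * nodes))
    echo "$volumes volumes: $per_host per host, $per_cluster per cluster"
}
```

dt_thread_counts 5 gives 20 per host and 100 per cluster; dt_thread_counts 7 gives 28 per host and 140 per cluster.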

The issue that we've seen thus far is mainly with EqualLogic and open-iscsi.   From what EQL is 
telling us, the initiator is "aborting" the connection.  But from the initiator-side, we 
just see "ping timeout" messages (and then the connection eventually goes away).


When the ping times out, the initiator will abort/drop the connection. It will then try to relogin. So you should probably be seeing the ping/nop timeout message, then a conn error 1011, then a message about reconnected in X retries, right?
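
One way to check for that sequence is to pull the relevant lines out of the initiator log in order. A rough sketch; the function name is made up, and the patterns are guesses at the message wording, which varies between kernel and iscsid versions, so adjust them to what your logs actually print:

```shell
#!/bin/sh
# Print, with line numbers and in log order, lines that look like part of
# the ping timeout -> conn error -> relogin sequence.
# The patterns are assumptions about message wording, not exact strings.
iscsi_recovery_events() {
    logfile="${1:-/var/log/messages}"
    grep -n -i -E 'ping timeout|nop.*timeout|conn error|recovery|reconnect' "$logfile"
}
```

If only SCSI command timeouts show up and none of these lines do, that points away from the nop/ping path.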


The thing is that below with tgtd we seem to be seeing scsi commands timing out. In the initiator logs did you see ping/nop timeout messages when you hit the problem below with tgtd?

With tgtd let's take the disks out of it and just test iscsi.

So do something like this:

tgtadm --op new --mode target --tid 1 -T iqn.2001-04.com.pertest
tgtadm --op new --mode logicalunit --tid 1 --lun 1 --bstype null -b /dev/null
tgtadm --op bind --mode target --tid 1 -I ALL

This of course will not do any real IO so you cannot do tests that verify data. Just do really harsh heavy read/write tests.
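
On the initiator side, the null-backed target then needs to be discovered and logged in to before running the IO tests. A sketch using the standard iscsiadm discovery and login modes; the portal address is a placeholder you must supply, and the run wrapper, DRY_RUN variable and function name are illustrative helpers, not part of open-iscsi:

```shell
#!/bin/sh
# With DRY_RUN set, print each command instead of executing it.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Discover and log in to the test target created with tgtadm above.
# $1 is the target portal (IP or IP:port) of the tgtd host.
login_null_target() {
    portal="$1"
    target="iqn.2001-04.com.pertest"   # target name from the tgtadm commands
    run iscsiadm -m discovery -t sendtargets -p "$portal"
    run iscsiadm -m node -T "$target" -p "$portal" --login
}
```

Running it once with DRY_RUN=1 shows the exact commands before anything touches the initiator's node database.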


When you do IO to fake disks do you see any ping/nop or scsi timeout errors?


We recently saw a thread (Apr 4) regarding cfq scheduler.  So we quickly tested 
noop and deadline, just to see if that would change anything-- it didn't.
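
For reference, switching the scheduler per device goes through /sys/block/<dev>/queue/scheduler. A minimal sketch of applying one scheduler to several iSCSI disks; the function name and the SYSFS_BLOCK override are illustrative, and the device names in the usage line are placeholders:

```shell
#!/bin/sh
# Set the I/O scheduler on each named block device.
# Usage: set_io_scheduler <scheduler> <dev> [<dev> ...]
# SYSFS_BLOCK can point at a scratch directory for a trial run.
set_io_scheduler() {
    sched="$1"
    shift
    root="${SYSFS_BLOCK:-/sys/block}"
    for dev in "$@"; do
        f="$root/$dev/queue/scheduler"
        if [ -w "$f" ]; then
            echo "$sched" > "$f"
            # Reading the file back shows what the kernel accepted.
            echo "$dev: $(cat "$f")"
        else
            echo "$dev: no writable scheduler file at $f" >&2
        fi
    done
}
```

For example, set_io_scheduler deadline sdb sdc as root, where sdb and sdc stand in for the iSCSI disks. With dm-multipath in the stack the scheduler applies to the underlying sd devices.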

So my most recent test was to try out a different target, just to see if we could rule 
out the EqualLogic.  Each time I changed between EQLX and tgtd, I would reset the 
"FastAbort" setting in iscsid.conf (and rescan my volumes): "No" for EQLX, to conform 
with EQLX's best practices, or "Yes" when testing tgtd.


The abort setting is not going to help at all for any target. That basically kicks in after you have already hit the problem you are hitting. It just speeds up the handling of the problem.


So at this point, after I get dm-multipath and OCFS2 out of the equation, it 
will be down to a target + kernel/initiator + I/O scheduler and I want to make 
sure that I'm getting all the debug information that I might need to analyze 
what is going on.

Are there any other debug tunables that you might recommend adding to my script?

No, the ones you have above are all there is. Maybe tcpdump -w iscsi.out -i ethXYZ
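
A slightly fuller version of that capture, as a sketch: it assumes the target listens on the default iSCSI port 3260 and uses -s 0 so that whole packets are saved rather than truncated headers. The function name and the TCPDUMP override are illustrative:

```shell
#!/bin/sh
# Capture iSCSI traffic on one interface into a pcap file.
# Usage: capture_iscsi <interface> [<outfile>] [<port>]
# TCPDUMP can be overridden to point at a different binary.
capture_iscsi() {
    iface="$1"
    outfile="${2:-iscsi.out}"
    port="${3:-3260}"   # default iSCSI port; change if the portal differs
    "${TCPDUMP:-tcpdump}" -i "$iface" -s 0 -w "$outfile" port "$port"
}
```

Start it just before the dt run and stop it with Ctrl-C once the timeouts appear, since a full-packet capture under this load grows quickly.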

Did you have the initiator /var/log/messages with debugging on when you hit the problem below? If so send them.


On Apr 14, 2011, at 12:02 AM, Mike Christie wrote:

On 04/12/2011 12:43 PM, Joe Hoot wrote:
I'm trying to understand the following messages:

   Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 0x94d80c4 1
   .... lots of the above messages...
   Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0
   .... lots of the above messages...
   Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e
   .... lots of the above messages...
   Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 0xa9ab8ec 1

