Mike,

I'm still digging into the "iSCSI Disconnect" issue that we've been dealing 
with for about two years here at SUNY.  The iSCSI initiator package that I'm 
using is:

  iscsi-initiator-utils-6.2.0.872-6.0.2.el5

I've been running dt tests against the EqualLogic in our Oracle VM environment 
with the following debug parameters set:
  
  echo 1 > /sys/module/libiscsi2/parameters/debug_libiscsi
  echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
  echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp

When I look for the parameter you suggested, 
/sys/module/libiscsi/parameters/debug_libiscsi_eh, I don't see it as an 
available option (the module directory here is libiscsi2, not libiscsi).  Are 
the debug lines above correct?  Are there other debug messages that I should 
be gathering?

[root@oim61024001 src]# ls -l /sys/module/libiscsi
libiscsi2/    libiscsi_tcp/ 
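
One way to sidestep the naming question is to list whatever debug parameters 
the loaded iSCSI modules actually expose, rather than hard-coding paths.  A 
minimal sketch, assuming the parameters live under 
/sys/module/<module>/parameters/ and have names starting with "debug_" (true 
for the three above, but the set varies by kernel and package version).  The 
function names are made up for illustration:

```shell
#!/bin/sh
# List every debug_* parameter exposed by modules with "iscsi" in their name.
# The optional argument overrides the sysfs root so the function can be tried
# against a scratch directory; on a real host it defaults to /sys/module.
list_iscsi_debug_params() {
    root="${1:-/sys/module}"
    for p in "$root"/*iscsi*/parameters/debug_*; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -e "$p" ] && printf '%s\n' "$p"
    done
    return 0
}

# Write 1 to each discovered parameter (needs root on a real host).
enable_iscsi_debug() {
    list_iscsi_debug_params "$1" | while IFS= read -r p; do
        echo 1 > "$p" && echo "enabled: $p"
    done
}
```

Running list_iscsi_debug_params with no argument shows which parameters exist 
on the host; enable_iscsi_debug then turns on only those.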


We have run numerous tests on different hardware.  I have eliminated OVM as a 
factor and am now testing this on OEL 5.6.  

The scenario that seems to break connections quickly is the following:

- cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and 
dm-multipath out shortly)
- 5-7 iSCSI volumes connected to each of those 5 nodes
- 4 threads of dt running against each of the 5-7 volumes from each host = up 
to 28 threads of dt slamming the volumes per host = up to 140 threads of dt 
per cluster.
- test only lasts about 2-5 minutes before I start seeing ping timeouts and 
disconnects.
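
For reference, the load figures above work out as follows.  A small sketch of 
the arithmetic only (the function name is hypothetical; 4 threads per volume 
and 5 nodes are the numbers from the list):

```shell
#!/bin/sh
# Compute dt thread counts for a given number of volumes per host.
dt_load() {
    volumes="$1"
    threads_per_volume=4
    nodes=5
    per_host=$((threads_per_volume * volumes))
    per_cluster=$((per_host * nodes))
    echo "volumes=$volumes per_host=$per_host per_cluster=$per_cluster"
}

dt_load 5   # low end of the 5-7 volume range
dt_load 7   # high end of the range
```

So the cluster-wide load ranges from 100 to 140 dt threads depending on how 
many volumes each host has connected.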

The issue that we've seen thus far is mainly with EqualLogic and open-iscsi.   
From what EQL is telling us, the initiator is "aborting" the connection.  But 
from the initiator-side, we just see "ping timeout" messages (and then the 
connection eventually goes away).   

We recently saw a thread (Apr 4) regarding the cfq scheduler.  So we quickly 
tested noop and deadline, just to see if that would change anything.  It 
didn't.

So my most recent test was to try out a different target (tgtd), just to see 
if we could rule out the EqualLogic.  Each time I switched between EQLX and 
tgtd, I reset "FastAbort" in iscsid.conf (and rescanned my volumes): "No" for 
EQLX, to conform with EQLX's best practices, or "Yes" when testing tgtd.
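
To double-check which value is active before each run, something like the 
following could be used.  This is a sketch, assuming the setting appears in 
iscsid.conf under its usual open-iscsi name, node.session.iscsi.FastAbort, 
and that the file is at /etc/iscsi/iscsid.conf; the function name is 
hypothetical:

```shell
#!/bin/sh
# Print the active (uncommented) FastAbort value from an iscsid.conf-style
# file.  If the setting appears more than once, the last occurrence is shown.
get_fast_abort() {
    conf="${1:-/etc/iscsi/iscsid.conf}"
    sed -n 's/^[[:space:]]*node\.session\.iscsi\.FastAbort[[:space:]]*=[[:space:]]*\([A-Za-z]*\).*$/\1/p' "$conf" | tail -n 1
}

# iscsid.conf only seeds node records created by later discoveries.  Records
# that already exist keep their old value unless they are updated, e.g.
# (placeholders in angle brackets, not run here):
#   iscsiadm -m node -T <target-iqn> -p <portal> -o update \
#       -n node.session.iscsi.FastAbort -v No
```

Calling get_fast_abort with no argument prints "Yes" or "No" for the default 
config file, or nothing if the line is commented out.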

So at this point, after I get dm-multipath and OCFS2 out of the equation, it 
will be down to a target + kernel/initiator + I/O scheduler, and I want to 
make sure that I'm getting all the debug information that I might need to 
analyze what is going on.

Are there any other debug tunables that you might recommend adding to my script?


On Apr 14, 2011, at 12:02 AM, Mike Christie wrote:

> On 04/12/2011 12:43 PM, Joe Hoot wrote:
>> I'm trying to understand the following messages:
>> 
>>   Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed 
>> 0x94d80c4 1
>>   .... lots of the above messages...
>>   Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0
>>   .... lots of the above messages...
>>   Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e
>>   .... lots of the above messages...
>>   Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed 
>> 0xa9ab8ec 1
>> 
>> Does that typically mean that the target has closed the connection or the 
>> initiator?
>> 
> 
> Might be best to ask the tgtd list, but I think due to the abort 
> messages it is probably the initiator if you are using open-iscsi for 
> the initiator and if you see the abort messages before conn_close ones.
> 
> If you see abort ones first, then see conn close ones, then probably a 
> scsi command is timing out. This causes the scsi layer to have the 
> initiator abort the command. If the abort fails, the initiator could try 
> a lun reset or target reset. If we cannot reset or abort the problem 
> away we drop the connection/session.
> 
> On the initiator if you did
> 
> echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
> 
> you would see a "wait for relogin" message in /var/log/messages then a 
> "session reset succeeded" or "failing session reset: Could not log back 
> into" message. The wait for relogin would match the conn close messages 
> on the target. Then the success or failed messages indicate whether we 
> were able to relogin.
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/open-iscsi?hl=en.
> 
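
The ordering check described in the quoted reply (aborts first, then 
conn_close) and the initiator-side messages can both be pulled out of syslog 
with a couple of small helpers.  A rough sketch, assuming standard syslog 
timestamps and the message strings exactly as quoted in this thread (the 
wording may differ between versions); the function names are hypothetical:

```shell
#!/bin/sh
# Target side: collapse tgtd abort and conn_close messages into runs, in log
# order.  Each output line is: <count> <month> <day> <time of first> <KIND>.
tgtd_event_order() {
    log="${1:-/var/log/messages}"
    awk '/tgtd: (abort_task_set|abort_cmd)\(/ { print $1, $2, $3, "ABORT" }
         /tgtd: conn_close\(/                   { print $1, $2, $3, "CLOSE" }' "$log" |
        uniq -c -f 3    # skip the 3 timestamp fields, count consecutive kinds
}

# Initiator side: show the ping timeout and error-handler relogin messages
# mentioned in this thread, in log order.
iscsi_relogin_events() {
    log="${1:-/var/log/messages}"
    grep -E 'ping timeout|wait for relogin|session reset succeeded|failing session reset' "$log"
}
```

An ABORT run immediately followed by a CLOSE run on the target, lining up 
with a "wait for relogin" on the initiator, would fit the command-timeout 
explanation; CLOSE runs with no preceding ABORT would point elsewhere.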
