Here's what I am seeing (roughly... with lines scraped in between.. and I'm only grepping for connect7/session7):
Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2279563 Apr 11 13:22:24 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Sending nopout as ping Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_check_transport_timeouts Setting next tmo 2280813 Apr 11 13:22:25 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit ...lots of the above types of messages... Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:30 oim61024009 kernel: connection7:0: ping timeout of 5 secs expired, recv timeout 5, last rx 2278313, last ping 2279563, now 2280813 Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 2592 0 size 4096 xmit Apr 11 13:22:30 oim61024009 kernel: connection7:0: detected conn error (1011) Apr 11 13:22:30 oim61024009 kernel: session7: iscsi_data_xmit Tx suspended! Apr 11 13:22:30 oim61024009 kernel: session7: iscsi_start_session_recovery blocking session Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing pending sc df493a80 itt 0x108 ...etc..etc.. Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing pending sc d81de300 itt 0x159 Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc d89b6800 itt 0xd8 Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc ca305ac0 itt 0xd9 Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc d81de440 itt 0xdb Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc ca305700 itt 0xd7 ...pattern continues.. Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing requeued sc d847ee40 itt 0xfa Apr 11 13:22:30 oim61024009 kernel: connection7:0: iscsi_tcp_cleanup_task pending r2t dropped Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc c9ac8840 itt 0xfb Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc d1e04800 itt 0xfc Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc df0cdc00 itt 0xfd ...pattern continues... Apr 11 13:22:30 oim61024009 kernel: session7: fail_all_commands failing in progress sc d8811700 itt 0x107 Apr 11 13:22:30 oim61024009 kernel: session7: flush_control_queues flushing pending mgmt task itt 0x15a Apr 11 13:22:30 oim61024009 kernel: session7: iscsi2_queuecommand cmd 0x2a rejected (7) Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_hdr_recv_prep (digest disabled) Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_task_init mtask deq [itt 0x0] Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_sw_tcp_send_hdr_prep digest disabled Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_sw_tcp_send_linear_data_prep datalen=208 digest disabled Apr 11 13:22:32 oim61024009 kernel: session7: iscsi_prep_mgmt_task mgmtpdu [op 0x3 hdr->itt 0x10000000 datalen 208] Apr 11 13:22:32 oim61024009 kernel: connection7:0: iscsi_tcp_segment_done copied 0 0 size 48 xmit On Apr 14, 2011, at 6:45 AM, Mike Christie wrote: > On 04/14/2011 05:12 AM, Joe Hoot wrote: >> Mike, >> >> I'm still digging into this "iSCSI Disconnect" issue that we've been dealing >> with for about two years here in SUNY. The iscsi-initiator that I'm using >> is this: >> >> iscsi-initiator-utils-6.2.0.872-6.0.2.el5 >> >> I've been running dt tests to the equallogic in our Oracle VM environment >> with the following set: >> >> echo 1> /sys/module/libiscsi2/parameters/debug_libiscsi >> echo 1> /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp >> echo 1> /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp >> >> When I try to look for "echo 1> >> /sys/module/libiscsi/parameters/debug_libiscsi_eh" I don't see that as an >> available option. Are the debug lines above correct? Are there other debug >> messages that I should be gathering? >> > > It is only in newer kernels. debug_libiscsi will print out what > debug_libiscsi_eh does plus more. > > >> [root@oim61024001 src]# ls -l /sys/module/libiscsi >> libiscsi2/ libiscsi_tcp/ >> >> >> We have done numerous tests and different hardware. I have eliminated OVM >> and have been testing this now on OEL5.6. >> >> The scenario that seems to break connections quickly is the following: >> >> - cluster of 5 nodes using OCFS2 (I will be trying to rule OCFS2 and >> dm-multipath out shortly) >> - 5-7 iSCSI volumes connected to each of those 5 nodes >> - 4 threads of dt running against each of 5-7 volumes from each host = 28 >> threads of dt slamming the volumes per host = 140 threads of dt per cluster. >> - test only lasts about 2-5 minutes before I start seeing ping timeouts and >> disconnects. >> >> The issue that we've seen thus far is mainly with EqualLogic and open-iscsi. >> From what EQL is telling us, the initiator is "aborting" the connection. >> But from the initiator-side, we just see "ping timeout" messages (and then >> the connection eventually goes away). >> > > When the ping times out, the initiator will abort/drop the connection. > It will then try to relogin. So you should probably be seeing the > ping/nop timeout message, then a conn error 1011, then a message about > reconnected in X retries, right? > > > The thing is that below with tgtd we seem to be seeing scsi commands > timing out. On the initiator logs did you see ping/nop timeout messages > when you hit the problem below with tgtd? > > With tgtd let's take take the disks out of it and just test iscsi. > > So do something like this: > > tgtadm --op new --mode target --tid 1 -T iqn.2001-04.com.pertest > tgtadm --op new --mode logicalunit --tid 1 --lun 1 --bstype null -b > /dev/null > tgtadm --op bind --mode target --tid 1 -I ALL > > This of course will not do any real IO so you cannot do tests that > verify day. Just do really harsh heavy read/write tests. > > > When you do IO to fake disks do you see any ping/nop or scsi timeout errors? > > >> We recently saw a thread (Apr 4) regarding cfq scheduler. So we quickly >> tested noop and deadline, just to see if that would change anything-- it >> didn't. >> >> So my most recent test was to try out a different target, just to see if we >> could rule out the EqualLogic. Each time I changed from EQLX to the tgtd, I >> would reset (and rescan in my volumes) the iscsid.conf's "FastAbort = No", >> or yes (if I was testing tgtd), to conform with EQLX's best practices. >> > > The abort setting is not going to help at all for any target. That > basically kicks in after you have already hit the problem you are > hitting. It just speeds up the handling of the problem. > > >> So at this point, after I get dm-multipath and OCFS2 out of the equation, it >> will be down to a tartget + kernel/initiator + I/O scheduler and I want to >> make sure that I'm getting all the debug information that I might need to >> analyze what is going on. >> >> Are there any other debug tunables that you might recommend adding to my >> script? > > No, the ones you have above are all there is. Maybe tcpdump -w iscsi.out > -i ethXYZ > > Did you have the initiator /var/log/messages with debugging on when you > hit the problem below? If so send them. > >> >> On Apr 14, 2011, at 12:02 AM, Mike Christie wrote: >> >>> On 04/12/2011 12:43 PM, Joe Hoot wrote: >>>> I'm trying to understand the following messages: >>>> >>>> Apr 12 13:23:52 oim60025001 tgtd: conn_close(88) connection closed >>>> 0x94d80c4 1 >>>> .... lots of the above messages... >>>> Apr 12 13:37:19 oim60025001 tgtd: abort_task_set(979) found 271 0 >>>> .... lots of the above messages... >>>> Apr 12 13:37:27 oim60025001 tgtd: abort_cmd(955) found e9 e >>>> .... lots of the above messages... >>>> Apr 12 13:39:08 oim60025001 tgtd: conn_close(88) connection closed >>>> 0xa9ab8ec 1 >>>> > > -- > You received this message because you are subscribed to the Google Groups > "open-iscsi" group. > To post to this group, send email to [email protected]. > To unsubscribe from this group, send email to > [email protected]. > For more options, visit this group at > http://groups.google.com/group/open-iscsi?hl=en. > -- You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
