Hello,

I have 6 identical physical machines in one cluster, all running Debian 6.0. 
Initially they were used to run Cassandra nodes, but these nodes started to go 
down randomly after several hours of work, leaving connections hung in the 
CLOSE_WAIT state. Typically, CLOSE_WAIT is an indicator of incorrect application 
behavior, but I've reproduced similar symptoms with the netperf CRR test, even 
with localhost as the host:

'netperf -H localhost -t TCP_CRR -l -5' results in:

'TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
localhost (127.0.0.1) port 0 AF_INET : demo
send_tcp_conn_rr: data recv error: Connection reset by peer'

 

And connections hang up in the CLOSE_WAIT state with a strange 1 byte in 
Recv-Q:

'tcp 1 0 127.0.0.1:12865 127.0.0.1:39664 CLOSE_WAIT'

 

Though, if I set the test duration in seconds (e.g. -l 5), it works correctly, 
and TCP_RR works correctly all the time.

I've also captured a tcpdump of the conversation between two nodes in a 
similar TCP_CRR test, and it also looks strange: the nodes open the connection 
correctly, the 'client' sends its data, and then the 'server' side just resets 
the connection.

'netstat -s' after 40 minutes of uptime (reboot, test, and writing this 
message) shows a suspicious '6 TCP data loss events' and '11 connections reset 
due to early user close':

Ip:
   2645347 total packets received
   76 with invalid addresses
   0 forwarded
   0 incoming packets discarded
   2645271 incoming packets delivered
   2636980 requests sent out
Icmp:
   22 ICMP messages received
   0 input ICMP message failed.
   ICMP input histogram:
       destination unreachable: 22
   22 ICMP messages sent
   0 ICMP messages failed
   ICMP output histogram:
       destination unreachable: 22
IcmpMsg:
       InType3: 22
       OutType3: 22
Tcp:
   263419 active connections openings
   263458 passive connection openings
   0 failed connection attempts
   62 connection resets received
   1 connections established
   2636459 segments received
   2636437 segments send out
   8 segments retransmited
   0 bad segments received.
   21 resets sent
Udp:
   531 packets received
   2 packets to unknown port received.
   0 packet receive errors
   553 packets sent
UdpLite:
TcpExt:
   9 invalid SYN cookies received
   264883 TCP sockets finished time wait in fast timer
   3 time wait sockets recycled by time stamp
   20 delayed acks sent
   Quick ack mode was activated 1 times
   264978 packets directly queued to recvmsg prequeue.
   473 bytes directly in process context from backlog
   265473 bytes directly received in process context from prequeue
   69 packet headers predicted
   1573 packets header predicted and directly queued to user
   1055284 acknowledgments not containing data payload received
   193 predicted acknowledgments
   6 TCP data loss events
   1 timeouts in loss state
   5 retransmits in slow start
   2 other TCP timeouts
   2 DSACKs sent for old packets
   11 connections reset due to early user close
   TCPSackMerged: 7
   TCPSackShiftFallback: 13

 

I've already upgraded the 'ixgbe' driver to the latest 3.9-NAPI, but the 
problem still persists, and I cannot even find its source.

 

Best regards,

Anatoly Rybalchenko

 
