Hello,

I have 6 identical physical machines in one cluster running Debian 6.0. Initially they were used to run Cassandra nodes, but the nodes started going down at random after several hours of work, with connections hung in the CLOSE_WAIT state. CLOSE_WAIT usually indicates incorrect application behavior, but I have reproduced similar symptoms with the netperf TCP_CRR test, even with localhost as the target host. Running 'netperf -H localhost -t TCP_CRR -l -5' results in

    TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : demo
    send_tcp_conn_rr: data recv error: Connection reset by peer

The connections then hang in the CLOSE_WAIT state with a strange 1 byte in Recv-Q:

    tcp 1 0 127.0.0.1:12865 127.0.0.1:39664 CLOSE_WAIT

However, if I set the test duration in seconds (e.g. -l 5) it works correctly, and TCP_RR works correctly all the time.

I have also captured a tcpdump of the conversation between two nodes during a similar TCP_CRR test, and it looks strange too: the nodes open the connection correctly, the 'client' sends its data, and then the 'server' side simply resets the connection.

'netstat -s' after 40 minutes of uptime (reboot, test, and writing this message) shows a suspicious '6 TCP data loss events' and '11 connections reset due to early user close':

Ip:
    2645347 total packets received
    76 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    2645271 incoming packets delivered
    2636980 requests sent out
Icmp:
    22 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 22
    22 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 22
IcmpMsg:
        InType3: 22
        OutType3: 22
Tcp:
    263419 active connections openings
    263458 passive connection openings
    0 failed connection attempts
    62 connection resets received
    1 connections established
    2636459 segments received
    2636437 segments send out
    8 segments retransmited
    0 bad segments received.
    21 resets sent
Udp:
    531 packets received
    2 packets to unknown port received.
    0 packet receive errors
    553 packets sent
UdpLite:
TcpExt:
    9 invalid SYN cookies received
    264883 TCP sockets finished time wait in fast timer
    3 time wait sockets recycled by time stamp
    20 delayed acks sent
    Quick ack mode was activated 1 times
    264978 packets directly queued to recvmsg prequeue.
    473 bytes directly in process context from backlog
    265473 bytes directly received in process context from prequeue
    69 packet headers predicted
    1573 packets header predicted and directly queued to user
    1055284 acknowledgments not containing data payload received
    193 predicted acknowledgments
    6 TCP data loss events
    1 timeouts in loss state
    5 retransmits in slow start
    2 other TCP timeouts
    2 DSACKs sent for old packets
    11 connections reset due to early user close
    TCPSackMerged: 7
    TCPSackShiftFallback: 13

I have already upgraded the 'ixgbe' driver to the latest 3.9-NAPI, but the problem persists, and I still cannot find its source.

Best regards,
Anatoly Rybalchenko
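
P.S. To check whether the behaviour depends on netperf itself, the connect/request/response pattern can be approximated without it. The following is only a minimal sketch of that pattern, not netperf's implementation: the names `serve` and `run_crr` are made up for illustration, and the one-byte request and response, the ephemeral loopback port, and the 5-second timeouts are all assumptions. It runs a fixed number of transactions, which is my understanding of what a negative `-l` value asks for in a request/response test.

```python
import socket
import threading


def serve(listener, transactions):
    """Accept `transactions` connections; echo one byte on each, then close."""
    try:
        for _ in range(transactions):
            conn, _addr = listener.accept()
            with conn:
                data = conn.recv(1)      # read the one-byte request
                if data:
                    conn.sendall(data)   # send the one-byte response
    except OSError:
        # Listener timed out or was closed; stop serving.
        return


def run_crr(transactions=5):
    """Run connect/request/response cycles on loopback; return how many completed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
    listener.listen(16)
    listener.settimeout(5)               # do not block forever in accept()
    port = listener.getsockname()[1]

    server = threading.Thread(target=serve, args=(listener, transactions), daemon=True)
    server.start()

    completed = 0
    try:
        for _ in range(transactions):
            # A new connection per transaction is what distinguishes CRR from RR.
            with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
                client.sendall(b"x")
                if client.recv(1) == b"x":
                    completed += 1
    finally:
        server.join(timeout=5)
        listener.close()
    return completed


if __name__ == "__main__":
    print(run_crr(5))
```

If this loop also fails with a reset on the affected machines, that would point away from netperf and towards the kernel or driver; if it completes every transaction, the netperf control connection (port 12865 in the netstat line above) deserves a closer look.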