On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote:

Hi Willy, another good finding during the weekend! ;)

> Hi,
>
> I'm observing multiple apparently unrelated network performance
> issues in 3.7, to the point that I doubt they come from the
> network stack.
>
> My setup involves 3 machines connected point-to-point with myri
> 10GE NICs (the middle machine has 2 NICs). The middle machine
> normally runs haproxy, the other two run either an HTTP load
> generator or a dummy web server:
>
>    [ client ] <--------> [ haproxy ] <--------> [ server ]
>
> Usually, transferring HTTP objects from the server to the client
> via haproxy causes no problem at 10 Gbps for moderately large
> objects.
>
> This time I observed that it was not possible to go beyond 6.8 Gbps,
> with the whole chain idling a lot. I tried changing the IRQ rate, CPU
> affinity, tcp_rmem/tcp_wmem, disabling flow control, etc. (the usual
> knobs), but nothing managed to go beyond that.
>
> So I removed haproxy from the equation and simply started the client
> on the middle machine. Same issue. I thought about concurrency issues,
> so I reduced to a single connection, and nothing changed (usually I
> achieve 10G even with a single connection, given large enough TCP
> windows). I tried to start tcpdump, and the transfer immediately
> stalled and did not come back after I stopped tcpdump. This was
> reproducible several times, but not always.
>
> So I first suspected an issue in the myri10ge driver and wanted to
> confirm that everything was OK on the middle machine.
>
> I started the server on it and aimed the client at it via the loopback.
> The transfer rate was even worse: randomly oscillating between 10 and
> 100 MB/s! Normally on the loopback I get several GB/s here.
>
> Running tcpdump on the loopback showed me several very concerning
> issues:
>
> 1) Lots of packets are lost before reaching tcpdump. The trace shows
>    that these segments are ACKed, so they're correctly received, but
>    tcpdump does not get them. The tcpdump stats at the end report
>    impressive numbers, around 90% of packets dropped from the capture!
>
> 2) ACKs seem to be delivered immediately but do not trigger sending;
>    the system seems to be running with delayed ACKs, as it waits 40 or
>    200 ms before restarting, and this is visible even in the first
>    round trips:
>
>    - connection setup:
>
>    18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
>    18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
>    18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
>
>    - GET /?s=1g HTTP/1.0:
>
>    18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 2036886616:2036886738(122) ack 126397114 win 16
>
>    - HTTP/1.1 200 OK with the beginning of the response:
>
>    18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126397114:126401210(4096) ack 2036886738 win 16
>    18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
>    ==> 200 ms pause here
>    18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 126401210:126463006(61796) ack 2036886738 win 16
>    ==> 40 ms pause here
>    18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
>    18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126463006:126527006(64000) ack 2036886738 win 16
>
>    ... and so on
>
> My server is using splice() with the SPLICE_F_MORE flag to send data.
> I noticed that when not using splice() and relying on send(MSG_MORE)
> instead, I don't get the issue.
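For what it's worth, here is roughly how I picture the two send paths
being compared. This is an untested sketch, not your actual code: the
fd names, the pipe handling and the buffer size are all made up.

#define _GNU_SOURCE		/* for splice() */
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* splice() path: in_fd -> pipe -> sock, no copy to user space,
 * SPLICE_F_MORE playing the same role as MSG_MORE below. */
static ssize_t forward_splice(int in_fd, int sock, size_t len)
{
	int p[2];
	ssize_t n, total = 0;

	if (pipe(p) < 0)
		return -1;

	while ((size_t)total < len) {
		n = splice(in_fd, NULL, p[1], NULL, len - total, SPLICE_F_MORE);
		if (n <= 0)
			break;
		/* drain the pipe into the socket
		 * (short writes ignored for brevity) */
		n = splice(p[0], NULL, sock, NULL, n, SPLICE_F_MORE);
		if (n <= 0)
			break;
		total += n;
	}
	close(p[0]);
	close(p[1]);
	return total;
}

/* send(MSG_MORE) path: same data through a user-space buffer;
 * this variant reportedly does not show the problem. */
static ssize_t forward_send(int in_fd, int sock, size_t len)
{
	char buf[65536];
	ssize_t n, total = 0;

	while ((size_t)total < len && (n = read(in_fd, buf, sizeof(buf))) > 0) {
		if (send(sock, buf, n, MSG_MORE) < 0)
			break;
		total += n;
	}
	return total;
}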
> 3) I wondered if this had something to do with the 64k MTU on the
>    loopback, so I lowered it to 16 kB. The performance was even worse
>    (about 5 MB/s). Starting tcpdump managed to make my transfer stall,
>    just like with the myri10ge. In this last test, I noticed that there
>    were some real drops, because there were some SACKs:
>
>    18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 956153186:956169530(16344) ack 131668746 win 16
>    18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
>    18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957035762:957052106(16344) ack 131668746 win 16
>    18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
>    18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957052106:957099566(47460) ack 131668746 win 16
>    18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957402550:957418894(16344) ack 131668746 win 16
>    18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
>    18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957672806:957689150(16344) ack 131668746 win 16
>    18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
>    18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 958962966:958979310(16344) ack 131668746 win 16
>    18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
>    18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959405614:959421958}>
>    18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959421958:959438302}>
>
> Please note that the Myri card is running with the normal MTU of 1500;
> jumbo frames were not used.
>
> From what I could test, only the Tx path seems affected, because after
> rebooting the server on 3.5.7, I can again transfer at 10 Gbps via the
> myri10ge to the 3.7.1 client. I tried to disable GSO, TSO, etc., but
> nothing worked. Also, on 3.5.7, there are almost no drops between the
> kernel and tcpdump.
>
> I really suspect that all these issues are related. Maybe something
> about send buffer recycling, I don't know. The fact that tcpdump gets
> very few packets also makes me think of something related to memory
> allocation, and that could perhaps explain why starting it causes the
> traffic to stall.
>
> Sadly, it will be hard to bisect this bug, because I had a hard time
> running 3.7-rc up to rc5 due to several bugs that affected this
> workload (and were fortunately fixed since).
>
> I think it's better to focus only on the issue with the loopback, since
> it's easier to reproduce and involves fewer moving parts.
>
> I've just tried to remove the SPLICE_F_MORE flag on the sender, but it
> does not change anything.
>
> If anyone has any idea of things to test to narrow the problem down,
> I'm interested. I have not tried 3.8-rc2 yet, and reading the archives
> there do not seem to be any similar reports. Note that I've been using
> 3.5.7 as a fallback, but the same lab ran 3.6.6 without any issues
> 2 months ago, so I'm really confident that this is a 3.7 regression.
>
> Regards,
> Willy
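About the 40/200 ms gaps in your first trace: they do look like delayed
ACKs on the receiver, as you say. An untested way to rule that in or
out from the client side is to re-arm TCP_QUICKACK around each read.
The helper below is made up; TCP_QUICKACK itself is real, but one-shot:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* TCP_QUICKACK is not permanent; the kernel clears it again, so it
 * has to be set before every read we want an immediate ACK for. */
static ssize_t read_quickack(int sock, void *buf, size_t len)
{
	int one = 1;

	setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
	return read(sock, buf, len);
}

If the stalls disappear with that, it would point at the receive side
not ACKing, rather than at the sender.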
1) This looks like interrupts being spread over multiple CPUs, which
gives out-of-order problems in the TCP stack. You could try pinning the
NIC interrupts to a single CPU to check (rough sketch after point 2).

2) Another possibility would be that the Myri card/driver doesn't cope
very well with high-order pages.
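Untested sketch for the pinning test in point 1. The IRQ number used in
the comment is made up; take the real ones for the myri10ge queues from
/proc/interrupts:

#include <stdio.h>

/* Write a one-CPU hex mask into /proc/irq/<irq>/smp_affinity so that
 * all packets of that queue are handled on the same CPU. */
static int pin_irq(int irq, unsigned int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", 1u << cpu);
	fclose(f);
	return 0;
}

/* e.g. pin_irq(42, 0) to steer the (hypothetical) IRQ 42 to CPU 0 */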