On Thu, Aug 11, 2016 at 11:25 PM, <rcoch...@linutronix.de> wrote: > On Thu, Aug 11, 2016 at 06:21:26PM +0300, Jouni Malinen wrote: >> The test code looked like this in python: >> >> addr = (url.hostname, url.port) >> socks = {} >> for i in range(20): >> socks[i] = socket.socket(socket.AF_INET, socket.SOCK_STREAM, >> socket.IPPROTO_TCP) >> socks[i].connect(addr) > > You getting a timeout on TCP connect()? Isn't that timeout really > long, like 75 seconds or something?
Yes, it looks like a TCP connect() timeout. I use a significantly reduced timeout in the test scripts since they are run unattended and are supposed to terminate in reasonable amount of time.. That said, this specific test case should not really have used as short a timeout as it did, i.e., just one second. Interestingly enough, increasing that to just 1.1 seconds was enough to make the test case pass.. Looking at the time it takes to execute connect(), it is 1.02 - 1.08 seconds for the timing out (with timeout=1sec) case which is now at i=14 while all the other 19 calls take 0.0 seconds.. If I increase that 20 to 50, I get more of such about 1.03 second results at i=17, i=34, i=48.. Looking more at what exactly is happening at the TCP layer, this is likely related to the server behavior since listen() backlog is set to 10 and if there are 10 parallel connections, the last one if immediately closed before reading anything. Somehow this combination with this kernel patch applied makes one of the connect() calls take that surprisingly long 1.02 or so seconds. Looking at a sniffer capture (*), the three-way TCP connection goes through fine for the first 15 connect() calls, but the 15th one does not get a response to SYN. This SYN is the frame 47 in the capture file with srcport == 60802. There is no SYN,ACK for it. The about one second unexpected time for connect() comes from this, i.e., the connection is completed only after the client side does TCP retransmission of the SYN (frame #77) a second later and the server side replies with RST,ACK (frame #78). 2 0.039135 127.0.0.1 -> 127.0.0.1 TCP 74 60772 > 49152 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937755 TSecr=0 WS=64 3 0.039146 127.0.0.1 -> 127.0.0.1 TCP 74 49152 > 60772 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937755 TSecr=4294937755 WS=64 4 0.039156 127.0.0.1 -> 127.0.0.1 TCP 66 60772 > 49152 [ACK] Seq=1 Ack=1 Win=43712 Len=0 TSval=4294937755 TSecr=4294937755 ... 47 0.042559 127.0.0.1 -> 127.0.0.1 TCP 74 60802 > 49152 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937756 TSecr=0 WS=64 77 1.119943 127.0.0.1 -> 127.0.0.1 TCP 74 [TCP Retransmission] 60802 > 49152 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=4294937864 TSecr=0 WS=64 78 1.119953 127.0.0.1 -> 127.0.0.1 TCP 54 49152 > 60802 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0 So it looks like the issue is in one of the SYN,ACK frames getting completely lost.. (*) http://w1.fi/p/tcp-lo.pcap > Can you provide a simple test case or explain in more detail how you > run your test? I would like to reproduce the issue it here. I'm not sure how to make a simple test case for this taken into account this seems to have some unknown timing dependencies.. A quick loop of 20 TCP socket() + connect() calls with a server side that does listen() with backlog 10 and non-blocking operations with 10th and following incoming sockets getting close() immediately is what the test case ends up doing, but whether a simple program doing that without all of python and wpa_supplicant processing giving suitable timing is unclear.. These files describe the test setup that I'm using to run this: http://w1.fi/cgit/hostap/plain/tests/hwsim/README http://w1.fi/cgit/hostap/plain/tests/hwsim/vm/README with the actual test case being executed with tests/hwsim/vm$ ./vm-run.sh ap_wps_er_http_proto - Jouni