On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear <gree...@candelatech.com> wrote:
> On 01/22/2018 10:16 AM, Eric Dumazet wrote:
>>
>> On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
>>>
>>> My test case is to have 6 processes each create 5000 TCP IPv4 connections
>>> to each other on a system with 16GB RAM and send slow-speed data. This
>>> works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13
>>> first complains about running out of tcp memory, but even after forcing
>>> those values higher, the max connections we can get is around 15k.
>>>
>>> Both kernels have my out-of-tree patches applied, so it is possible it is
>>> my fault at this point.
>>>
>>> Any suggestions as to what this might be caused by, or if it is fixed in
>>> more recent kernels?
>>>
>>> I will start bisecting in the meantime...
>>>
>>
>> Hi Ben
>>
>> Unfortunately I have no idea.
>>
>> Are you using loopback flows, or have I misunderstood you ?
>>
>> How loopback connections can be slow-speed ?
>>
>
> I am sending to self, but over external network interfaces, by using
> routing tables and rules and such.
>
> On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
> connections. In this case, I have a pair of 10G ports doing 15k, and then
> I try to start 5k on two of the 1G ports....
>
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
> Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
> Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
>

Ben

We had an interface doing this and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but don't believe they were when we first hit this problem.

Josh
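
For anyone wanting to check whether the link flapping goes away after picking up those commits, something like the following could count the e1000e link transitions in the kernel log. This is only a rough sketch: the function name, the script name in the usage note, and the regex are assumptions based on the message format quoted earlier in this thread ("e1000e: eth3 NIC Link is Down"), and other drivers or kernel versions may word the message differently.

```python
import re
import sys
from collections import Counter

# Assumed pattern, taken from the e1000e messages quoted in this thread,
# e.g. "kernel: e1000e: eth3 NIC Link is Down". Adjust for other formats.
LINK_RE = re.compile(r"e1000e: (\S+) NIC Link is (Up|Down)")


def count_link_transitions(lines):
    """Return a Counter keyed by (interface, state) for matching log lines."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Read unwrapped kernel log lines from stdin and print one total per
    # interface and link state.
    totals = count_link_transitions(sys.stdin)
    for (iface, state), n in sorted(totals.items()):
        print(f"{iface} {state}: {n}")
```

It could be fed from the kernel log, for example `journalctl -k | python3 linkflaps.py` (the script name is hypothetical). A count that keeps climbing while the connections are being started would suggest the interface is still bouncing.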