On 07/09/18 03:32, Eric Dumazet wrote:
> Adding this complexity and icache pressure needs more experimental results.
> What about RPC workloads (eg 100 concurrent netperf -t TCP_RR -- -r 8000,8000 )
>
> Thanks.

Some more results.  Note that the TCP_STREAM figures given in the cover
letter were with '-m 1450'; when I run the same test with '-m 8000' I hit
line rate on my 10G NIC with both the old and the new code.  Also, these
tests are still all run with IRQs bound to a single core on the RX side.

A further note: the Code Under Test is running on the netserver side (the
RX side for TCP_STREAM tests); the netperf side is running stock RHEL7u3
(kernel 3.10.0-514.el7.x86_64).  This potentially matters more for the
TCP_RR test, as both sides have to receive data.
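(In case anyone wants to reproduce the single-core RX setup: the pinning
can be done with something along the lines of the sketch below.  The
interface name and the msi_irqs/smp_affinity_list discovery are my guesses
at a typical setup, not taken verbatim from the test rig.)

#!/usr/bin/env python3
# Sketch only: pin all of a NIC's MSI(-X) IRQs to a single CPU, as in the
# "IRQs bound to a single core on the RX side" setup above.
# The interface name is a placeholder; stop irqbalance before running.
import os

IFACE = "eth0"   # hypothetical interface name
CPU = "0"        # the single core that will take all RX interrupts

msi_dir = f"/sys/class/net/{IFACE}/device/msi_irqs"
for irq in sorted(os.listdir(msi_dir), key=int):
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(CPU)
    print(f"IRQ {irq} -> CPU {CPU}")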
TCP_STREAM, 8000 bytes, GRO enabled (4 streams)
old: 9.415 Gbit/s
new: 9.417 Gbit/s
(Welch p = 0.087, n₁ = n₂ = 3)

There was however a noticeable reduction in *TX* CPU usage, of about 15%.
I don't know why that should be (changes in ack timing, perhaps?)

TCP_STREAM, 8000 bytes, GRO disabled (4 streams)
old: 5.200 Gbit/s
new: 5.839 Gbit/s (12.3% faster)
(Welch p < 0.001, n₁ = n₂ = 6)

TCP_RR, 8000 bytes, GRO enabled (100 streams)
(FoM is one-way latency, 0.5 / tps)
old: 855.833 us
new: 862.033 us (0.7% slower)
(Welch p = 0.040, n₁ = n₂ = 6)

TCP_RR, 8000 bytes, GRO disabled (100 streams)
old: 962.733 us
new: 871.417 us (9.5% faster)
(Welch p < 0.001, n₁ = n₂ = 6)

Conclusion: with GRO on we pay a small but real RR penalty.  With GRO off
(thus also with traffic that can't be coalesced) we get a noticeable speed
boost from being able to use netif_receive_skb_list_internal().

-Ed
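PS: in case anyone wants to check the statistics, the figures come from
the "FoM = 0.5 / tps" conversion plus a Welch (unequal-variance) t-test.
A minimal sketch of that calculation, using made-up sample numbers rather
than my actual per-run measurements, and assuming scipy is available:

# Sketch only: convert netperf TCP_RR transactions/sec to one-way latency
# (FoM = 0.5 / tps, shown here in microseconds) and compare old vs new
# kernels with Welch's t-test.  The per-run tps values are placeholders.
from scipy.stats import ttest_ind

def one_way_latency_us(tps):
    """One-way latency in microseconds from a TCP_RR transaction rate."""
    return 0.5 / tps * 1e6

old_tps = [515.0, 520.1, 518.3, 519.6, 517.2, 521.0]   # hypothetical runs
new_tps = [570.4, 574.9, 572.2, 575.8, 571.5, 573.3]   # hypothetical runs

old_lat = [one_way_latency_us(t) for t in old_tps]
new_lat = [one_way_latency_us(t) for t in new_tps]

# equal_var=False selects Welch's unequal-variance t-test.
stat, p = ttest_ind(old_lat, new_lat, equal_var=False)
print(f"old mean {sum(old_lat)/len(old_lat):.3f} us, "
      f"new mean {sum(new_lat)/len(new_lat):.3f} us, Welch p = {p:.3g}")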