On 19/04/16 16:46, Tom Herbert wrote:
> On Tue, Apr 19, 2016 at 7:50 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> We have hard time to deal with latencies already, and maintaining some
>> sanity in the stack(s)
> Right, this is significant complexity for a fairly narrow use case.

Why do you say the use case is narrow?  This approach should increase
packet rate for any (non-GROed) traffic, whether for local delivery or
forwarding.  If you're line-rate limited, it'll save CPU time instead.
The only reason I focused my testing on single-byte UDP is because the
benefits are more easily measured in that case.

If anything, the use case is broader than GRO, because GRO can't be used
for datagram protocols where packet boundaries must be maintained.  And
because the listified processing is at least partly sharing code with
the regular stack, it's less complexity than GRO which has to have
essentially its own receive stack, _and_ code to coalesce the results
back into a superframe.

I think if we pushed bundled RX all the way up to the TCP layer, it
might potentially also be faster than GRO, because it avoids the work of
coalescing superframes; plus going through the GRO callbacks for each
packet could end up blowing icache in the same way the regular stack
does.  If bundling did prove faster, we could then remove GRO, and
overall complexity would be _reduced_.  But I admit it may be a long
shot.

-Ed