On Tue, Apr 19, 2016 at 10:12 AM, Edward Cree <ec...@solarflare.com> wrote:
> On 19/04/16 16:46, Tom Herbert wrote:
>> On Tue, Apr 19, 2016 at 7:50 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>>> We have hard time to deal with latencies already, and maintaining some
>>> sanity in the stack(s)
>> Right, this is significant complexity for a fairly narrow use case.
> Why do you say the use case is narrow? This approach should increase
> packet rate for any (non-GROed) traffic, whether for local delivery or
> forwarding. If you're line-rate limited, it'll save CPU time instead.
> The only reason I focused my testing on single-byte UDP is because the
> benefits are more easily measured in that case.
>
It's a narrow use case because of the suggestion that multiple packets
should traverse the network stack together. Beyond queuing to the
backlog, I don't understand what more processing can be done without
splitting the list up. We need to do a route lookup on each packet, run
each through iptables, and deliver each packet individually to the
application. Queuing to the backlog seems to me to be a localized bulk
enqueue/dequeue problem rather than a stack-level infrastructure
problem.
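To illustrate the bulk enqueue point, something like the sketch below is
all the batching I think is needed at that stage. The helper itself is
hypothetical, but softnet_data, sk_buff_head, and
skb_queue_splice_tail_init() are real kernel pieces: collect a batch of
skbs on a private, unlocked list, then splice the whole batch into the
per-CPU backlog with one lock round trip instead of one per packet.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical sketch: splice a privately-built batch of skbs into
 * the per-CPU backlog queue under a single lock acquisition.
 */
static void bulk_enqueue_backlog(struct softnet_data *sd,
                                 struct sk_buff_head *batch)
{
        unsigned long flags;

        spin_lock_irqsave(&sd->input_pkt_queue.lock, flags);
        /* one lock round trip covers the whole batch */
        skb_queue_splice_tail_init(batch, &sd->input_pkt_queue);
        spin_unlock_irqrestore(&sd->input_pkt_queue.lock, flags);
}

The dequeue side can do the inverse: splice input_pkt_queue onto a
private list under the lock, then process the packets lock-free.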
The general alternative to grouping packets together is to apply cached
values that were found in lookups for previous "similar" packets. Since
nearly all traffic fits some profile of a flow, we can leverage the fact
that packets in a flow should have similar lookup results. So, for
example, the first time we see a flow we can create a flow state and
save the results of any lookups done for packets in that flow (route
lookup, iptables, etc.). For subsequent packets, if we match the flow
then we already have the answers for all the lookups we would need.
Maintaining temporal flow states and performing fixed 5-tuple flow state
lookups in a hash table is easy for a host (and we can often throw a lot
of memory at it, sizing hash tables to avoid collisions); a rough sketch
of the idea is at the bottom of this mail. LPM matching, open-ended rule
chains, multi-table lookups, and crazy hashes over 35 fields in headers
are things we only want to do when there is no other recourse. This
illustrates one reason why a host is not a switch: we have no hardware
to do complex lookups.

Tom

> If anything, the use case is broader than GRO, because GRO can't be used
> for datagram protocols where packet boundaries must be maintained.
> And because the listified processing is at least partly sharing code with
> the regular stack, it's less complexity than GRO which has to have
> essentially its own receive stack, _and_ code to coalesce the results
> back into a superframe.
>
> I think if we pushed bundled RX all the way up to the TCP layer, it might
> potentially also be faster than GRO, because it avoids the work of
> coalescing superframes; plus going through the GRO callbacks for each
> packet could end up blowing icache in the same way the regular stack does.
> If bundling did prove faster, we could then remove GRO, and overall
> complexity would be _reduced_.
>
> But I admit it may be a long shot.
>
> -Ed
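(Sketch referenced above. All names here are hypothetical and this is
only meant to show the shape of the flow cache, not actual kernel code;
jhash() and the hlist helpers are the real kernel primitives I'd expect
to use. Key on the fixed 5-tuple, cache the expensive lookup results in
the flow state, and reuse them whenever the lookup hits.)

#include <linux/jhash.h>
#include <linux/list.h>
#include <linux/string.h>
#include <net/dst.h>

#define FLOW_TABLE_SIZE 4096    /* hypothetical size; power of two */

/* Callers must zero the key before filling it so that structure
 * padding compares equal in the memcmp() below.
 */
struct flow_key {
        __be32 saddr, daddr;
        __be16 sport, dport;
        u8     protocol;
};

struct flow_state {
        struct hlist_node  node;
        struct flow_key    key;
        struct dst_entry  *dst;        /* cached route lookup result   */
        int                nf_verdict; /* cached netfilter disposition */
};

static struct hlist_head flow_table[FLOW_TABLE_SIZE];

static struct flow_state *flow_lookup(const struct flow_key *key)
{
        u32 hash = jhash(key, sizeof(*key), 0) & (FLOW_TABLE_SIZE - 1);
        struct flow_state *fs;

        hlist_for_each_entry(fs, &flow_table[hash], node)
                if (!memcmp(&fs->key, key, sizeof(*key)))
                        return fs;      /* hit: reuse cached results */
        return NULL;                    /* miss: do the full lookups */
}

On a miss you do the route/netfilter lookups once and install a new
flow_state; expiry of stale entries and locking are elided here.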