Re: Optimizing instruction-cache, more packets at each stage

Eric Dumazet Thu, 21 Jan 2016 09:49:22 -0800

On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:

> Sure, but the receive path is parallelized.


This is true for multiqueue processing, assuming you can dedicate many
cores to process RX.

>  Improving parallelism has
> continuously shown to have much more impact than attempting to
> optimize for cache misses. The primary goal is not to drive 100Gbps
> with 64 packets from a single CPU. It is one benchmark of many we
> should look at to measure efficiency of the data path, but I've yet to
> see any real workload that requires that...
> 
> Regardless of anything, we need to load packet headers into CPU cache
> to do protocol processing. I'm not sure I see how trying to defer that
> as long as possible helps except in cases where the packet is crossing
> CPU cache boundaries and can eliminate cache misses completely (not
> just move them around from one function to another).

Note that some user-space applications use multiple cores (or
hyperthreads) to implement a pipeline, using a single RX queue.

One thread can handle one stage (device RX drain) and prefetch data into
shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads).

The second thread processes packets with headers already in L1/L2.

This way, the ~100 ns penalty (or even more if you also consider skb
allocations) of bringing packet headers into cache does not hurt PPS.
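
To make the two-stage idea concrete, here is a minimal user-space sketch.
It is not taken from any existing application: struct pkt, struct ring,
RING_SIZE and run_pipeline() are all hypothetical names, the "device" is
just an array of buffers, and the threads are not pinned, so whether they
really share L1/L2 (hyperthread siblings) or only L3 depends on where the
scheduler places them. Stage 1 issues a prefetch for each header and hands
the pointer over through a single-producer/single-consumer ring; stage 2
then touches the header.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 256                 /* hypothetical ring depth, power of two */

struct pkt {                          /* hypothetical stand-in for an RX buffer */
	uint8_t hdr[64];              /* headers, one cache line */
	uint8_t payload[1984];
};

struct ring {                         /* single-producer/single-consumer ring */
	struct pkt *slot[RING_SIZE];
	_Atomic size_t head;          /* written by stage 1 only */
	_Atomic size_t tail;          /* written by stage 2 only */
};

struct pipeline {
	struct ring ring;
	struct pkt *pkts;             /* fake RX queue contents */
	size_t npkts;
	uint64_t sum;                 /* result computed by stage 2 */
};

/* Stage 1: drain the (fake) RX queue, prefetch headers, pass pointers on. */
static void *stage1_rx_drain(void *arg)
{
	struct pipeline *p = arg;

	for (size_t i = 0; i < p->npkts; i++) {
		struct pkt *pkt = &p->pkts[i];
		size_t head = atomic_load_explicit(&p->ring.head, memory_order_relaxed);

		/* GCC/Clang builtin: start pulling the header line into cache. */
		__builtin_prefetch(pkt->hdr, 0, 3);

		/* Wait while the ring is full. */
		while (head - atomic_load_explicit(&p->ring.tail, memory_order_acquire) == RING_SIZE)
			sched_yield();

		p->ring.slot[head % RING_SIZE] = pkt;
		atomic_store_explicit(&p->ring.head, head + 1, memory_order_release);
	}
	return NULL;
}

/* Stage 2: "protocol processing", here just summing the first header byte. */
static void *stage2_process(void *arg)
{
	struct pipeline *p = arg;
	uint64_t sum = 0;

	for (size_t i = 0; i < p->npkts; i++) {
		size_t tail = atomic_load_explicit(&p->ring.tail, memory_order_relaxed);

		/* Wait while the ring is empty. */
		while (atomic_load_explicit(&p->ring.head, memory_order_acquire) == tail)
			sched_yield();

		sum += p->ring.slot[tail % RING_SIZE]->hdr[0];
		atomic_store_explicit(&p->ring.tail, tail + 1, memory_order_release);
	}
	p->sum = sum;
	return NULL;
}

/* Run both stages over npkts fake packets (npkts > 0); return stage 2's sum. */
uint64_t run_pipeline(size_t npkts)
{
	struct pipeline *p = calloc(1, sizeof(*p));
	pthread_t t1, t2;
	uint64_t sum;

	assert(p != NULL);
	p->pkts = calloc(npkts, sizeof(*p->pkts));
	assert(p->pkts != NULL);
	p->npkts = npkts;
	atomic_init(&p->ring.head, 0);
	atomic_init(&p->ring.tail, 0);

	for (size_t i = 0; i < npkts; i++)
		p->pkts[i].hdr[0] = (uint8_t)i;   /* recognizable header byte */

	if (pthread_create(&t1, NULL, stage1_rx_drain, p) ||
	    pthread_create(&t2, NULL, stage2_process, p))
		abort();
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	sum = p->sum;
	free(p->pkts);
	free(p);
	return sum;
}
```

Build with something like "cc -O2 -pthread". A real pipeline would pin the
two threads to sibling hyperthreads (or cores sharing a cache) and would
likely batch ring updates; both are left out here to keep the sketch
short, and the sched_yield() calls are only there so it also behaves on a
single CPU.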
