On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:

> Sure, but the receive path is parallelized.

This is true for multiqueue processing, assuming you can dedicate many
cores to process RX.

> Improving parallelism has
> continuously shown to have much more impact than attempting to
> optimize for cache misses. The primary goal is not to drive 100Gbps
> with 64 packets from a single CPU. It is one benchmark of many we
> should look at to measure efficiency of the data path, but I've yet to
> see any real workload that requires that...
>
> Regardless of anything, we need to load packet headers into CPU cache
> to do protocol processing. I'm not sure I see how trying to defer that
> as long as possible helps except in cases where the packet is crossing
> CPU cache boundaries and can eliminate cache misses completely (not
> just move them around from one function to another).

Note that some user space applications use multiple cores (or hyper
threads) to implement a pipeline, using a single RX queue.

One thread handles one stage (device RX drain) and prefetches data into
the shared L1/L2 (and/or the shared L3 for pipelines with more than 2
threads).

The second thread processes packets with headers already in L1/L2.

This way, the ~100 ns penalty (or even more if you also consider skb
allocations) to bring packet headers into cache does not hurt PPS.
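
To make the pipeline idea concrete, here is a minimal user-space sketch
in C with pthreads. It is only an illustration under assumptions, not
code from any real application or driver: the names (struct pkt,
spsc_ring, run_pipeline), the ring size and the in-memory array standing
in for the device RX queue are all hypothetical. Stage 1 drains the
"RX queue", issues a prefetch on each header and hands the packet to
stage 2 through a single-producer/single-consumer ring. CPU pinning is
left out; the prefetch only helps stage 2 if both threads share a cache
level (hyper-thread siblings for L1/L2, same package for L3).

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 256	/* hypothetical, must be a power of two */
#define HDR_LEN   64	/* hypothetical header size, one cache line */

struct pkt { uint8_t hdr[HDR_LEN]; };

/* Single-producer/single-consumer ring between the two stages. */
struct spsc_ring {
	struct pkt *slot[RING_SIZE];
	_Atomic size_t head;	/* only written by stage 1 */
	_Atomic size_t tail;	/* only written by stage 2 */
};

struct pipeline {
	struct spsc_ring ring;
	struct pkt *rx;		/* stand-in for the device RX queue */
	size_t n;
	uint64_t sum;		/* result of the "protocol processing" */
};

/* Stage 1: drain RX, prefetch headers, hand packets to stage 2. */
static void *stage1_rx_drain(void *arg)
{
	struct pipeline *p = arg;

	for (size_t i = 0; i < p->n; i++) {
		struct pkt *pk = &p->rx[i];
		size_t head = atomic_load_explicit(&p->ring.head,
						   memory_order_relaxed);

		__builtin_prefetch(pk->hdr, 0, 3);	/* GCC/Clang builtin */
		while (head - atomic_load_explicit(&p->ring.tail,
				memory_order_acquire) == RING_SIZE)
			sched_yield();			/* ring full */
		p->ring.slot[head % RING_SIZE] = pk;
		atomic_store_explicit(&p->ring.head, head + 1,
				      memory_order_release);
	}
	return NULL;
}

/* Stage 2: touch the headers, which are ideally already in cache. */
static void *stage2_process(void *arg)
{
	struct pipeline *p = arg;
	uint64_t sum = 0;

	for (size_t i = 0; i < p->n; i++) {
		size_t tail = atomic_load_explicit(&p->ring.tail,
						   memory_order_relaxed);

		while (atomic_load_explicit(&p->ring.head,
				memory_order_acquire) == tail)
			sched_yield();			/* ring empty */
		sum += p->ring.slot[tail % RING_SIZE]->hdr[0];
		atomic_store_explicit(&p->ring.tail, tail + 1,
				      memory_order_release);
	}
	p->sum = sum;
	return NULL;
}

/* Run both stages over n fake packets; returns the sum of hdr[0]. */
uint64_t run_pipeline(size_t n)
{
	struct pipeline p = { .n = n };
	pthread_t t1, t2;

	assert((RING_SIZE & (RING_SIZE - 1)) == 0);
	atomic_init(&p.ring.head, 0);
	atomic_init(&p.ring.tail, 0);
	p.rx = calloc(n ? n : 1, sizeof(*p.rx));
	if (!p.rx)
		return 0;
	for (size_t i = 0; i < n; i++)
		p.rx[i].hdr[0] = (uint8_t)i;	/* fake header content */

	pthread_create(&t1, NULL, stage1_rx_drain, &p);
	pthread_create(&t2, NULL, stage2_process, &p);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	free(p.rx);
	return p.sum;
}
```

Build with something like "cc -O2 -pthread". The sketch only shows the
hand-off structure; it does not measure anything, and a real pipeline
would pin the two threads and batch the ring operations.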