On Wed, 28 Jan 2026 09:30:20 -0800
Stephen Hemminger <[email protected]> wrote:

> Implement the single/dual/quad loop design pattern from FD.IO VPP to
> improve cache efficiency in the af_packet PMD receive path.
> 
> The original implementation processes packets one at a time in a simple
> loop, which can result in cache misses when accessing frame headers and
> packet data. The new implementation:
> 
> - Processes packets in batches of 4 (quad), 2 (dual), and 1 (single)
> - Prefetches next batch of frame headers while processing current batch
> - Prefetches packet data before memcpy to hide memory latency
> - Reduces loop overhead through partial unrolling
> 
> Two helper functions are introduced:
> - af_packet_get_frame(): Returns frame pointer at index with wraparound
> - af_packet_rx_one(): Common per-packet processing (mbuf alloc, memcpy,
>   VLAN handling, timestamp offload)
> 
> The quad loop checks availability of all 4 frames before processing,
> falling through to dual/single loops when fewer frames are ready. Early
> exit paths (out_advance1/2/3) ensure correct frame index tracking when
> mbuf allocation fails mid-batch.
> 
> Prefetch strategy:
> - Frame headers: prefetch N+4..N+7 while processing N..N+3
> - Packet data: prefetch at tp_mac offset before memcpy
> 
> This pattern is well-established in high-performance packet processing
> and should improve throughput by better utilizing CPU cache hierarchy,
> particularly beneficial when processing bursts of packets.
> 
> Signed-off-by: Stephen Hemminger <[email protected]>
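
For reference, the quad/dual/single structure described in the quoted
commit message can be sketched roughly as below. This is a standalone
illustration, not the patch code: struct frame, FRAME_READY, RING_SIZE,
get_frame(), rx_one() and rx_burst() are hypothetical stand-ins for the
tpacket frame header and the PMD helpers, and __builtin_prefetch is the
GCC/Clang builtin rather than the DPDK prefetch wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE   8u   /* hypothetical number of frames in the ring */
#define FRAME_READY 1u   /* stand-in for the "owned by user" status bit */

struct frame {           /* stand-in for a tpacket frame header */
    uint32_t status;
    uint32_t len;
};

/* Return the frame at idx, wrapping around the ring. */
static struct frame *get_frame(struct frame *ring, unsigned idx)
{
    return &ring[idx % RING_SIZE];
}

static int frame_ready(const struct frame *f)
{
    return (f->status & FRAME_READY) != 0;
}

/* Per-packet work: record the length, then hand the frame back. */
static void rx_one(struct frame *f, uint32_t *out)
{
    *out = f->len;
    f->status = 0;
}

/* Receive up to max packets starting at *head; returns the count. */
static unsigned rx_burst(struct frame *ring, unsigned *head,
                         uint32_t *out, unsigned max)
{
    assert(ring != NULL && head != NULL && out != NULL);
    unsigned idx = *head, n = 0;

    /* Quad loop: runs only while all four frames are ready. */
    while (max - n >= 4) {
        struct frame *f[4];
        for (unsigned i = 0; i < 4; i++)
            f[i] = get_frame(ring, idx + i);
        if (!(frame_ready(f[0]) && frame_ready(f[1]) &&
              frame_ready(f[2]) && frame_ready(f[3])))
            break;
        /* Prefetch headers N+4..N+7 while processing N..N+3. */
        for (unsigned i = 4; i < 8; i++)
            __builtin_prefetch(get_frame(ring, idx + i));
        for (unsigned i = 0; i < 4; i++)
            rx_one(f[i], &out[n + i]);
        idx = (idx + 4) % RING_SIZE;
        n += 4;
    }

    /* Dual loop: picks up a remaining pair of ready frames. */
    while (max - n >= 2) {
        struct frame *f0 = get_frame(ring, idx);
        struct frame *f1 = get_frame(ring, idx + 1);
        if (!(frame_ready(f0) && frame_ready(f1)))
            break;
        rx_one(f0, &out[n]);
        rx_one(f1, &out[n + 1]);
        idx = (idx + 2) % RING_SIZE;
        n += 2;
    }

    /* Single loop: handles whatever is left, one frame at a time. */
    while (n < max) {
        struct frame *f0 = get_frame(ring, idx);
        if (!frame_ready(f0))
            break;
        rx_one(f0, &out[n]);
        idx = (idx + 1) % RING_SIZE;
        n += 1;
    }

    *head = idx;
    return n;
}
```

The sketch leaves out mbuf allocation, so it has no equivalent of the
out_advance1/2/3 early-exit paths; it only shows how the loops fall
through from quad to dual to single as fewer frames are ready.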


Neither this patch nor the previous prefetch proposal has any measurable
impact on performance. I put together a simple perf test, and all three
versions come out the same. The bottleneck is not in this loop; it is
probably in the system calls and copies now.

        Original        Prefetch        Quad/Dual
TX      1.427 Mpps      1.426 Mpps      1.426 Mpps

RX      0.529 Mpps      0.530 Mpps      0.533 Mpps
 loss   87.93%          87.98%          88.0%




I will include the test in the next version of this series and
drop this patch.
