XDP programs read and/or write packet data very early in the receive
path, and the resulting cache misses are seen to be a bottleneck.

In the XDP case, prefetch the packet data for the RX entry 3 packets
ahead of the one being processed. Throughput improved from 10Mpps to
12.5Mpps. LLC misses as reported by perf stat dropped from ~14% to ~7%.
Prefetch distances of 0 through 5 were compared, with distances >3
showing diminishing returns.
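
For illustration only, the look-ahead indexing can be sketched in user
space roughly as below. This is not the driver code: struct rx_slot,
ring_ahead() and process_ring() are hypothetical names, the ring size is
assumed to be a power of two so that a mask can do the wrap-around, and
__builtin_prefetch (a GCC/Clang builtin) stands in for the kernel's
prefetch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Look-ahead distance; 3 is the value this patch settled on. */
#define PREFETCH_DISTANCE 3u

/* Hypothetical stand-in for one RX ring slot; not the mlx4 layout. */
struct rx_slot {
	uint8_t *buf;		/* start of the backing buffer */
	uint32_t offset;	/* offset of the packet data within buf */
};

/* Index `distance` slots ahead, wrapping on a power-of-two sized ring. */
static inline uint32_t ring_ahead(uint32_t index, uint32_t distance,
				  uint32_t size_mask)
{
	return (index + distance) & size_mask;
}

/*
 * Walk `count` slots starting at `start`, summing the first data byte of
 * each slot. Before touching the current slot, issue a prefetch hint for
 * the data of the slot PREFETCH_DISTANCE entries ahead.
 */
static uint64_t process_ring(const struct rx_slot *ring, uint32_t size,
			     uint32_t start, uint32_t count)
{
	uint32_t size_mask = size - 1;
	uint64_t sum = 0;

	/* The mask trick only works for power-of-two ring sizes. */
	assert(size != 0 && (size & size_mask) == 0);

	for (uint32_t n = 0; n < count; n++) {
		uint32_t index = (start + n) & size_mask;
		const struct rx_slot *pref =
			&ring[ring_ahead(index, PREFETCH_DISTANCE, size_mask)];

		/* Hint only: does not fault and does not change results. */
		__builtin_prefetch(pref->buf + pref->offset);

		sum += ring[index].buf[ring[index].offset];
	}
	return sum;
}
```

The prefetch is purely a hint, so the sketch returns the same result with
any distance; only the cache behaviour differs.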

Before:
 21.94%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001d6e4
 12.96%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 12.28%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 11.93%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  4.77%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  3.13%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.68%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.22%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  2.02%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.92%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle

After:
 20.70%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 18.14%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 16.30%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  6.49%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  4.06%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.76%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle
  2.37%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.44%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.43%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.20%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.19%  ksoftirqd/0  [mlx4_core]       [k] 0x0000000000049eb8

Signed-off-by: Brenden Blanco <bbla...@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 41c76fe..65e93f7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -881,10 +881,17 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
                 * read bytes but not past the end of the frag.
                 */
                if (prog) {
+                       struct mlx4_en_rx_alloc *pref;
                        struct xdp_buff xdp;
+                       int pref_index;
                        dma_addr_t dma;
                        u32 act;
 
+                       pref_index = (index + 3) & ring->size_mask;
+                       pref = ring->rx_info +
+                                       (pref_index << priv->log_rx_info);
+                       prefetch(page_address(pref->page) + pref->page_offset);
+
                        dma = be64_to_cpu(rx_desc->data[0].addr);
                        dma_sync_single_for_cpu(priv->ddev, dma,
                                                priv->frag_info[0].frag_size,
-- 
2.8.2
