Hi, Ruifeng,

Could we go further and move the loop inside the conditional? Like this:

	if (mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1) {
		for (i = 0; i < n; ++i) {
			void *buf_addr = elts[i]->buf_addr;

			wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
						      RTE_PKTMBUF_HEADROOM);
			wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
		}
	} else {
		for (i = 0; i < n; ++i) {
			void *buf_addr = elts[i]->buf_addr;

			wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
						      RTE_PKTMBUF_HEADROOM);
		}
	}

What do you think?
Also, we should check that the performance on other archs is not affected.

With best regards,
Slava

> -----Original Message-----
> From: Ruifeng Wang <ruifeng.w...@arm.com>
> Sent: Tuesday, June 1, 2021 11:31
> To: Raslan Darawsheh <rasl...@nvidia.com>; Matan Azrad <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com>; Slava Ovsiienko <viachesl...@nvidia.com>
> Cc: dev@dpdk.org; jer...@marvell.com; n...@arm.com; honnappa.nagaraha...@arm.com; Ruifeng Wang <ruifeng.w...@arm.com>
> Subject: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
>
> MR btree len is a constant during Rx replenish.
> Moved retrieve of the value out of loop to reduce data loads.
> Slight performance uplift was measured on N1SDP.
>
> Signed-off-by: Ruifeng Wang <ruifeng.w...@arm.com>
> ---
>  drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
> index d5af2d91ff..fc7e2a7f41 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.c
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
> @@ -95,6 +95,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
>  	volatile struct mlx5_wqe_data_seg *wq =
>  		&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
>  	unsigned int i;
> +	uint16_t btree_len;
>
>  	if (n >= rxq->rq_repl_thresh) {
>  		MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
> @@ -106,6 +107,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
>  			rxq->stats.rx_nombuf += n;
>  			return;
>  		}
> +
> +		btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
>  		for (i = 0; i < n; ++i) {
>  			void *buf_addr;
>
> @@ -119,8 +122,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
>  			wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
>  						      RTE_PKTMBUF_HEADROOM);
>  			/* If there's a single MR, no need to replace LKey. */
> -			if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh)
> -				     > 1))
> +			if (unlikely(btree_len > 1))
>  				wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
>  		}
>  		rxq->rq_ci += n;
> --
> 2.25.1
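
For reference, here is a minimal standalone sketch of the two loop shapes under discussion: the patch's version (value hoisted, branch still inside the loop) and the suggested version (branch hoisted, one loop per case). It is not driver code. struct seg, HEADROOM, lookup_lkey() and variants_agree() are hypothetical stand-ins for mlx5_wqe_data_seg, RTE_PKTMBUF_HEADROOM, mlx5_rx_mb2mr() and a quick equivalence check, and the byte swap done by rte_cpu_to_be_64() is left out. It only illustrates that both shapes write the same descriptors; it says nothing about performance, which still has to be measured on each arch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the WQE data segment; not the real mlx5 layout. */
struct seg {
	uint64_t addr;
	uint32_t lkey;
};

#define HEADROOM 128u /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Stand-in for mlx5_rx_mb2mr(): derive a fake key from the buffer address. */
static uint32_t lookup_lkey(uint64_t buf_addr)
{
	return (uint32_t)(buf_addr >> 12);
}

/* Shape A: the value is read once, but the branch is evaluated per element. */
static void fill_hoisted(struct seg *wq, const uint64_t *bufs, size_t n,
			 uint16_t btree_len)
{
	for (size_t i = 0; i < n; ++i) {
		wq[i].addr = bufs[i] + HEADROOM;
		if (btree_len > 1)
			wq[i].lkey = lookup_lkey(bufs[i]);
	}
}

/* Shape B: the branch is taken once and selects one of two simpler loops. */
static void fill_unswitched(struct seg *wq, const uint64_t *bufs, size_t n,
			    uint16_t btree_len)
{
	if (btree_len > 1) {
		for (size_t i = 0; i < n; ++i) {
			wq[i].addr = bufs[i] + HEADROOM;
			wq[i].lkey = lookup_lkey(bufs[i]);
		}
	} else {
		for (size_t i = 0; i < n; ++i)
			wq[i].addr = bufs[i] + HEADROOM;
	}
}

/* Returns 1 if both shapes produce identical descriptors for btree_len. */
static int variants_agree(uint16_t btree_len)
{
	const uint64_t bufs[4] = { 0x1000, 0x2000, 0x3000, 0x4000 };
	struct seg a[4], b[4];

	/* Pre-fill lkey so an untouched field is distinguishable from a written one. */
	for (size_t i = 0; i < 4; ++i) {
		a[i].addr = b[i].addr = 0;
		a[i].lkey = b[i].lkey = 0xdeadbeefu;
	}
	fill_hoisted(a, bufs, 4, btree_len);
	fill_unswitched(b, bufs, 4, btree_len);
	for (size_t i = 0; i < 4; ++i) {
		if (a[i].addr != b[i].addr || a[i].lkey != b[i].lkey)
			return 0;
	}
	return 1;
}
```

Calling variants_agree(1) covers the single-MR case, where lkey is left untouched, and variants_agree(2) covers the multi-MR case, where every lkey is rewritten.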