On Fri, 11 Mar 2016 15:39:47 +0200 Saeed Mahameed <sae...@mellanox.com> wrote:

> From: Tariq Toukan <tar...@mellanox.com>
> 
> Introduce the multi-packet WQE (RX Work Queue Element) feature, also
> referred to as MPWQE or Striding RQ, in which WQEs are larger and each
> serves multiple packets.
> 
> Every WQE consists of many equal-sized strides; every received packet
> is aligned to the beginning of a stride and is written to consecutive
> strides within a WQE.

I really like this HW support! :-)

I noticed the "Multi-Packet WQE" send format, but I could not find the
receive part in the programmers ref doc, until I started searching for
"stride".


> In the regular approach, each WQE is big enough to serve one received
> packet of any size up to MTU, or up to 64K when device LRO is enabled,
> which is very wasteful when dealing with small packets or when LRO is
> enabled.
> 
> Thanks to its flexibility, MPWQE allows better memory utilization
> (implying improvements in CPU utilization and packet rate), as packets
> consume strides according to their size, leaving the rest of the WQE
> available for other packets.

It does allow significantly better memory utilization (even if Eric
cannot see it, I can).

One issue with this approach is that we can no longer use the
packet data as the skb->data pointer.  (AFAIK because we cannot use
dma_unmap any longer, and instead we need to use dma_sync.)

Thus, for every single packet you are now allocating a new memory area
for skb->data.
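
To spell that out with a rough sketch (simplified, not the literal patch
code; identifiers follow the RX handler quoted further down): since many
packets share one mapped page, the page cannot be handed to the stack as
skb->data, so every completion has to do something like:

    skb = netdev_alloc_skb(netdev, headlen);          /* fresh skb->data */
    dma_sync_single_for_cpu(pdev, dma_addr + stride_offset,
                            consumed_bytes, DMA_FROM_DEVICE); /* no unmap */
    skb_copy_to_linear_data(skb, page_address(page) + stride_offset,
                            headlen);
    skb_put(skb, headlen);
    get_page(page);                /* the rest rides along as a page frag */

i.e. one extra allocation plus a small copy per packet, in exchange for
the denser packing.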


> MPWQE default configuration:
>       NUM WQEs = 16
>       Strides Per WQE = 1024
>       Stride Size = 128
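
For reference, the arithmetic behind those defaults (assuming 4 KiB
pages):

    1024 strides * 128 B   = 128 KiB per WQE  (one order-5 compound page)
    16 WQEs      * 128 KiB =   2 MiB pinned per RQ
    16 WQEs      * 1024    = 16384 strides, i.e. room for up to ~16K
                             small packets in flight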

> Performance tested on ConnectX4-Lx 50G.
> 
> * Netperf single TCP stream:
> - message size = 1024,  bw raised from ~12300 mbps to 14900 mbps (+20%)
> - message size = 65536, bw raised from ~21800 mbps to 33500 mbps (+50%)
> - with other message sizes we saw some gain or no degradation.
> 
> * Netperf multi TCP stream:
> - No degradation, line rate reached.
> 
> * Pktgen: packet loss in bursts of N small packets (64 byte), single stream
> - | num packets | packet loss before | packet loss after
>   |     2K      |        ~1K         |         0
>   |    16K      |       ~13K         |         0
>   |    32K      |       ~29K         |        14K
> 
> As expected, since the driver can now receive as many small packets
> (<=128 bytes) as the total number of strides in the ring (default =
> 1024 * 16 = 16384), vs. 1024 (the default ring size, regardless of
> packet size) before this feature.
> 
> Signed-off-by: Tariq Toukan <tar...@mellanox.com>
> Signed-off-by: Achiad Shochat <ach...@mellanox.com>
> Signed-off-by: Saeed Mahameed <sae...@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       |   71 +++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   15 ++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  109 +++++++++++++----
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  126 ++++++++++++++++++--
>  include/linux/mlx5/device.h                        |   39 ++++++-
>  include/linux/mlx5/mlx5_ifc.h                      |   13 ++-
>  6 files changed, 327 insertions(+), 46 deletions(-)
> 
[...]
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -76,6 +76,33 @@ err_free_skb:
>       return -ENOMEM;
>  }
>  
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +{
> +     struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +     int ret = 0;
> +
> +     wi->dma_info.page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_COLD,
> +                                     MLX5_MPWRQ_WQE_PAGE_ORDER);

Order 5 page = 131072 bytes, but we only alloc 16 of them.
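
For anyone wondering where the 5 comes from (illustrative; the actual
macro in en.h may be expressed via the log sizes):

    /* MLX5_MPWRQ_NUM_STRIDES * MLX5_MPWRQ_STRIDE_SIZE = 1024 * 128
     *                                                 = 131072 B per WQE
     * get_order(131072) = 5 with 4 KiB pages, hence
     * MLX5_MPWRQ_WQE_PAGE_ORDER == 5 here.
     */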

> +     if (unlikely(!wi->dma_info.page))
> +             return -ENOMEM;
> +
> +     wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> +                                      rq->wqe_sz, PCI_DMA_FROMDEVICE);

Mapping the entire page is going to make PowerPC owners happy.
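
Put differently, with the defaults above a single dma_map_page() now
covers a whole WQE of 1024 strides, so up to ~1024 small packets share
one IOMMU mapping instead of paying for a map/unmap per packet.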

> +     if (dma_mapping_error(rq->pdev, wi->dma_info.addr)) {
> +             ret = -ENOMEM;
> +             goto err_put_page;
> +     }
> +
> +     wi->consumed_strides = 0;
> +     wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> +
> +     return 0;
> +
> +err_put_page:
> +     put_page(wi->dma_info.page);
> +     return ret;
> +}
> +
[...]
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +{
> +     u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
> +     u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
> +     u32 consumed_bytes = cstrides  * MLX5_MPWRQ_STRIDE_SIZE;
> +     u32 stride_offset  = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
> +     u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
> +     struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
> +     struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
> +     struct sk_buff *skb;
> +     u16 byte_cnt;
> +     u16 cqe_bcnt;
> +     u16 headlen;
> +
> +     wi->consumed_strides += cstrides;

Ok, accounting the N consumed strides, so we know further down when the
whole WQE has been used up.

> +
> +     if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
> +             rq->stats.wqe_err++;
> +             goto mpwrq_cqe_out;
> +     }
> +
> +     if (mpwrq_is_filler_cqe(cqe)) {
> +             rq->stats.mpwqe_filler++;
> +             goto mpwrq_cqe_out;
> +     }
> +
> +     skb = netdev_alloc_skb(rq->netdev, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
> +     if (unlikely(!skb))
> +             goto mpwrq_cqe_out;
> +
> +     dma_sync_single_for_cpu(rq->pdev, wi->dma_info.addr + stride_offset,
> +                             consumed_bytes, DMA_FROM_DEVICE);
> +
> +     cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
> +     headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
> +     skb_copy_to_linear_data(skb,
> +                             page_address(wi->dma_info.page) + stride_offset,
> +                             headlen);
> +     skb_put(skb, headlen);
> +
> +     byte_cnt = cqe_bcnt - headlen;
> +     if (byte_cnt) {
> +             skb_frag_t *f0 = &skb_shinfo(skb)->frags[0];
> +
> +             skb_shinfo(skb)->nr_frags = 1;
> +
> +             skb->data_len  = byte_cnt;
> +             skb->len      += byte_cnt;
> +             skb->truesize  = SKB_TRUESIZE(skb->len);
> +
> +             get_page(wi->dma_info.page);
> +             skb_frag_set_page(skb, 0, wi->dma_info.page);
> +             skb_frag_size_set(f0, skb->data_len);
> +             f0->page_offset = stride_offset + headlen;
> +     }
> +
> +     mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
> +
> +mpwrq_cqe_out:
> +     if (likely(wi->consumed_strides < MLX5_MPWRQ_NUM_STRIDES))
> +             return;

Due to the return statement, we keep working on the same big page, only
dma_sync'ing what we need.
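
A small worked example of that accounting (stride size 128 B, and
assuming the HW rounds every packet up to whole strides, as the stride
alignment in the commit message implies): a 300 B packet landing at
stride index 8 gives

    cstrides       = 3                    (300 B rounded up to 3 strides)
    consumed_bytes = 3 * 128 = 384 B
    stride_offset  = 8 * 128 = 1024 B

so only bytes [1024, 1408) of the big page are dma_sync'ed for this CQE.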

> +
> +     dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> +                    PCI_DMA_FROMDEVICE);

The page is only fully dma_unmap'ed after all stride-entries have been
processed/consumed.
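
So the page lifetime, sketched end to end (not literal driver code):

    page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_COLD, order);
    dma_addr = dma_map_page(dev, page, 0, wqe_sz, DMA_FROM_DEVICE); /* once */

    /* per completed packet that carries payload beyond the copied headers */
    get_page(page);              /* the skb frag takes its own reference */

    /* once consumed_strides reaches MLX5_MPWRQ_NUM_STRIDES */
    dma_unmap_page(dev, dma_addr, wqe_sz, DMA_FROM_DEVICE);
    put_page(page);              /* RQ drops its ref; the page is freed
                                    when the last skb frag releases it  */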

> +     put_page(wi->dma_info.page);
> +     mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
> +}
> +
>  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  {
>       struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
