Hi Anoob,

Replies are inline.
Thanks,
Cheng

> -----Original Message-----
> From: Anoob Joseph <ano...@marvell.com>
> Sent: Friday, June 23, 2023 2:53 PM
> To: Jiang, Cheng1 <cheng1.ji...@intel.com>; tho...@monjalon.net;
> Richardson, Bruce <bruce.richard...@intel.com>; m...@smartsharesystems.com;
> Xia, Chenbo <chenbo....@intel.com>; Amit Prakash Shukla
> <amitpraka...@marvell.com>; huangdeng...@huawei.com; Laatz, Kevin
> <kevin.la...@intel.com>; fengcheng...@huawei.com; Jerin Jacob Kollanukkaran
> <jer...@marvell.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Ding, Xuan
> <xuan.d...@intel.com>; Ma, WenwuX <wenwux...@intel.com>; Wang, YuanX
> <yuanx.w...@intel.com>; He, Xingguang <xingguang...@intel.com>; Ling, WeiX
> <weix.l...@intel.com>
> Subject: RE: [EXT] [PATCH v8] app/dma-perf: introduce dma-perf application
>
> Hi Cheng,
>
> Thanks for the new version. Please see inline.
>
> Thanks,
> Anoob
>
> > -----Original Message-----
> > From: Cheng Jiang <cheng1.ji...@intel.com>
> > Sent: Tuesday, June 20, 2023 12:24 PM
> > To: tho...@monjalon.net; bruce.richard...@intel.com;
> > m...@smartsharesystems.com; chenbo....@intel.com; Amit Prakash Shukla
> > <amitpraka...@marvell.com>; Anoob Joseph <ano...@marvell.com>;
> > huangdeng...@huawei.com; kevin.la...@intel.com; fengcheng...@huawei.com;
> > Jerin Jacob Kollanukkaran <jer...@marvell.com>
> > Cc: dev@dpdk.org; jiayu...@intel.com; xuan.d...@intel.com;
> > wenwux...@intel.com; yuanx.w...@intel.com; xingguang...@intel.com;
> > weix.l...@intel.com; Cheng Jiang <cheng1.ji...@intel.com>
> > Subject: [EXT] [PATCH v8] app/dma-perf: introduce dma-perf application
> >
> > External Email
> >
> > ----------------------------------------------------------------------
> > There are many high-performance DMA devices supported in DPDK now, and
> > these DMA devices can also be integrated into other modules of DPDK as
> > accelerators, such as Vhost. Before integrating DMA into applications,
> > developers need to know the performance of these DMA devices in various
> > scenarios and the performance of CPUs in the same scenario, such as
> > different buffer lengths. Only in this way can we know the target
> > performance of the application accelerated by using them. This patch
> > introduces a high-performance testing tool, which supports comparing the
> > performance of CPU and DMA in different scenarios automatically with a
> > pre-set config file. Memory copy performance tests are supported for now.
> >
> > Signed-off-by: Cheng Jiang <cheng1.ji...@intel.com>
> > Signed-off-by: Jiayu Hu <jiayu...@intel.com>
> > Signed-off-by: Yuan Wang <yuanx.w...@intel.com>
> > Acked-by: Morten Brørup <m...@smartsharesystems.com>
> > Acked-by: Chenbo Xia <chenbo....@intel.com>
> > ---
> > v8:
> > fixed string copy issue in parse_lcore();
> > improved some data display format;
> > added doc in doc/guides/tools;
> > updated release notes;
> >
> > v7:
> > fixed some strcpy issues;
> > removed cache setup in calling rte_pktmbuf_pool_create();
> > fixed some typos;
> > added some memory free and null set operations;
> > improved result calculation;
> > v6:
> > improved code based on Anoob's comments;
> > fixed some code structure issues;
> > v5:
> > fixed some LONG_LINE warnings;
> > v4:
> > fixed inaccuracy of the memory footprint display;
> > v3:
> > fixed some typos;
> > v2:
> > added lcore/dmadev designation;
> > added error case process;
> > removed worker_threads parameter from config.ini;
> > improved the logs;
> > improved config file;
> >
> >  app/meson.build                        |   1 +
> >  app/test-dma-perf/benchmark.c          | 498 +++++++++++++++++++++
> >  app/test-dma-perf/config.ini           |  61 +++
> >  app/test-dma-perf/main.c               | 594 +++++++++++++++++++++++++
> >  app/test-dma-perf/main.h               |  69 +++
> >  app/test-dma-perf/meson.build          |  17 +
> >  doc/guides/rel_notes/release_23_07.rst |   6 +
> >  doc/guides/tools/dmaperf.rst           | 103 +++++
> >  doc/guides/tools/index.rst             |   1 +
> >  9 files changed, 1350 insertions(+)
> >  create mode 100644 app/test-dma-perf/benchmark.c
> >  create mode 100644 app/test-dma-perf/config.ini
> >  create mode 100644 app/test-dma-perf/main.c
> >  create mode 100644 app/test-dma-perf/main.h
> >  create mode 100644 app/test-dma-perf/meson.build
> >  create mode 100644 doc/guides/tools/dmaperf.rst
> >
> > <snip>
> >
> > +/* Configuration of device. */
> > +static void
> > +configure_dmadev_queue(uint32_t dev_id, uint32_t ring_size)
> > +{
> > +	uint16_t vchan = 0;
> > +	struct rte_dma_info info;
> > +	struct rte_dma_conf dev_config = { .nb_vchans = 1 };
> > +	struct rte_dma_vchan_conf qconf = {
> > +		.direction = RTE_DMA_DIR_MEM_TO_MEM,
> > +		.nb_desc = ring_size
> > +	};
> > +
> > +	if (rte_dma_configure(dev_id, &dev_config) != 0)
> > +		rte_exit(EXIT_FAILURE, "Error with dma configure.\n");
> > +
> > +	if (rte_dma_vchan_setup(dev_id, vchan, &qconf) != 0)
> > +		rte_exit(EXIT_FAILURE, "Error with queue configuration.\n");
> > +
> > +	rte_dma_info_get(dev_id, &info);
>
> [Anoob] This API can return errors. Better to add handling.

[Cheng] Sure, I'll fix it in the next version.

> > +	if (info.nb_vchans != 1)
> > +		rte_exit(EXIT_FAILURE, "Error, no configured queues reported on device id. %u\n",
> > +				dev_id);
> > +
> > +	if (rte_dma_start(dev_id) != 0)
> > +		rte_exit(EXIT_FAILURE, "Error with dma start.\n");
> > +}
> > +
> > +static int
> > +config_dmadevs(struct test_configure *cfg)
> > +{
> > +	uint32_t ring_size = cfg->ring_size.cur;
> > +	struct lcore_dma_map_t *ldm = &cfg->lcore_dma_map;
> > +	uint32_t nb_workers = ldm->cnt;
> > +	uint32_t i;
> > +	int dev_id;
> > +	uint16_t nb_dmadevs = 0;
> > +	char *dma_name;
> > +
> > +	for (i = 0; i < ldm->cnt; i++) {
> > +		dma_name = ldm->dma_names[i];
> > +		dev_id = rte_dma_get_dev_id_by_name(dma_name);
> > +		if (dev_id == -1) {
>
> [Anoob] Can you check the above API definition? I think it returns not just
> -1 in case of errors.

[Cheng] Yes, you are right, I'll fix it in the next version. Thanks a lot.
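To be concrete, the two checks would end up something along these lines
(untested sketch; the exact error messages may differ in the next version):

	if (rte_dma_info_get(dev_id, &info) != 0)
		rte_exit(EXIT_FAILURE, "Error with getting device info.\n");

and

	dev_id = rte_dma_get_dev_id_by_name(dma_name);
	if (dev_id < 0) {
		fprintf(stderr, "Error: Fail to find DMA %s.\n", dma_name);
		goto end;
	}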
> > +			fprintf(stderr, "Error: Fail to find DMA %s.\n", dma_name);
> > +			goto end;
> > +		}
> > +
> > +		ldm->dma_ids[i] = dev_id;
> > +		configure_dmadev_queue(dev_id, ring_size);
> > +		++nb_dmadevs;
> > +	}
> > +
> > +end:
> > +	if (nb_dmadevs < nb_workers) {
> > +		printf("Not enough dmadevs (%u) for all workers (%u).\n",
> > +				nb_dmadevs, nb_workers);
> > +		return -1;
> > +	}
> > +
> > +	printf("Number of used dmadevs: %u.\n", nb_dmadevs);
> > +
> > +	return 0;
> > +}
> > +
> > +static inline void
> > +do_dma_submit_and_poll(uint16_t dev_id, uint64_t *async_cnt,
> > +			volatile struct worker_info *worker_info)
> > +{
> > +	int ret;
> > +	uint16_t nr_cpl;
> > +
> > +	ret = rte_dma_submit(dev_id, 0);
> > +	if (ret < 0) {
> > +		rte_dma_stop(dev_id);
> > +		rte_dma_close(dev_id);
> > +		rte_exit(EXIT_FAILURE, "Error with dma submit.\n");
> > +	}
> > +
> > +	nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +	*async_cnt -= nr_cpl;
> > +	worker_info->total_cpl += nr_cpl;
> > +}
> > +
> > +static inline int
> > +do_dma_mem_copy(void *p)
>
> [Anoob] Just curious, why not pass struct lcore_params *para itself? Is it
> because the pointer is volatile? If yes, then we can take an AI to split the
> struct into volatile and non-volatile parts.

[Cheng] I did it this way because this function is launched on another core
with rte_eal_remote_launch(), which takes a void * as its argument; that is
why the parameter is void *p. Your suggestion to split the struct into
volatile and non-volatile parts is quite reasonable, and I am thinking about
the best way to implement it. Thanks.

> > +{
> > +	const uint16_t *para_idx = (uint16_t *)p;
> > +	volatile struct lcore_params *para = lcores_p[*para_idx].v_ptr;
> > +	volatile struct worker_info *worker_info = &(para->worker_info);
> > +	const uint16_t dev_id = para->dev_id;
> > +	const uint32_t nr_buf = para->nr_buf;
> > +	const uint16_t kick_batch = para->kick_batch;
> > +	const uint32_t buf_size = para->buf_size;
> > +	struct rte_mbuf **srcs = para->srcs;
> > +	struct rte_mbuf **dsts = para->dsts;
> > +	uint16_t nr_cpl;
> > +	uint64_t async_cnt = 0;
> > +	uint32_t i;
> > +	uint32_t poll_cnt = 0;
> > +	int ret;
> > +
> > +	worker_info->stop_flag = false;
> > +	worker_info->ready_flag = true;
> > +
> > +	while (!worker_info->start_flag)
> > +		;
> > +
> > +	while (1) {
> > +		for (i = 0; i < nr_buf; i++) {
> > +dma_copy:
> > +			ret = rte_dma_copy(dev_id, 0, rte_pktmbuf_iova(srcs[i]),
> > +					rte_pktmbuf_iova(dsts[i]), buf_size, 0);
>
> [Anoob] Do we need to use 'rte_mbuf_data_iova' here instead of
> 'rte_pktmbuf_iova'?

[Cheng] Yes, rte_mbuf_data_iova is more appropriate. I'll fix it in the next
version. Thanks.

> > +			if (unlikely(ret < 0)) {
> > +				if (ret == -ENOSPC) {
> > +					do_dma_submit_and_poll(dev_id, &async_cnt, worker_info);
> > +					goto dma_copy;
> > +				} else {
> > +					/* Error exit */
> > +					rte_dma_stop(dev_id);
>
> [Anoob] Missing rte_dma_close() here. Also, may be introduce a static void
> function so that rte_exit call etc won't be part of the fastpath loop.
>
> May be something like below and you can call it from here and
> "do_dma_submit_and_poll".
>
> static void
> error_exit(int dev_id)
> {
> 	/* Error exit */
> 	rte_dma_stop(dev_id);
> 	rte_dma_close(dev_id);
> 	rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> }

[Cheng] I'm not so sure here. rte_dma_close() is called in rte_exit(). Do we
still need to call it explicitly before rte_exit()?
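Either way, with the rte_mbuf_data_iova() change agreed above, the enqueue
path would look roughly like this (untested sketch; error_exit() is the
helper you sketched, and whether it should really call rte_dma_close() is
the open question above):

	ret = rte_dma_copy(dev_id, 0, rte_mbuf_data_iova(srcs[i]),
			rte_mbuf_data_iova(dsts[i]), buf_size, 0);
	if (unlikely(ret < 0)) {
		if (ret == -ENOSPC) {
			do_dma_submit_and_poll(dev_id, &async_cnt, worker_info);
			goto dma_copy;
		}
		/* error path moved out of the fastpath loop */
		error_exit(dev_id);
	}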
> > +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> > +				}
> > +			}
> > +			async_cnt++;
> > +
> > +			if ((async_cnt % kick_batch) == 0)
> > +				do_dma_submit_and_poll(dev_id, &async_cnt, worker_info);
> > +		}
> > +
> > +		if (worker_info->stop_flag)
> > +			break;
> > +	}
> > +
> > +	rte_dma_submit(dev_id, 0);
> > +	while ((async_cnt > 0) && (poll_cnt++ < POLL_MAX)) {
> > +		nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +		async_cnt -= nr_cpl;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > <snip>
> >
> > +static int
> > +setup_memory_env(struct test_configure *cfg, struct rte_mbuf ***srcs,
> > +			struct rte_mbuf ***dsts)
> > +{
> > +	unsigned int buf_size = cfg->buf_size.cur;
> > +	unsigned int nr_sockets;
> > +	uint32_t nr_buf = cfg->nr_buf;
> > +
> > +	nr_sockets = rte_socket_count();
> > +	if (cfg->src_numa_node >= nr_sockets ||
> > +		cfg->dst_numa_node >= nr_sockets) {
> > +		printf("Error: Source or destination numa exceeds the acture numa nodes.\n");
> > +		return -1;
> > +	}
> > +
> > +	src_pool = rte_pktmbuf_pool_create("Benchmark_DMA_SRC",
> > +			nr_buf,
> > +			0,
> > +			0,
> > +			buf_size + RTE_PKTMBUF_HEADROOM,
> > +			cfg->src_numa_node);
> > +	if (src_pool == NULL) {
> > +		PRINT_ERR("Error with source mempool creation.\n");
> > +		return -1;
> > +	}
> > +
> > +	dst_pool = rte_pktmbuf_pool_create("Benchmark_DMA_DST",
> > +			nr_buf,
> > +			0,
> > +			0,
> > +			buf_size + RTE_PKTMBUF_HEADROOM,
> > +			cfg->dst_numa_node);
> > +	if (dst_pool == NULL) {
> > +		PRINT_ERR("Error with destination mempool creation.\n");
> > +		return -1;
> > +	}
> > +
> > +	*srcs = rte_malloc(NULL, nr_buf * sizeof(struct rte_mbuf *), 0);
> > +	if (*srcs == NULL) {
> > +		printf("Error: srcs malloc failed.\n");
> > +		return -1;
> > +	}
> > +
> > +	*dsts = rte_malloc(NULL, nr_buf * sizeof(struct rte_mbuf *), 0);
> > +	if (*dsts == NULL) {
> > +		printf("Error: dsts malloc failed.\n");
> > +		return -1;
> > +	}
> > +
> > +	if (rte_mempool_get_bulk(src_pool, (void **)*srcs, nr_buf) != 0) {
>
> [Anoob] Might be better to use 'rte_pktmbuf_alloc_bulk' since we use
> 'rte_mbuf_data_iova' in the datapath and it is desirable to initialize it
> properly as an mbuf. Same comment below as well.

[Cheng] Sure, I'll fix it in the next version (see the sketch further down).

> <snip>
>
> > +
> > +	for (i = 0; i < nb_workers; i++) {
> > +		calc_result(buf_size, nr_buf, nb_workers, test_secs,
> > +				lcores_p[i].v_ptr->worker_info.test_cpl,
> > +				&memory, &avg_cycles, &bandwidth, &mops);
> > +		output_result(cfg->scenario_id, lcores_p[i].v_ptr->lcore_id,
> > +				lcores_p[i].v_ptr->dma_name, avg_cycles, buf_size,
> > +				nr_buf / nb_workers, memory, bandwidth, mops, is_dma);
> > +	}
>
> [Anoob] Can you also print total_bandwidth & total_mops? It can be a simple
> aggregation in the above loop. Would help when we are dealing with larger
> number of queues but single hardware block.

[Cheng] Sure, good point. I'll add it in the next version, thanks.

> > +
> > +out:
> > +	/* free mbufs used in the test */
> > +	if (srcs)
>
> [Anoob] DPDK coding guidelines recommend a usage like below,
> if (srcs != NULL)

[Cheng] Sure, thanks. I'll fix it in the next version.
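Coming back to the rte_pktmbuf_alloc_bulk() comment above, the change would
presumably be along these lines (untested sketch; the error messages are
placeholders):

	if (rte_pktmbuf_alloc_bulk(src_pool, *srcs, nr_buf) != 0) {
		printf("Error: alloc src mbufs failed.\n");
		return -1;
	}

	if (rte_pktmbuf_alloc_bulk(dst_pool, *dsts, nr_buf) != 0) {
		printf("Error: alloc dst mbufs failed.\n");
		return -1;
	}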
> > +		rte_pktmbuf_free_bulk(srcs, nr_buf);
> > +	if (dsts)
> > +		rte_pktmbuf_free_bulk(dsts, nr_buf);
> > +
> > +	/* free the points for the mbufs */
> > +	rte_free(srcs);
> > +	srcs = NULL;
> > +	rte_free(dsts);
> > +	dsts = NULL;
> > +
> > +	if (src_pool) {
> > +		rte_mempool_free(src_pool);
> > +		src_pool = NULL;
> > +	}
> > +	if (dst_pool) {
> > +		rte_mempool_free(dst_pool);
> > +		src_pool = NULL;
>
> [Anoob] Should be dst_pool, right?

[Cheng] Yes, sorry for missing that. I'll fix it in the next version.

> > <snip>
> >
> > diff --git a/app/test-dma-perf/main.h b/app/test-dma-perf/main.h
> > new file mode 100644
> > index 0000000000..215ac42673
> > --- /dev/null
> > +++ b/app/test-dma-perf/main.h
> > @@ -0,0 +1,69 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2023 Intel Corporation
> > + */
> > +
> > +#ifndef _MAIN_H_
> > +#define _MAIN_H_
> > +
> > +
> > +#include <rte_common.h>
> > +#include <rte_cycles.h>
> > +#include <rte_dev.h>
> > +#include <rte_dmadev.h>
>
> [Anoob] Is the above include (rte_dmadev.h) required?

[Cheng] You are right, it's not required. I'll remove it in the next version.

> > +
> > +#ifndef __maybe_unused
> > +#define __maybe_unused __rte_unused
> > +#endif
>
> [Anoob] Can you try to avoid this and use __rte_unused or RTE_SET_USED in
> the cache_flush_buf() function?

[Cheng] Sure, I'll fix it in the next version.
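For cache_flush_buf(), dropping the local __maybe_unused macro and marking
the parameters directly could look something like this (rough sketch only;
the parameter list here is assumed for illustration, not copied from the
patch):

static void
cache_flush_buf(struct rte_mbuf **array __rte_unused,
		uint32_t buf_size __rte_unused,
		uint32_t nr_buf __rte_unused)
{
#ifdef RTE_ARCH_X86_64
	/* the existing x86 cache-flush loop stays here and actually
	 * uses all three parameters */
#endif
}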