Hi Cheng, Thanks for the new version. Please see inline.
Thanks, Anoob > -----Original Message----- > From: Cheng Jiang <cheng1.ji...@intel.com> > Sent: Tuesday, June 20, 2023 12:24 PM > To: tho...@monjalon.net; bruce.richard...@intel.com; > m...@smartsharesystems.com; chenbo....@intel.com; Amit Prakash Shukla > <amitpraka...@marvell.com>; Anoob Joseph <ano...@marvell.com>; > huangdeng...@huawei.com; kevin.la...@intel.com; > fengcheng...@huawei.com; Jerin Jacob Kollanukkaran > <jer...@marvell.com> > Cc: dev@dpdk.org; jiayu...@intel.com; xuan.d...@intel.com; > wenwux...@intel.com; yuanx.w...@intel.com; xingguang...@intel.com; > weix.l...@intel.com; Cheng Jiang <cheng1.ji...@intel.com> > Subject: [EXT] [PATCH v8] app/dma-perf: introduce dma-perf application > > External Email > > ---------------------------------------------------------------------- > There are many high-performance DMA devices supported in DPDK now, > and these DMA devices can also be integrated into other modules of DPDK as > accelerators, such as Vhost. Before integrating DMA into applications, > developers need to know the performance of these DMA devices in various > scenarios and the performance of CPUs in the same scenario, such as > different buffer lengths. Only in this way can we know the target > performance of the application accelerated by using them. This patch > introduces a high-performance testing tool, which supports comparing the > performance of CPU and DMA in different scenarios automatically with a pre- > set config file. Memory Copy performance test are supported for now. > > Signed-off-by: Cheng Jiang <cheng1.ji...@intel.com> > Signed-off-by: Jiayu Hu <jiayu...@intel.com> > Signed-off-by: Yuan Wang <yuanx.w...@intel.com> > Acked-by: Morten Brørup <m...@smartsharesystems.com> > Acked-by: Chenbo Xia <chenbo....@intel.com> > --- > v8: > fixed string copy issue in parse_lcore(); > improved some data display format; > added doc in doc/guides/tools; > updated release notes; > > v7: > fixed some strcpy issues; > removed cache setup in calling rte_pktmbuf_pool_create(); > fixed some typos; > added some memory free and null set operations; > improved result calculation; > v6: > improved code based on Anoob's comments; > fixed some code structure issues; > v5: > fixed some LONG_LINE warnings; > v4: > fixed inaccuracy of the memory footprint display; > v3: > fixed some typos; > v2: > added lcore/dmadev designation; > added error case process; > removed worker_threads parameter from config.ini; > improved the logs; > improved config file; > > app/meson.build | 1 + > app/test-dma-perf/benchmark.c | 498 +++++++++++++++++++++ > app/test-dma-perf/config.ini | 61 +++ > app/test-dma-perf/main.c | 594 +++++++++++++++++++++++++ > app/test-dma-perf/main.h | 69 +++ > app/test-dma-perf/meson.build | 17 + > doc/guides/rel_notes/release_23_07.rst | 6 + > doc/guides/tools/dmaperf.rst | 103 +++++ > doc/guides/tools/index.rst | 1 + > 9 files changed, 1350 insertions(+) > create mode 100644 app/test-dma-perf/benchmark.c create mode 100644 > app/test-dma-perf/config.ini create mode 100644 app/test-dma- > perf/main.c create mode 100644 app/test-dma-perf/main.h create mode > 100644 app/test-dma-perf/meson.build create mode 100644 > doc/guides/tools/dmaperf.rst > <snip> > +/* Configuration of device. */ > +static void > +configure_dmadev_queue(uint32_t dev_id, uint32_t ring_size) { > + uint16_t vchan = 0; > + struct rte_dma_info info; > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > + struct rte_dma_vchan_conf qconf = { > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > + .nb_desc = ring_size > + }; > + > + if (rte_dma_configure(dev_id, &dev_config) != 0) > + rte_exit(EXIT_FAILURE, "Error with dma configure.\n"); > + > + if (rte_dma_vchan_setup(dev_id, vchan, &qconf) != 0) > + rte_exit(EXIT_FAILURE, "Error with queue configuration.\n"); > + > + rte_dma_info_get(dev_id, &info); [Anoob] This API can return errors. Better to add handling. > + if (info.nb_vchans != 1) > + rte_exit(EXIT_FAILURE, "Error, no configured queues > reported on device id. %u\n", > + dev_id); > + > + if (rte_dma_start(dev_id) != 0) > + rte_exit(EXIT_FAILURE, "Error with dma start.\n"); } > + > +static int > +config_dmadevs(struct test_configure *cfg) { > + uint32_t ring_size = cfg->ring_size.cur; > + struct lcore_dma_map_t *ldm = &cfg->lcore_dma_map; > + uint32_t nb_workers = ldm->cnt; > + uint32_t i; > + int dev_id; > + uint16_t nb_dmadevs = 0; > + char *dma_name; > + > + for (i = 0; i < ldm->cnt; i++) { > + dma_name = ldm->dma_names[i]; > + dev_id = rte_dma_get_dev_id_by_name(dma_name); > + if (dev_id == -1) { [Anoob] Can you check the above API definition? I think it returns not just -1 in case of errors. > + fprintf(stderr, "Error: Fail to find DMA %s.\n", > dma_name); > + goto end; > + } > + > + ldm->dma_ids[i] = dev_id; > + configure_dmadev_queue(dev_id, ring_size); > + ++nb_dmadevs; > + } > + > +end: > + if (nb_dmadevs < nb_workers) { > + printf("Not enough dmadevs (%u) for all workers (%u).\n", > nb_dmadevs, nb_workers); > + return -1; > + } > + > + printf("Number of used dmadevs: %u.\n", nb_dmadevs); > + > + return 0; > +} > + > +static inline void > +do_dma_submit_and_poll(uint16_t dev_id, uint64_t *async_cnt, > + volatile struct worker_info *worker_info) { > + int ret; > + uint16_t nr_cpl; > + > + ret = rte_dma_submit(dev_id, 0); > + if (ret < 0) { > + rte_dma_stop(dev_id); > + rte_dma_close(dev_id); > + rte_exit(EXIT_FAILURE, "Error with dma submit.\n"); > + } > + > + nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, > NULL); > + *async_cnt -= nr_cpl; > + worker_info->total_cpl += nr_cpl; > +} > + > +static inline int > +do_dma_mem_copy(void *p) [Anoob] Just curious, why not pass struct lcore_params *para itself? Is it because the pointer is volatile? If yes, then we can take an AI to split the struct into volatile and non-volatile parts. > +{ > + const uint16_t *para_idx = (uint16_t *)p; > + volatile struct lcore_params *para = lcores_p[*para_idx].v_ptr; > + volatile struct worker_info *worker_info = &(para->worker_info); > + const uint16_t dev_id = para->dev_id; > + const uint32_t nr_buf = para->nr_buf; > + const uint16_t kick_batch = para->kick_batch; > + const uint32_t buf_size = para->buf_size; > + struct rte_mbuf **srcs = para->srcs; > + struct rte_mbuf **dsts = para->dsts; > + uint16_t nr_cpl; > + uint64_t async_cnt = 0; > + uint32_t i; > + uint32_t poll_cnt = 0; > + int ret; > + > + worker_info->stop_flag = false; > + worker_info->ready_flag = true; > + > + while (!worker_info->start_flag) > + ; > + > + while (1) { > + for (i = 0; i < nr_buf; i++) { > +dma_copy: > + ret = rte_dma_copy(dev_id, 0, > rte_pktmbuf_iova(srcs[i]), > + rte_pktmbuf_iova(dsts[i]), buf_size, 0); [Anoob] Do we need to use ' rte_mbuf_data_iova' here instead of 'rte_pktmbuf_iova'? > + if (unlikely(ret < 0)) { > + if (ret == -ENOSPC) { > + do_dma_submit_and_poll(dev_id, > &async_cnt, worker_info); > + goto dma_copy; > + } else { > + /* Error exit */ > + rte_dma_stop(dev_id); [Anoob] Missing rte_dma_close() here. Also, may be introduce a static void function so that rte_exit call etc won't be part of the fastpath loop. May be something like below and you can call it from here and "do_dma_submit_and_poll". static void error_exit(int dev_id) { /* Error exit */ rte_dma_stop(dev_id); rte_dma_close(dev_id); rte_exit(EXIT_FAILURE, "DMA enqueue failed\n"); } > + rte_exit(EXIT_FAILURE, "DMA > enqueue failed\n"); > + } > + } > + async_cnt++; > + > + if ((async_cnt % kick_batch) == 0) > + do_dma_submit_and_poll(dev_id, > &async_cnt, worker_info); > + } > + > + if (worker_info->stop_flag) > + break; > + } > + > + rte_dma_submit(dev_id, 0); > + while ((async_cnt > 0) && (poll_cnt++ < POLL_MAX)) { > + nr_cpl = rte_dma_completed(dev_id, 0, > MAX_DMA_CPL_NB, NULL, NULL); > + async_cnt -= nr_cpl; > + } > + > + return 0; > +} > + <snip> > +static int > +setup_memory_env(struct test_configure *cfg, struct rte_mbuf ***srcs, > + struct rte_mbuf ***dsts) > +{ > + unsigned int buf_size = cfg->buf_size.cur; > + unsigned int nr_sockets; > + uint32_t nr_buf = cfg->nr_buf; > + > + nr_sockets = rte_socket_count(); > + if (cfg->src_numa_node >= nr_sockets || > + cfg->dst_numa_node >= nr_sockets) { > + printf("Error: Source or destination numa exceeds the acture > numa nodes.\n"); > + return -1; > + } > + > + src_pool = rte_pktmbuf_pool_create("Benchmark_DMA_SRC", > + nr_buf, > + 0, > + 0, > + buf_size + RTE_PKTMBUF_HEADROOM, > + cfg->src_numa_node); > + if (src_pool == NULL) { > + PRINT_ERR("Error with source mempool creation.\n"); > + return -1; > + } > + > + dst_pool = rte_pktmbuf_pool_create("Benchmark_DMA_DST", > + nr_buf, > + 0, > + 0, > + buf_size + RTE_PKTMBUF_HEADROOM, > + cfg->dst_numa_node); > + if (dst_pool == NULL) { > + PRINT_ERR("Error with destination mempool creation.\n"); > + return -1; > + } > + > + *srcs = rte_malloc(NULL, nr_buf * sizeof(struct rte_mbuf *), 0); > + if (*srcs == NULL) { > + printf("Error: srcs malloc failed.\n"); > + return -1; > + } > + > + *dsts = rte_malloc(NULL, nr_buf * sizeof(struct rte_mbuf *), 0); > + if (*dsts == NULL) { > + printf("Error: dsts malloc failed.\n"); > + return -1; > + } > + > + if (rte_mempool_get_bulk(src_pool, (void **)*srcs, nr_buf) != 0) { [Anoob] Might be better to use 'rte_pktmbuf_alloc_bulk' since we use ' rte_mbuf_data_iova' in the datapath and it is desirable to initialize it properly as an mbuf. Same comment below as well. <snip> > + > + for (i = 0; i < nb_workers; i++) { > + calc_result(buf_size, nr_buf, nb_workers, test_secs, > + lcores_p[i].v_ptr->worker_info.test_cpl, > + &memory, &avg_cycles, &bandwidth, &mops); > + output_result(cfg->scenario_id, lcores_p[i].v_ptr->lcore_id, > + lcores_p[i].v_ptr->dma_name, > avg_cycles, buf_size, > + nr_buf / nb_workers, memory, > bandwidth, mops, is_dma); > + } [Anoob] Can you also print total_bandwidth & total_mops? It can be a simple aggregation in the above loop. Would help when we are dealing with larger number of queues but single hardware block. > + > +out: > + /* free mbufs used in the test */ > + if (srcs) [Anoob] DPDK coding guidelines recommend a usage like below, (if (srcs != NULL) > + rte_pktmbuf_free_bulk(srcs, nr_buf); > + if (dsts) > + rte_pktmbuf_free_bulk(dsts, nr_buf); > + > + /* free the points for the mbufs */ > + rte_free(srcs); > + srcs = NULL; > + rte_free(dsts); > + dsts = NULL; > + > + if (src_pool) { > + rte_mempool_free(src_pool); > + src_pool = NULL; > + } > + if (dst_pool) { > + rte_mempool_free(dst_pool); > + src_pool = NULL; [Anoob] Should be dst_pool, right? <snip> > diff --git a/app/test-dma-perf/main.h b/app/test-dma-perf/main.h new file > mode 100644 index 0000000000..215ac42673 > --- /dev/null > +++ b/app/test-dma-perf/main.h > @@ -0,0 +1,69 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2023 Intel Corporation > + */ > + > +#ifndef _MAIN_H_ > +#define _MAIN_H_ > + > + > +#include <rte_common.h> > +#include <rte_cycles.h> > +#include <rte_dev.h> > +#include <rte_dmadev.h> [Anoob] Is the above include (rte_dmadev.h) required? > + > +#ifndef __maybe_unused > +#define __maybe_unused __rte_unused > +#endif [Anoob] Can you try to avoid this and use rte_unused or RTE_SET_USED in the cache_flush_buf() function?