Hi Anoob,

Replies are inline.
Thanks,
Cheng

> -----Original Message-----
> From: Anoob Joseph <ano...@marvell.com>
> Sent: Friday, June 23, 2023 2:53 PM
> To: Jiang, Cheng1 <cheng1.ji...@intel.com>; tho...@monjalon.net;
> Richardson, Bruce <bruce.richard...@intel.com>; m...@smartsharesystems.com;
> Xia, Chenbo <chenbo....@intel.com>; Amit Prakash Shukla
> <amitpraka...@marvell.com>; huangdeng...@huawei.com; Laatz, Kevin
> <kevin.la...@intel.com>; fengcheng...@huawei.com; Jerin Jacob Kollanukkaran
> <jer...@marvell.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Ding, Xuan
> <xuan.d...@intel.com>; Ma, WenwuX <wenwux...@intel.com>; Wang, YuanX
> <yuanx.w...@intel.com>; He, Xingguang <xingguang...@intel.com>; Ling, WeiX
> <weix.l...@intel.com>
> Subject: RE: [EXT] [PATCH v8] app/dma-perf: introduce dma-perf application
>
> Hi Cheng,
>
> Thanks for the new version. Please see inline.
>
> Thanks,
> Anoob
>
> > -----Original Message-----
> > From: Cheng Jiang <cheng1.ji...@intel.com>
> > Sent: Tuesday, June 20, 2023 12:24 PM
> > To: tho...@monjalon.net; bruce.richard...@intel.com;
> > m...@smartsharesystems.com; chenbo....@intel.com; Amit Prakash Shukla
> > <amitpraka...@marvell.com>; Anoob Joseph <ano...@marvell.com>;
> > huangdeng...@huawei.com; kevin.la...@intel.com; fengcheng...@huawei.com;
> > Jerin Jacob Kollanukkaran <jer...@marvell.com>
> > Cc: dev@dpdk.org; jiayu...@intel.com; xuan.d...@intel.com;
> > wenwux...@intel.com; yuanx.w...@intel.com; xingguang...@intel.com;
> > weix.l...@intel.com; Cheng Jiang <cheng1.ji...@intel.com>
> > Subject: [EXT] [PATCH v8] app/dma-perf: introduce dma-perf application
> >
> > External Email
> >
> > ----------------------------------------------------------------------
> > There are many high-performance DMA devices supported in DPDK now, and
> > these DMA devices can also be integrated into other modules of DPDK as
> > accelerators, such as Vhost. Before integrating DMA into applications,
> > developers need to know the performance of these DMA devices in various
> > scenarios and the performance of CPUs in the same scenario, such as
> > different buffer lengths. Only in this way can we know the target
> > performance of the application accelerated by using them. This patch
> > introduces a high-performance testing tool, which supports comparing the
> > performance of CPU and DMA in different scenarios automatically with a
> > pre-set config file. Memory copy performance tests are supported for now.
> >
> > Signed-off-by: Cheng Jiang <cheng1.ji...@intel.com>
> > Signed-off-by: Jiayu Hu <jiayu...@intel.com>
> > Signed-off-by: Yuan Wang <yuanx.w...@intel.com>
> > Acked-by: Morten Brørup <m...@smartsharesystems.com>
> > Acked-by: Chenbo Xia <chenbo....@intel.com>
> > ---
> > v8:
> > fixed string copy issue in parse_lcore();
> > improved some data display format;
> > added doc in doc/guides/tools;
> > updated release notes;
> >
> > v7:
> > fixed some strcpy issues;
> > removed cache setup in calling rte_pktmbuf_pool_create();
> > fixed some typos;
> > added some memory free and null set operations;
> > improved result calculation;
> > v6:
> > improved code based on Anoob's comments;
> > fixed some code structure issues;
> > v5:
> > fixed some LONG_LINE warnings;
> > v4:
> > fixed inaccuracy of the memory footprint display;
> > v3:
> > fixed some typos;
> > v2:
> > added lcore/dmadev designation;
> > added error case process;
> > removed worker_threads parameter from config.ini;
> > improved the logs;
> > improved config file;
> >
> >  app/meson.build                        |   1 +
> >  app/test-dma-perf/benchmark.c          | 498 +++++++++++++++++++++
> >  app/test-dma-perf/config.ini           |  61 +++
> >  app/test-dma-perf/main.c               | 594 +++++++++++++++++++++++++
> >  app/test-dma-perf/main.h               |  69 +++
> >  app/test-dma-perf/meson.build          |  17 +
> >  doc/guides/rel_notes/release_23_07.rst |   6 +
> >  doc/guides/tools/dmaperf.rst           | 103 +++++
> >  doc/guides/tools/index.rst             |   1 +
> >  9 files changed, 1350 insertions(+)
> >  create mode 100644 app/test-dma-perf/benchmark.c
> >  create mode 100644 app/test-dma-perf/config.ini
> >  create mode 100644 app/test-dma-perf/main.c
> >  create mode 100644 app/test-dma-perf/main.h
> >  create mode 100644 app/test-dma-perf/meson.build
> >  create mode 100644 doc/guides/tools/dmaperf.rst
> >
> > <snip>
> >
> > +/* Configuration of device. */
> > +static void
> > +configure_dmadev_queue(uint32_t dev_id, uint32_t ring_size)
> > +{
> > +	uint16_t vchan = 0;
> > +	struct rte_dma_info info;
> > +	struct rte_dma_conf dev_config = { .nb_vchans = 1 };
> > +	struct rte_dma_vchan_conf qconf = {
> > +		.direction = RTE_DMA_DIR_MEM_TO_MEM,
> > +		.nb_desc = ring_size
> > +	};
> > +
> > +	if (rte_dma_configure(dev_id, &dev_config) != 0)
> > +		rte_exit(EXIT_FAILURE, "Error with dma configure.\n");
> > +
> > +	if (rte_dma_vchan_setup(dev_id, vchan, &qconf) != 0)
> > +		rte_exit(EXIT_FAILURE, "Error with queue configuration.\n");
> > +
> > +	rte_dma_info_get(dev_id, &info);
>
> [Anoob] This API can return errors. Better to add handling.

[Cheng] Sure, I'll fix it in the next version.

> > +	if (info.nb_vchans != 1)
> > +		rte_exit(EXIT_FAILURE, "Error, no configured queues reported on device id. %u\n",
> > +				dev_id);
> > +
> > +	if (rte_dma_start(dev_id) != 0)
> > +		rte_exit(EXIT_FAILURE, "Error with dma start.\n");
> > +}
> > +
> > +static int
> > +config_dmadevs(struct test_configure *cfg)
> > +{
> > +	uint32_t ring_size = cfg->ring_size.cur;
> > +	struct lcore_dma_map_t *ldm = &cfg->lcore_dma_map;
> > +	uint32_t nb_workers = ldm->cnt;
> > +	uint32_t i;
> > +	int dev_id;
> > +	uint16_t nb_dmadevs = 0;
> > +	char *dma_name;
> > +
> > +	for (i = 0; i < ldm->cnt; i++) {
> > +		dma_name = ldm->dma_names[i];
> > +		dev_id = rte_dma_get_dev_id_by_name(dma_name);
> > +		if (dev_id == -1) {
>
> [Anoob] Can you check the above API definition? I think it returns not just
> -1 in case of errors.

[Cheng] Yes, you are right, I'll fix it in the next version. Thanks a lot.
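To be concrete, the two checks would end up something along these lines
(untested sketch; the exact error messages may differ in the next version):

	if (rte_dma_info_get(dev_id, &info) != 0)
		rte_exit(EXIT_FAILURE, "Error with getting device info.\n");

and

	dev_id = rte_dma_get_dev_id_by_name(dma_name);
	if (dev_id < 0) {
		fprintf(stderr, "Error: Fail to find DMA %s.\n", dma_name);
		goto end;
	}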
> > +			fprintf(stderr, "Error: Fail to find DMA %s.\n", dma_name);
> > +			goto end;
> > +		}
> > +
> > +		ldm->dma_ids[i] = dev_id;
> > +		configure_dmadev_queue(dev_id, ring_size);
> > +		++nb_dmadevs;
> > +	}
> > +
> > +end:
> > +	if (nb_dmadevs < nb_workers) {
> > +		printf("Not enough dmadevs (%u) for all workers (%u).\n",
> > +				nb_dmadevs, nb_workers);
> > +		return -1;
> > +	}
> > +
> > +	printf("Number of used dmadevs: %u.\n", nb_dmadevs);
> > +
> > +	return 0;
> > +}
> > +
> > +static inline void
> > +do_dma_submit_and_poll(uint16_t dev_id, uint64_t *async_cnt,
> > +			volatile struct worker_info *worker_info)
> > +{
> > +	int ret;
> > +	uint16_t nr_cpl;
> > +
> > +	ret = rte_dma_submit(dev_id, 0);
> > +	if (ret < 0) {
> > +		rte_dma_stop(dev_id);
> > +		rte_dma_close(dev_id);
> > +		rte_exit(EXIT_FAILURE, "Error with dma submit.\n");
> > +	}
> > +
> > +	nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +	*async_cnt -= nr_cpl;
> > +	worker_info->total_cpl += nr_cpl;
> > +}
> > +
> > +static inline int
> > +do_dma_mem_copy(void *p)
>
> [Anoob] Just curious, why not pass struct lcore_params *para itself? Is it
> because the pointer is volatile? If yes, then we can take an AI to split the
> struct into volatile and non-volatile parts.

[Cheng] I did it this way because this function is launched on another core
with rte_eal_remote_launch(), which takes a void * as its argument; that is
why the parameter is void *p. Your suggestion to split the struct into
volatile and non-volatile parts is quite reasonable, and I am thinking about
the best way to implement it. Thanks.

> > +{
> > +	const uint16_t *para_idx = (uint16_t *)p;
> > +	volatile struct lcore_params *para = lcores_p[*para_idx].v_ptr;
> > +	volatile struct worker_info *worker_info = &(para->worker_info);
> > +	const uint16_t dev_id = para->dev_id;
> > +	const uint32_t nr_buf = para->nr_buf;
> > +	const uint16_t kick_batch = para->kick_batch;
> > +	const uint32_t buf_size = para->buf_size;
> > +	struct rte_mbuf **srcs = para->srcs;
> > +	struct rte_mbuf **dsts = para->dsts;
> > +	uint16_t nr_cpl;
> > +	uint64_t async_cnt = 0;
> > +	uint32_t i;
> > +	uint32_t poll_cnt = 0;
> > +	int ret;
> > +
> > +	worker_info->stop_flag = false;
> > +	worker_info->ready_flag = true;
> > +
> > +	while (!worker_info->start_flag)
> > +		;
> > +
> > +	while (1) {
> > +		for (i = 0; i < nr_buf; i++) {
> > +dma_copy:
> > +			ret = rte_dma_copy(dev_id, 0, rte_pktmbuf_iova(srcs[i]),
> > +					rte_pktmbuf_iova(dsts[i]), buf_size, 0);
>
> [Anoob] Do we need to use 'rte_mbuf_data_iova' here instead of
> 'rte_pktmbuf_iova'?

[Cheng] Yes, rte_mbuf_data_iova is more appropriate. I'll fix it in the next
version. Thanks.

> > +			if (unlikely(ret < 0)) {
> > +				if (ret == -ENOSPC) {
> > +					do_dma_submit_and_poll(dev_id, &async_cnt, worker_info);
> > +					goto dma_copy;
> > +				} else {
> > +					/* Error exit */
> > +					rte_dma_stop(dev_id);
>
> [Anoob] Missing rte_dma_close() here. Also, may be introduce a static void
> function so that rte_exit call etc won't be part of the fastpath loop.
>
> May be something like below and you can call it from here and
> "do_dma_submit_and_poll".
>
> static void
> error_exit(int dev_id)
> {
> 	/* Error exit */
> 	rte_dma_stop(dev_id);
> 	rte_dma_close(dev_id);
> 	rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> }

[Cheng] I'm not so sure here. rte_dma_close() is called in rte_exit(). Do we
still need to call it explicitly before rte_exit()?
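Either way, with the rte_mbuf_data_iova() change agreed above, the enqueue
path would look roughly like this (untested sketch; error_exit() is the
helper you sketched, and whether it should really call rte_dma_close() is
the open question above):

	ret = rte_dma_copy(dev_id, 0, rte_mbuf_data_iova(srcs[i]),
			rte_mbuf_data_iova(dsts[i]), buf_size, 0);
	if (unlikely(ret < 0)) {
		if (ret == -ENOSPC) {
			do_dma_submit_and_poll(dev_id, &async_cnt, worker_info);
			goto dma_copy;
		}
		/* error path moved out of the fastpath loop */
		error_exit(dev_id);
	}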
> > +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> > +				}
> > +			}
> > +			async_cnt++;
> > +
> > +			if ((async_cnt % kick_batch) == 0)
> > +				do_dma_submit_and_poll(dev_id, &async_cnt, worker_info);
> > +		}
> > +
> > +		if (worker_info->stop_flag)
> > +			break;
> > +	}
> > +
> > +	rte_dma_submit(dev_id, 0);
> > +	while ((async_cnt > 0) && (poll_cnt++ < POLL_MAX)) {
> > +		nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +		async_cnt -= nr_cpl;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > <snip>
> >
> > +static int
> > +setup_memory_env(struct test_configure *cfg, struct rte_mbuf ***srcs,
> > +			struct rte_mbuf ***dsts)
> > +{
> > +	unsigned int buf_size = cfg->buf_size.cur;
> > +	unsigned int nr_sockets;
> > +	uint32_t nr_buf = cfg->nr_buf;
> > +
> > +	nr_sockets = rte_socket_count();
> > +	if (cfg->src_numa_node >= nr_sockets ||
> > +		cfg->dst_numa_node >= nr_sockets) {
> > +		printf("Error: Source or destination numa exceeds the acture numa nodes.\n");
> > +		return -1;
> > +	}
> > +
> > +	src_pool = rte_pktmbuf_pool_create("Benchmark_DMA_SRC",
> > +			nr_buf,
> > +			0,
> > +			0,
> > +			buf_size + RTE_PKTMBUF_HEADROOM,
> > +			cfg->src_numa_node);
> > +	if (src_pool == NULL) {
> > +		PRINT_ERR("Error with source mempool creation.\n");
> > +		return -1;
> > +	}
> > +
> > +	dst_pool = rte_pktmbuf_pool_create("Benchmark_DMA_DST",
> > +			nr_buf,
> > +			0,
> > +			0,
> > +			buf_size + RTE_PKTMBUF_HEADROOM,
> > +			cfg->dst_numa_node);
> > +	if (dst_pool == NULL) {
> > +		PRINT_ERR("Error with destination mempool creation.\n");
> > +		return -1;
> > +	}
> > +
> > +	*srcs = rte_malloc(NULL, nr_buf * sizeof(struct rte_mbuf *), 0);
> > +	if (*srcs == NULL) {
> > +		printf("Error: srcs malloc failed.\n");
> > +		return -1;
> > +	}
> > +
> > +	*dsts = rte_malloc(NULL, nr_buf * sizeof(struct rte_mbuf *), 0);
> > +	if (*dsts == NULL) {
> > +		printf("Error: dsts malloc failed.\n");
> > +		return -1;
> > +	}
> > +
> > +	if (rte_mempool_get_bulk(src_pool, (void **)*srcs, nr_buf) != 0) {
>
> [Anoob] Might be better to use 'rte_pktmbuf_alloc_bulk' since we use
> 'rte_mbuf_data_iova' in the datapath and it is desirable to initialize it
> properly as an mbuf. Same comment below as well.

[Cheng] Sure, I'll fix it in the next version (see the sketch further down).

> <snip>
>
> > +
> > +	for (i = 0; i < nb_workers; i++) {
> > +		calc_result(buf_size, nr_buf, nb_workers, test_secs,
> > +				lcores_p[i].v_ptr->worker_info.test_cpl,
> > +				&memory, &avg_cycles, &bandwidth, &mops);
> > +		output_result(cfg->scenario_id, lcores_p[i].v_ptr->lcore_id,
> > +				lcores_p[i].v_ptr->dma_name, avg_cycles, buf_size,
> > +				nr_buf / nb_workers, memory, bandwidth, mops, is_dma);
> > +	}
>
> [Anoob] Can you also print total_bandwidth & total_mops? It can be a simple
> aggregation in the above loop. Would help when we are dealing with larger
> number of queues but single hardware block.

[Cheng] Sure, good point. I'll add it in the next version, thanks.

> > +
> > +out:
> > +	/* free mbufs used in the test */
> > +	if (srcs)
>
> [Anoob] DPDK coding guidelines recommend a usage like below,
> if (srcs != NULL)

[Cheng] Sure, thanks. I'll fix it in the next version.
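Coming back to the rte_pktmbuf_alloc_bulk() comment above, the change would
presumably be along these lines (untested sketch; the error messages are
placeholders):

	if (rte_pktmbuf_alloc_bulk(src_pool, *srcs, nr_buf) != 0) {
		printf("Error: alloc src mbufs failed.\n");
		return -1;
	}

	if (rte_pktmbuf_alloc_bulk(dst_pool, *dsts, nr_buf) != 0) {
		printf("Error: alloc dst mbufs failed.\n");
		return -1;
	}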
> > +		rte_pktmbuf_free_bulk(srcs, nr_buf);
> > +	if (dsts)
> > +		rte_pktmbuf_free_bulk(dsts, nr_buf);
> > +
> > +	/* free the points for the mbufs */
> > +	rte_free(srcs);
> > +	srcs = NULL;
> > +	rte_free(dsts);
> > +	dsts = NULL;
> > +
> > +	if (src_pool) {
> > +		rte_mempool_free(src_pool);
> > +		src_pool = NULL;
> > +	}
> > +	if (dst_pool) {
> > +		rte_mempool_free(dst_pool);
> > +		src_pool = NULL;
>
> [Anoob] Should be dst_pool, right?

[Cheng] Yes, sorry for missing that. I'll fix it in the next version.

> > <snip>
> >
> > diff --git a/app/test-dma-perf/main.h b/app/test-dma-perf/main.h
> > new file mode 100644
> > index 0000000000..215ac42673
> > --- /dev/null
> > +++ b/app/test-dma-perf/main.h
> > @@ -0,0 +1,69 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2023 Intel Corporation
> > + */
> > +
> > +#ifndef _MAIN_H_
> > +#define _MAIN_H_
> > +
> > +
> > +#include <rte_common.h>
> > +#include <rte_cycles.h>
> > +#include <rte_dev.h>
> > +#include <rte_dmadev.h>
>
> [Anoob] Is the above include (rte_dmadev.h) required?

[Cheng] You are right, it's not required. I'll remove it in the next version.

> > +
> > +#ifndef __maybe_unused
> > +#define __maybe_unused __rte_unused
> > +#endif
>
> [Anoob] Can you try to avoid this and use __rte_unused or RTE_SET_USED in
> the cache_flush_buf() function?

[Cheng] Sure, I'll fix it in the next version.
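For cache_flush_buf(), dropping the local __maybe_unused macro and marking
the parameters directly could look something like this (rough sketch only;
the parameter list here is assumed for illustration, not copied from the
patch):

static void
cache_flush_buf(struct rte_mbuf **array __rte_unused,
		uint32_t buf_size __rte_unused,
		uint32_t nr_buf __rte_unused)
{
#ifdef RTE_ARCH_X86_64
	/* the existing x86 cache-flush loop stays here and actually
	 * uses all three parameters */
#endif
}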