> From: Olivier Matz [mailto:olivier.m...@6wind.com]
> Sent: Monday, 24 January 2022 16.00
>
> From: Morten Brørup <m...@smartsharesystems.com>
>
> "What gets measured gets done."
>
> This patch adds mempool performance tests where the number of objects
> to put and get is constant at compile time, which may significantly
> improve the performance of these functions. [*]
>
> Also, it is ensured that the array holding the objects used for testing
> is cache line aligned, for maximum performance.
>
> And finally, the following entries are added to the list of tests:
> - Number of kept objects: 512
> - Number of objects to get and to put: the number of pointers fitting
>   into a cache line, i.e. 8 or 16
>
> [*] Some example performance test (with cache) results:
>
> get_bulk=4 put_bulk=4 keep=128 constant_n=false rate_persec=280480972
> get_bulk=4 put_bulk=4 keep=128 constant_n=true rate_persec=622159462
>
> get_bulk=8 put_bulk=8 keep=128 constant_n=false rate_persec=477967155
> get_bulk=8 put_bulk=8 keep=128 constant_n=true rate_persec=917582643
>
> get_bulk=32 put_bulk=32 keep=32 constant_n=false rate_persec=871248691
> get_bulk=32 put_bulk=32 keep=32 constant_n=true rate_persec=1134021836
>
> Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> Signed-off-by: Olivier Matz <olivier.m...@6wind.com>
> ---
>
> Hi Morten,
>
> Here is the updated patch.
>
> I launched the mempool_perf test on my desktop machine, but I don't
> reproduce the numbers: constant or non-constant give almost the same
> rate on my machine (it's even worse with constants). I tested with
> your initial patch and with this one. Can you please try this patch,
> and/or give some details about your test environment?
Test environment: VMware virtual machine running Ubuntu 20.04.3 LTS, with 4 CPUs and 8 GB RAM assigned. The physical CPU is a Xeon E5-2620 v4 with plenty of RAM. Although other VMs are running on the same server, it is not very oversubscribed.

Hugepages established with:

usertools/dpdk-hugepages.py -p 2M --setup 2G

Build steps:

meson -Dplatform=generic work
cd work
ninja

> Here is what I get:
>
> with your patch:
> mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=false rate_persec=152620236
> mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=true rate_persec=144716595
> mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=false rate_persec=306996838
> mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=true rate_persec=287375359
> mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8
> n_keep=128 constant_n=false rate_persec=977626723
> mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8
> n_keep=128 constant_n=true rate_persec=963103944

My test results were with an experimental, optimized version of the mempool library, which showed a larger difference. (This was the reason for updating the perf test: to measure the effects of optimizing the mempool library.)

However, testing the patch (version 1) with a brand new git checkout still shows a huge difference, e.g.:

mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=false rate_persec=501009612
mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=true rate_persec=799014912

You should also see a significant difference when testing. My rate_persec without constant n is 3 x yours (501 M vs. 156 M ops/s), so the baseline seems wrong! I don't think our server rig is so much faster than your desktop machine. Perhaps mempool debug, telemetry or other background noise is polluting your test.
>
> with this patch:
> mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=0 rate_persec=156460646
> mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=1 rate_persec=142173798
> mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=0 rate_persec=312410111
> mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128
> constant_n=1 rate_persec=281699942
> mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8
> n_keep=128 constant_n=0 rate_persec=983315247
> mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8
> n_keep=128 constant_n=1 rate_persec=950350638
>
> v2:
> - use a flag instead of a negative value to enable tests with
>   compile-time constant
> - use a static inline function instead of a macro
> - remove some "noise" (do not change variable type when not required)
>
> Thanks,
> Olivier
>
>  app/test/test_mempool_perf.c | 110 ++++++++++++++++++++++++-----------
>  1 file changed, 77 insertions(+), 33 deletions(-)
>
> diff --git a/app/test/test_mempool_perf.c b/app/test/test_mempool_perf.c
> index 87ad251367..ce7c6241ab 100644
> --- a/app/test/test_mempool_perf.c
> +++ b/app/test/test_mempool_perf.c
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>   */
>
>  #include <string.h>
> @@ -55,19 +56,24 @@
>   *
>   * - Bulk get from 1 to 32
>   * - Bulk put from 1 to 32
> + * - Bulk get and put from 1 to 32, compile time constant
>   *
>   * - Number of kept objects (*n_keep*)
>   *
>   * - 32
>   * - 128
> + * - 512
>   */
>
>  #define N 65536
>  #define TIME_S 5
>  #define MEMPOOL_ELT_SIZE 2048
> -#define MAX_KEEP 128
> +#define MAX_KEEP 512
>  #define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE))-1)
>
> +/* Number of pointers fitting into one cache line. */
> +#define CACHE_LINE_BURST (RTE_CACHE_LINE_SIZE / sizeof(uintptr_t))
> +
>  #define LOG_ERR() printf("test failed at %s():%d\n", __func__, __LINE__)
>  #define RET_ERR() do {						\
> 		LOG_ERR();					\
> @@ -91,6 +97,9 @@ static unsigned n_put_bulk;
>  /* number of objects retrieved from mempool before putting them back */
>  static unsigned n_keep;
>
> +/* true if we want to test with constant n_get_bulk and n_put_bulk */
> +static int use_constant_values;
> +
>  /* number of enqueues / dequeues */
>  struct mempool_test_stats {
>  	uint64_t enq_count;
> @@ -111,11 +120,43 @@ my_obj_init(struct rte_mempool *mp, __rte_unused void *arg,
>  	*objnum = i;
>  }
>
> +static __rte_always_inline int
> +test_loop(struct rte_mempool *mp, struct rte_mempool_cache *cache,
> +	  unsigned int x_keep, unsigned int x_get_bulk, unsigned int x_put_bulk)
> +{
> +	void *obj_table[MAX_KEEP] __rte_cache_aligned;
> +	unsigned int idx;
> +	unsigned int i;
> +	int ret;
> +
> +	for (i = 0; likely(i < (N / x_keep)); i++) {
> +		/* get x_keep objects by bulk of x_get_bulk */
> +		for (idx = 0; idx < x_keep; idx += x_get_bulk) {
> +			ret = rte_mempool_generic_get(mp,
> +						      &obj_table[idx],
> +						      x_get_bulk,
> +						      cache);
> +			if (unlikely(ret < 0)) {
> +				rte_mempool_dump(stdout, mp);
> +				return ret;
> +			}
> +		}
> +
> +		/* put the objects back by bulk of x_put_bulk */
> +		for (idx = 0; idx < x_keep; idx += x_put_bulk) {
> +			rte_mempool_generic_put(mp,
> +						&obj_table[idx],
> +						x_put_bulk,
> +						cache);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int
>  per_lcore_mempool_test(void *arg)
>  {
> -	void *obj_table[MAX_KEEP];
> -	unsigned i, idx;
>  	struct rte_mempool *mp = arg;
>  	unsigned lcore_id = rte_lcore_id();
>  	int ret = 0;
> @@ -139,6 +180,9 @@ per_lcore_mempool_test(void *arg)
>  		GOTO_ERR(ret, out);
>  	if (((n_keep / n_put_bulk) * n_put_bulk) != n_keep)
>  		GOTO_ERR(ret, out);
> +	/* for constant n, n_get_bulk and n_put_bulk must be the same */
> +	if (use_constant_values && n_put_bulk != n_get_bulk)
> +		GOTO_ERR(ret, out);
>
>  	stats[lcore_id].enq_count = 0;
>
> @@ -149,31 +193,23 @@ per_lcore_mempool_test(void *arg)
>  	start_cycles = rte_get_timer_cycles();
>
>  	while (time_diff/hz < TIME_S) {
> -		for (i = 0; likely(i < (N/n_keep)); i++) {
> -			/* get n_keep objects by bulk of n_bulk */
> -			idx = 0;
> -			while (idx < n_keep) {
> -				ret = rte_mempool_generic_get(mp,
> -							      &obj_table[idx],
> -							      n_get_bulk,
> -							      cache);
> -				if (unlikely(ret < 0)) {
> -					rte_mempool_dump(stdout, mp);
> -					/* in this case, objects are lost... */
> -					GOTO_ERR(ret, out);
> -				}
> -				idx += n_get_bulk;
> -			}
> +		if (!use_constant_values)
> +			ret = test_loop(mp, cache, n_keep, n_get_bulk, n_put_bulk);
> +		else if (n_get_bulk == 1)
> +			ret = test_loop(mp, cache, n_keep, 1, 1);
> +		else if (n_get_bulk == 4)
> +			ret = test_loop(mp, cache, n_keep, 4, 4);
> +		else if (n_get_bulk == CACHE_LINE_BURST)
> +			ret = test_loop(mp, cache, n_keep,
> +					CACHE_LINE_BURST, CACHE_LINE_BURST);
> +		else if (n_get_bulk == 32)
> +			ret = test_loop(mp, cache, n_keep, 32, 32);
> +		else
> +			ret = -1;
> +
> +		if (ret < 0)
> +			GOTO_ERR(ret, out);
>
> -			/* put the objects back */
> -			idx = 0;
> -			while (idx < n_keep) {
> -				rte_mempool_generic_put(mp, &obj_table[idx],
> -							n_put_bulk,
> -							cache);
> -				idx += n_put_bulk;
> -			}
> -		}
>  		end_cycles = rte_get_timer_cycles();
>  		time_diff = end_cycles - start_cycles;
>  		stats[lcore_id].enq_count += N;
> @@ -203,10 +239,10 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
>  	memset(stats, 0, sizeof(stats));
>
>  	printf("mempool_autotest cache=%u cores=%u n_get_bulk=%u "
> -	       "n_put_bulk=%u n_keep=%u ",
> +	       "n_put_bulk=%u n_keep=%u constant_n=%u ",
>  	       use_external_cache ?
>  	       external_cache_size : (unsigned) mp->cache_size,
> -	       cores, n_get_bulk, n_put_bulk, n_keep);
> +	       cores, n_get_bulk, n_put_bulk, n_keep, use_constant_values);
>
>  	if (rte_mempool_avail_count(mp) != MEMPOOL_SIZE) {
>  		printf("mempool is not full\n");
> @@ -253,9 +289,9 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
>  static int
>  do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
>  {
> -	unsigned bulk_tab_get[] = { 1, 4, 32, 0 };
> -	unsigned bulk_tab_put[] = { 1, 4, 32, 0 };
> -	unsigned keep_tab[] = { 32, 128, 0 };
> +	unsigned int bulk_tab_get[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
> +	unsigned int bulk_tab_put[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
> +	unsigned int keep_tab[] = { 32, 128, 512, 0 };
>  	unsigned *get_bulk_ptr;
>  	unsigned *put_bulk_ptr;
>  	unsigned *keep_ptr;
> @@ -265,13 +301,21 @@ do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
>  	for (put_bulk_ptr = bulk_tab_put; *put_bulk_ptr; put_bulk_ptr++) {
>  	for (keep_ptr = keep_tab; *keep_ptr; keep_ptr++) {
>
> +		use_constant_values = 0;
>  		n_get_bulk = *get_bulk_ptr;
>  		n_put_bulk = *put_bulk_ptr;
>  		n_keep = *keep_ptr;
>  		ret = launch_cores(mp, cores);
> -
>  		if (ret < 0)
>  			return -1;
> +
> +		/* replay test with constant values */
> +		if (n_get_bulk == n_put_bulk) {
> +			use_constant_values = 1;
> +			ret = launch_cores(mp, cores);
> +			if (ret < 0)
> +				return -1;
> +		}
>  	}
>  	}
>  	}
> -- 
> 2.30.2