> From: Dharmik Thakkar [mailto:dharmik.thak...@arm.com]
> Sent: Friday, 24 December 2021 23.59
>
> The current mempool per-core cache implementation stores pointers to
> mbufs. On 64b architectures, each pointer consumes 8B. This patch
> replaces it with an index-based implementation, wherein each buffer is
> addressed by (pool base address + index). It reduces the amount of
> memory/cache required for the per-core cache.
>
> L3Fwd performance testing reveals minor improvements in cache
> performance (L1 and L2 misses reduced by 0.60%) with no change in
> throughput.
>
> Micro-benchmarking the patch using mempool_perf_test shows significant
> improvement in the majority of the test cases.
>
> Number of cores = 1:
> n_get_bulk=1  n_put_bulk=1  n_keep=32  %_change_with_patch=18.01
> n_get_bulk=1  n_put_bulk=1  n_keep=128 %_change_with_patch=19.91
> n_get_bulk=1  n_put_bulk=4  n_keep=32  %_change_with_patch=-20.37 (regression)
> n_get_bulk=1  n_put_bulk=4  n_keep=128 %_change_with_patch=-17.01 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=32  %_change_with_patch=-25.06 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=128 %_change_with_patch=-23.81 (regression)
> n_get_bulk=4  n_put_bulk=1  n_keep=32  %_change_with_patch=53.93
> n_get_bulk=4  n_put_bulk=1  n_keep=128 %_change_with_patch=60.90
> n_get_bulk=4  n_put_bulk=4  n_keep=32  %_change_with_patch=1.64
> n_get_bulk=4  n_put_bulk=4  n_keep=128 %_change_with_patch=8.76
> n_get_bulk=4  n_put_bulk=32 n_keep=32  %_change_with_patch=-4.71 (regression)
> n_get_bulk=4  n_put_bulk=32 n_keep=128 %_change_with_patch=-3.19 (regression)
> n_get_bulk=32 n_put_bulk=1  n_keep=32  %_change_with_patch=65.63
> n_get_bulk=32 n_put_bulk=1  n_keep=128 %_change_with_patch=75.19
> n_get_bulk=32 n_put_bulk=4  n_keep=32  %_change_with_patch=11.75
> n_get_bulk=32 n_put_bulk=4  n_keep=128 %_change_with_patch=15.52
> n_get_bulk=32 n_put_bulk=32 n_keep=32  %_change_with_patch=13.45
> n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.58
>
> Number of cores = 2:
> n_get_bulk=1  n_put_bulk=1  n_keep=32  %_change_with_patch=18.21
> n_get_bulk=1  n_put_bulk=1  n_keep=128 %_change_with_patch=21.89
> n_get_bulk=1  n_put_bulk=4  n_keep=32  %_change_with_patch=-21.21 (regression)
> n_get_bulk=1  n_put_bulk=4  n_keep=128 %_change_with_patch=-17.05 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=32  %_change_with_patch=-26.09 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=128 %_change_with_patch=-23.49 (regression)
> n_get_bulk=4  n_put_bulk=1  n_keep=32  %_change_with_patch=56.28
> n_get_bulk=4  n_put_bulk=1  n_keep=128 %_change_with_patch=67.69
> n_get_bulk=4  n_put_bulk=4  n_keep=32  %_change_with_patch=1.45
> n_get_bulk=4  n_put_bulk=4  n_keep=128 %_change_with_patch=8.84
> n_get_bulk=4  n_put_bulk=32 n_keep=32  %_change_with_patch=-5.27 (regression)
> n_get_bulk=4  n_put_bulk=32 n_keep=128 %_change_with_patch=-3.09 (regression)
> n_get_bulk=32 n_put_bulk=1  n_keep=32  %_change_with_patch=76.11
> n_get_bulk=32 n_put_bulk=1  n_keep=128 %_change_with_patch=86.06
> n_get_bulk=32 n_put_bulk=4  n_keep=32  %_change_with_patch=11.86
> n_get_bulk=32 n_put_bulk=4  n_keep=128 %_change_with_patch=16.55
> n_get_bulk=32 n_put_bulk=32 n_keep=32  %_change_with_patch=13.01
> n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.51
>
> From analyzing the results, it is clear that for n_get_bulk and
> n_put_bulk sizes of 32 there is no performance regression. IMO, the
> other sizes are not practical from a performance perspective, and the
> regression in those cases can be safely ignored.
>
> Dharmik Thakkar (1):
>   mempool: implement index-based per core cache
>
>  lib/mempool/rte_mempool.h             | 114 +++++++++++++++++++++++++-
>  lib/mempool/rte_mempool_ops_default.c |   7 ++
>  2 files changed, 119 insertions(+), 2 deletions(-)
>
> --
> 2.25.1
>
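For readers skimming the thread: as I understand the cover letter, the core
idea is to cache a 4-byte offset from the pool base instead of an 8-byte
pointer. A minimal sketch of that conversion (hypothetical helper names, not
the actual patch code) could look like this:

  #include <stdint.h>

  /* Hypothetical sketch of the index-based cache idea: cache 4-byte
   * offsets relative to the pool base instead of 8-byte pointers.
   * Not the actual patch code. */

  static inline uint32_t
  obj_to_index(void *pool_base, void *obj)
  {
          /* Byte offset from the pool base; fits in 32 bits only
           * while the pool spans no more than 4 GB. */
          return (uint32_t)((uintptr_t)obj - (uintptr_t)pool_base);
  }

  static inline void *
  index_to_obj(void *pool_base, uint32_t index)
  {
          return (void *)((uintptr_t)pool_base + index);
  }
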
I still think this is very interesting, and your performance numbers are
looking good.

However, it limits the size of a mempool to 4 GB. As previously discussed,
the max mempool size can be increased by multiplying the index by a constant.

I would suggest using sizeof(uintptr_t) as the constant multiplier, so the
mempool can hold objects of any size divisible by sizeof(uintptr_t); on
64-bit architectures that raises the limit from 4 GB to 2^32 * 8 B = 32 GB.
And it would be silly to use a mempool to hold objects smaller than
sizeof(uintptr_t).

How does the performance look if you multiply the index by sizeof(uintptr_t)?
(See the PS below for a sketch of what I mean.)

Med venlig hilsen / Kind regards,
-Morten Brørup
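
PS: To make the multiplier suggestion concrete, here is a rough sketch
(again with hypothetical helper names, assuming every object starts at a
sizeof(uintptr_t)-aligned offset from the pool base):

  #include <stdint.h>

  /* Hypothetical sketch of the scaled-index idea: store the offset
   * divided by sizeof(uintptr_t), so a 32-bit index covers
   * 2^32 * 8 B = 32 GB on 64-bit architectures. Assumes objects are
   * aligned to sizeof(uintptr_t). Not a patch, just an illustration. */

  static inline uint32_t
  obj_to_scaled_index(void *pool_base, void *obj)
  {
          return (uint32_t)(((uintptr_t)obj - (uintptr_t)pool_base)
                  / sizeof(uintptr_t));
  }

  static inline void *
  scaled_index_to_obj(void *pool_base, uint32_t index)
  {
          return (void *)((uintptr_t)pool_base
                  + (uintptr_t)index * sizeof(uintptr_t));
  }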