Hi,

Thank you for the comments!
Based on the suggestions, I tested the patch for single-core L3Fwd
performance with an increased number of routes/flows (up to 8K) to
increase the cache footprint. However, I don't see much improvement
with the patch.

> On Jan 21, 2022, at 5:25 AM, Ananyev, Konstantin
> <konstantin.anan...@intel.com> wrote:
>
> Hi Dharmik,
>
>>>>>> The current mempool per-core cache implementation stores pointers
>>>>>> to mbufs. On 64b architectures, each pointer consumes 8B.
>>>>>> This patch replaces it with an index-based implementation, wherein
>>>>>> each buffer is addressed by (pool base address + index).
>>>>>> It reduces the amount of memory/cache required for the per-core
>>>>>> cache.
>>>>>>
>>>>>> L3Fwd performance testing reveals minor improvements in cache
>>>>>> performance (L1 and L2 misses reduced by 0.60%), with no change
>>>>>> in throughput.
>>>>>
>>>>> I feel really sceptical about that patch and the whole idea in
>>>>> general:
>>>>> - From what I read above, there is no real performance improvement
>>>>>   observed.
>>>>>   (In fact, on my IA boxes mempool_perf_autotest reports ~20%
>>>>>   slowdown; see below for more details.)
>>>>
>>>> Currently, the optimizations (loop unroll and vectorization) are
>>>> only implemented for ARM64.
>>>> Similar optimizations can be implemented for x86 platforms, which
>>>> should close the performance gap and, in my understanding, should
>>>> give better performance for a bulk size of 32.
>>>
>>> Might be, but I still don't see the reason for such effort.
>>> As you mentioned, there is no performance improvement in 'real' apps
>>> (l3fwd, etc.) on ARM64, even with the vectorized version of the code.
>>
>> IMO, even without a performance improvement, it is advantageous
>> because the same performance is achieved with less memory and cache
>> utilization using the patch.
>>
>>>>> - The space utilization difference looks negligible too.
>>>>
>>>> Sorry, I did not understand this point.
>>>
>>> As I understand it, one of the expectations from that patch was to
>>> reduce the memory/cache required, which should improve cache
>>> utilization (fewer misses, etc.).
>>> Though I think such improvements would be negligible and wouldn't
>>> cause any real performance gain.
>>
>> The cache utilization performance numbers are for the l3fwd app, which
>> might not be bottlenecked at the mempool per-core cache.
>> Theoretically, this patch enables storing twice the number of objects
>> in the cache compared to the original implementation.
>
> It saves you just 4 bytes per mbuf.
> Even for a simple l2fwd-like workload we access ~100 bytes per mbuf.
> Let's do a simplistic estimation of the number of affected cache-lines
> for l2fwd.
> For a bulk of 32 packets, assuming 64B per cache-line and 16B per HW
> desc:
>
>                                          number of cache-lines accessed
>                                          cache with pointers / cache with indexes
> mempool_get:                             (32*8)/64=4   / (32*4)/64=2
> RX (read HW desc):                       (32*16)/64=8  / (32*16)/64=8
> RX (write mbuf fields, 1st cache line):  (32*64)/64=32 / (32*64)/64=32
> update mac addrs:                        (32*64)/64=32 / (32*64)/64=32
> TX (write HW desc):                      (32*16)/64=8  / (32*16)/64=8
> free mbufs (read 2nd mbuf cache line):   (32*64)/64=32 / (32*64)/64=32
> mempool_put:                             (32*8)/64=4   / (32*4)/64=2
> total:                                   120           / 116
>
> So, if my calculations are correct, the maximum estimated gain in
> cache utilization would be: (120-116)*100/120 = 3.33%.
> Note that these numbers are for an over-simplistic usage scenario.
> In more realistic ones, where we have to touch more cache-lines per
> packet, the difference would be even less noticeable.
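To make the numbers above concrete, here is a minimal sketch of the
index-based idea under discussion. It is illustrative only, not the
patch's actual code: the names (idx_cache, idx_cache_put,
idx_cache_get) are made up, and the per-core cache is reduced to a
bare array. The point is that each cached entry shrinks from an 8B
pointer to a 4B offset from the pool base:

#include <stdint.h>

struct idx_cache {
	void *base_addr;     /* pool base; all objects within 4GB of it */
	uint32_t len;        /* number of objects currently cached */
	uint32_t objs[512];  /* 4B offsets instead of 8B pointers */
};

/* Put: store only the object's 32-bit offset from the pool base. */
static inline void
idx_cache_put(struct idx_cache *c, void *obj)
{
	c->objs[c->len++] =
		(uint32_t)((uintptr_t)obj - (uintptr_t)c->base_addr);
}

/* Get: rebuild the full pointer as base + offset. */
static inline void *
idx_cache_get(struct idx_cache *c)
{
	return (void *)((uintptr_t)c->base_addr + c->objs[--c->len]);
}

With 4B entries, one 64B cache line holds 16 cached objects instead of
8, which is where the halved mempool_get/mempool_put rows in the table
above come from; it is also why every object must live within 4GB of
base_addr.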
> So I really doubt we will see any noticeable improvement in terms of
> cache utilization with that patch.
>
>>>>> - The change introduces a new build-time config option with a
>>>>>   major limitation:
>>>>>   all memzones in a pool have to be within the same 4GB boundary.
>>>>>   To address it properly, extra changes will be required in the
>>>>>   init(/populate) part of the code.
>>>>
>>>> I agree with the above-mentioned challenges, and I am currently
>>>> working on resolving these issues.
>>>
>>> I still think that to justify such changes, some really noticeable
>>> performance improvement needs to be demonstrated: a double-digit
>>> speedup for l3fwd/ipsec-secgw/...
>>> Otherwise it is just not worth the hassle.
>>
>> Like I mentioned earlier, the app might not be bottlenecked at the
>> mempool per-core cache.
>> That could be the reason the numbers with l3fwd don't fully show the
>> advantage of the patch.
>
> As I said above, I don't think we'll see any real advantage here.
> But feel free to pick up a different app and prove me wrong.
> After all, we have plenty of sample apps that do provide enough
> pressure on the cache: l3fwd-acl, ipsec-secgw.
> Or you can even apply these patches from Sean:
> https://patches.dpdk.org/project/dpdk/list/?series=20999
> to run l3fwd with configurable routes.
> That should help you to make it cache-bound.

Thank you, Konstantin! This patch was helpful.

>> I'm seeing a double-digit improvement with mempool_perf_autotest,
>> which should not be ignored.
>
> And for others we are seeing double-digit degradation.
> So far the whole idea doesn't look promising at all, at least to me.
>
> Konstantin
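On the 4GB limitation called out above: the extra init/populate-time
work is essentially a reachability check. The sketch below is an
assumption about what such a check could look like, not code from the
patch; the function name and its placement are hypothetical:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical populate-time check: a memory chunk may only be added
 * to the pool if its last byte is still reachable via a 32-bit offset
 * from the pool base. If it is not, the pool would have to fail the
 * populate call or fall back to the pointer-based cache.
 */
static bool
chunk_addressable_by_index(const void *base, const void *chunk, size_t len)
{
	uintptr_t b = (uintptr_t)base;
	uintptr_t c = (uintptr_t)chunk;

	return len != 0 && c >= b && (c - b) + (len - 1) <= UINT32_MAX;
}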