Hi Dharmik,

>
>
> >>>
> >>>> Current mempool per core cache implementation stores pointers to mbufs.
> >>>> On 64b architectures, each pointer consumes 8B.
> >>>> This patch replaces it with an index-based implementation,
> >>>> wherein each buffer is addressed by (pool base address + index).
> >>>> It reduces the amount of memory/cache required for the per core cache.
> >>>>
> >>>> L3Fwd performance testing reveals minor improvements in the cache
> >>>> performance (L1 and L2 misses reduced by 0.60%)
> >>>> with no change in throughput.
> >>>
> >>> I feel really sceptical about that patch and the whole idea in general:
> >>> - From what I read above there is no real performance improvement
> >>> observed.
> >>> (In fact on my IA boxes mempool_perf_autotest reports ~20% slowdown,
> >>> see below for more details).
> >>
> >> Currently, the optimizations (loop unroll and vectorization) are only
> >> implemented for ARM64.
> >> Similar optimizations can be implemented for x86 platforms, which should
> >> close the performance gap,
> >> and in my understanding should give better performance for a bulk size of
> >> 32.
> >
> > Might be, but I still don't see the reason for such effort.
> > As you mentioned, there is no performance improvement in 'real' apps: l3fwd,
> > etc.
> > on ARM64, even with the vectorized version of the code.
> >
>
> IMO, even without a performance improvement, it is advantageous because the
> same performance is being achieved
> with less memory and cache utilization using the patch.
>
> >>> - Space utilization difference looks negligible too.
> >>
> >> Sorry, I did not understand this point.
> >
> > As I understand, one of the expectations from that patch was:
> > reduce memory/cache required, which should improve cache utilization
> > (fewer misses, etc.).
> > Though I think such improvements would be negligible and wouldn't
> > cause any real performance gain.
>
> The cache utilization performance numbers are for the l3fwd app, which might
> not be bottlenecked at the mempool per core cache.
> Theoretically, this patch enables storing twice the number of objects in the
> cache as compared to the original implementation.
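Just to make sure we are talking about the same thing, the scheme as I
understand it boils down to something like the sketch below (hypothetical
helper names, not the actual patch code): the per-core cache stores a 4B
offset from a common base address instead of an 8B pointer, which is also
where the 4GB restriction discussed further down comes from.

#include <stdint.h>

/* Hypothetical sketch, not the actual patch code: the cache holds 4B
 * offsets relative to a common base. A 32-bit offset can only reach
 * objects within 4GB of the base, hence the "all memzones within the
 * same 4GB boundary" limitation. */
static inline uint32_t
obj_to_index(const void *base, const void *obj)
{
	return (uint32_t)((uintptr_t)obj - (uintptr_t)base);
}

static inline void *
index_to_obj(void *base, uint32_t idx)
{
	return (void *)((uintptr_t)base + idx);
}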
It saves you just 4 bytes per mbuf.
Even for a simple l2fwd-like workload we access ~100 bytes per mbuf.
Let's do a simplistic estimation of the number of affected cache-lines for l2fwd.
For a bulk of 32 packets, assuming 64B per cache-line and 16B per HW desc:

                                         number of cache-lines accessed
                                         cache with pointers / cache with indexes
mempool_get:                             (32*8)/64=4    /  (32*4)/64=2
RX (read HW desc):                       (32*16)/64=8   /  (32*16)/64=8
RX (write mbuf fields, 1st cache line):  (32*64)/64=32  /  (32*64)/64=32
update mac addrs:                        (32*64)/64=32  /  (32*64)/64=32
TX (write HW desc):                      (32*16)/64=8   /  (32*16)/64=8
free mbufs (read 2nd mbuf cache line):   (32*64)/64=32  /  (32*64)/64=32
mempool_put:                             (32*8)/64=4    /  (32*4)/64=2
total:                                   120               116

So, if my calculations are correct, the max estimated gain in cache
utilization would be: (120-116)*100/120=3.33%

Note that these numbers are for an over-simplistic usage scenario.
In more realistic ones, when we have to touch more cache-lines per packet,
the difference would be even less noticeable.
So I really doubt we will see any noticeable improvement in terms of cache
utilization with that patch.

>
>
> >>> - The change introduces a new build time config option with a major
> >>> limitation:
> >>> All memzones in a pool have to be within the same 4GB boundary.
> >>> To address it properly, extra changes will be required in the
> >>> init(/populate) part of the code.
> >>
> >> I agree with the above mentioned challenges, and I am currently working on
> >> resolving these issues.
> >
> > I still think that to justify such changes some really noticeable
> > performance
> > improvement needs to be demonstrated: double-digit speedup for
> > l3fwd/ipsec-secgw/...
> > Otherwise it's just not worth the hassle.
> >
>
> Like I mentioned earlier, the app might not be bottlenecked at the mempool
> per core cache.
> That could be the reason the numbers with l3fwd don’t fully show the
> advantage of the patch.

As I said above, I don’t think we'll see any real advantage here.
But feel free to pick a different app and prove me wrong.
After all, we have plenty of sample apps that do provide enough pressure on
the cache: l3fwd-acl, ipsec-secgw.
Or you can even apply these patches from Sean:
https://patches.dpdk.org/project/dpdk/list/?series=20999
to run l3fwd with configurable routes.
That should help you to make it cache-bound.

> I’m seeing double-digit improvement with mempool_perf_autotest, which should
> not be ignored.

And for others we are seeing double-digit degradation.
So far the whole idea doesn't look promising at all, at least to me.

Konstantin
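P.S. In case anyone wants to re-derive the table above, here is a trivial
self-contained sketch (purely illustrative; the per-step byte counts are the
same simplistic assumptions as in the estimation):

#include <stdio.h>

int main(void)
{
	const int bulk = 32, cl = 64;  /* packets per burst, cache-line size */
	/* bytes touched per object at each step:
	 * {step name, with 8B pointers, with 4B indexes} */
	const struct { const char *step; int ptr_b; int idx_b; } steps[] = {
		{ "mempool_get",                 8,  4  },
		{ "RX (read HW desc)",           16, 16 },
		{ "RX (write mbuf 1st line)",    64, 64 },
		{ "update mac addrs",            64, 64 },
		{ "TX (write HW desc)",          16, 16 },
		{ "free mbufs (2nd mbuf line)",  64, 64 },
		{ "mempool_put",                 8,  4  },
	};
	int tp = 0, ti = 0;
	for (unsigned i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		tp += bulk * steps[i].ptr_b / cl;
		ti += bulk * steps[i].idx_b / cl;
	}
	printf("pointers: %d lines, indexes: %d lines, gain: %.2f%%\n",
	       tp, ti, (tp - ti) * 100.0 / tp);  /* -> 120, 116, 3.33% */
	return 0;
}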