Hi Dharmik,
> >
> >>>
> >>>> Current mempool per core cache implementation stores pointers to mbufs
> >>>> On 64b architectures, each pointer consumes 8B
> >>>> This patch replaces it with an index-based implementation,
> >>>> wherein each buffer is addressed by (pool base address + index)
> >>>> It reduces the amount of memory/cache required for per core cache
> >>>>
> >>>> L3Fwd performance testing reveals minor improvements in the cache
> >>>> performance (L1 and L2 misses reduced by 0.60%)
> >>>> with no change in throughput
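
As a rough illustration (the names pool_base, obj_to_index and index_to_obj
below are just for illustration, not the actual patch API), the
pointer-vs-index conversion described above boils down to something like
the following, assuming all pool memory lives within a single 4GB region:

#include <stdint.h>

static void *pool_base;	/* hypothetical: base VA of the pool's memory */

/* 8B pointer -> 4B index (byte offset from the pool base) */
static inline uint32_t
obj_to_index(const void *obj)
{
	return (uint32_t)((uintptr_t)obj - (uintptr_t)pool_base);
}

/* 4B index -> 8B pointer */
static inline void *
index_to_obj(uint32_t idx)
{
	return (void *)((uintptr_t)pool_base + idx);
}

Storing such 4B indexes instead of 8B pointers in the per-core cache is
what halves its footprint; the uint32_t offset is also why the whole pool
has to stay within one 4GB boundary.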
> >>>
> >>> I feel really sceptical about that patch and the whole idea in general:
> >>> - From what I read above there is no real performance improvement 
> >>> observed.
> >>> (In fact on my IA boxes mempool_perf_autotest reports ~20% slowdown,
> >>> see below for more details).
> >>
> >> Currently, the optimizations (loop unroll and vectorization) are only 
> >> implemented for ARM64.
> >> Similar optimizations can be implemented for x86 platforms which should 
> >> close the performance gap
> >> and in my understanding should give better performance for a bulk size of 
> >> 32.
> >
> > Might be, but I still don't see the reason for such effort.
> > As you mentioned there is no performance improvement in 'real' apps: l3fwd, 
> > etc.
> > on ARM64, even with the vectorized version of the code.
> >
> 
> IMO, even without a performance improvement, it is advantageous because 
> the patch achieves the same performance with less memory and cache 
> utilization.
> 
> >>> - Space utilization difference looks neglectable too.
> >>
> >> Sorry, I did not understand this point.
> >
> > As I understand one of the expectations from that patch was:
> > reduce memory/cache required, which should improve cache utilization
> > (less misses, etc.).
> > Though I think such improvements would be neglectable and wouldn't
> > cause any real performance gain.
> 
> The cache utilization performance numbers are for the l3fwd app, which might 
> not be bottlenecked at the mempool per core cache.
> Theoretically, this patch enables storing twice the number of objects in the 
> cache as compared to the original implementation.

It saves you just 4 bytes per mbuf.
Even for a simple l2fwd-like workload we access ~100 bytes per mbuf.
Let's do a simplistic estimation of the number of affected cache-lines for l2fwd.
For a bulk of 32 packets, assuming 64B per cache-line and 16B per HW desc:

                                          number of cache-lines accessed
                                          cache with pointers / cache with indexes
mempool_get:                              (32*8)/64=4     /   (32*4)/64=2
RX (read HW desc):                        (32*16)/64=8    /   (32*16)/64=8
RX (write mbuf fields, 1st cache line):   (32*64)/64=32   /   (32*64)/64=32
update mac addrs:                         (32*64)/64=32   /   (32*64)/64=32
TX (write HW desc):                       (32*16)/64=8    /   (32*16)/64=8
free mbufs (read 2nd mbuf cache line):    (32*64)/64=32   /   (32*64)/64=32
mempool_put:                              (32*8)/64=4     /   (32*4)/64=2
total:                                    120             /   116

So, if my calculations are correct, the maximum estimated gain in cache
utilization would be:
(120-116)*100/120 = 3.33%
Note that these numbers are for an over-simplistic usage scenario.
In more realistic ones, when we have to touch more cache-lines per packet,
that difference would be even less noticeable.
So I really doubt we will see any noticeable improvement in cache utilization
with that patch.
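
As a sanity check, the estimate above can be reproduced with a few lines of C
(the constants are just the assumptions stated earlier: 32-packet burst,
64B cache line, 16B HW descriptor, 8B pointer vs 4B index):

#include <stdio.h>

#define BURST	32	/* packets per bulk */
#define CL	64	/* cache-line size, bytes */
#define DESC_SZ	16	/* HW descriptor size, bytes */
#define PTR_SZ	8	/* per-core cache entry: pointer */
#define IDX_SZ	4	/* per-core cache entry: index */

int main(void)
{
	unsigned int desc = BURST * DESC_SZ / CL;	/* 8: RX/TX descriptor access */
	unsigned int mbuf = BURST;			/* 32: one full cache line per mbuf */

	/* mempool_get + RX desc + write mbuf + mac addrs + TX desc + free + mempool_put */
	unsigned int with_ptr = BURST * PTR_SZ / CL + desc + mbuf + mbuf +
			desc + mbuf + BURST * PTR_SZ / CL;
	unsigned int with_idx = BURST * IDX_SZ / CL + desc + mbuf + mbuf +
			desc + mbuf + BURST * IDX_SZ / CL;

	printf("pointers: %u, indexes: %u, max gain: %.2f%%\n",
		with_ptr, with_idx,
		(with_ptr - with_idx) * 100.0 / with_ptr);	/* 120, 116, 3.33% */
	return 0;
}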

> >
> >>> - The change introduces a new build time config option with a major 
> >>> limitation:
> >>>  All memzones in a pool have to be within the same 4GB boundary.
> >>>  To address it properly, extra changes will be required in 
> >>> init(/populate) part of the code.
> >>
> >> I agree to the above mentioned challenges and I am currently working on 
> >> resolving these issues.
> >
> > I still think that to justify such changes some really noticeable 
> > performance
> > improvement needs to be demonstrated: double-digit speedup for 
> > l3fwd/ipsec-secgw/...
> > Otherwise it's just not worth the hassle.
> >
> 
> Like I mentioned earlier, the app might not be bottlenecked at the mempool 
> per core cache.
> That could be the reason the numbers with l3fwd don’t fully show the 
> advantage of the patch.

As I said above, I don’t think we'll see any real advantage here.
But feel free to pick a different app and prove me wrong.
After all, we have plenty of sample apps that do put enough
pressure on the cache: l3fwd-acl, ipsec-secgw.
Or you can even apply these patches from Sean:
https://patches.dpdk.org/project/dpdk/list/?series=20999
to run l3fwd with configurable routes.
That should help you to make it cache-bound.

> I’m seeing double-digit improvement with mempool_perf_autotest which should 
> not be ignored.

And for others we are seeing a double-digit degradation.
So far the whole idea doesn't look promising at all, at least to me.
Konstantin
