Hi Dharmik,
> >
> >>>
> >>>> Current mempool per core cache implementation stores pointers to mbufs
> >>>> On 64b architectures, each pointer consumes 8B
> >>>> This patch replaces it with an index-based implementation,
> >>>> wherein each buffer is addressed by (pool base address + index)
> >>>> It reduces the amount of memory/cache required for per core cache
> >>>>
> >>>> L3Fwd performance testing reveals minor improvements in the cache
> >>>> performance (L1 and L2 misses reduced by 0.60%)
> >>>> with no change in throughput
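
As a rough illustration (the names pool_base, obj_to_index and index_to_obj
below are just for illustration, not the actual patch API), the
pointer-vs-index conversion described above boils down to something like
the following, assuming all pool memory lives within a single 4GB region:

#include <stdint.h>

static void *pool_base;	/* hypothetical: base VA of the pool's memory */

/* 8B pointer -> 4B index (byte offset from the pool base) */
static inline uint32_t
obj_to_index(const void *obj)
{
	return (uint32_t)((uintptr_t)obj - (uintptr_t)pool_base);
}

/* 4B index -> 8B pointer */
static inline void *
index_to_obj(uint32_t idx)
{
	return (void *)((uintptr_t)pool_base + idx);
}

Storing such 4B indexes instead of 8B pointers in the per-core cache is
what halves its footprint; the uint32_t offset is also why the whole pool
has to stay within one 4GB boundary.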
> >>>
> >>> I feel really sceptical about that patch and the whole idea in general:
> >>> - From what I read above there is no real performance improvement 
> >>> observed.
> >>> (In fact on my IA boxes mempool_perf_autotest reports ~20% slowdown,
> >>> see below for more details).
> >>
> >> Currently, the optimizations (loop unroll and vectorization) are only 
> >> implemented for ARM64.
> >> Similar optimizations can be implemented for x86 platforms which should 
> >> close the performance gap
> >> and in my understanding should give better performance for a bulk size of 
> >> 32.
> >
> > Might be, but I still don't see the reason for such effort.
> > As you mentioned there is no performance improvement in 'real' apps: l3fwd, 
> > etc.
> > on ARM64, even with the vectorized version of the code.
> >
> 
> IMO, even without a performance improvement, it is advantageous because 
> the patch achieves the same performance with less memory and cache 
> utilization.
> 
> >>> - Space utilization difference looks neglectable too.
> >>
> >> Sorry, I did not understand this point.
> >
> > As I understand one of the expectations from that patch was:
> > reduce memory/cache required, which should improve cache utilization
> > (less misses, etc.).
> > Though I think such improvements would be neglectable and wouldn't
> > cause any real performance gain.
> 
> The cache utilization performance numbers are for the l3fwd app, which might 
> not be bottlenecked at the mempool per core cache.
> Theoretically, this patch enables storing twice the number of objects in the 
> cache as compared to the original implementation.

It saves you just 4 bytes per mbuf.
Even for a simple l2fwd-like workload we access ~100 bytes per mbuf.
Let's do a simplistic estimation of the number of affected cache-lines for l2fwd.
For a bulk of 32 packets, assuming 64B per cache-line and 16B per HW desc:

                                          number of cache-lines accessed
                                          cache with pointers / cache with indexes
mempool_get:                              (32*8)/64=4     /   (32*4)/64=2
RX (read HW desc):                        (32*16)/64=8    /   (32*16)/64=8
RX (write mbuf fields, 1st cache line):   (32*64)/64=32   /   (32*64)/64=32
update mac addrs:                         (32*64)/64=32   /   (32*64)/64=32
TX (write HW desc):                       (32*16)/64=8    /   (32*16)/64=8
free mbufs (read 2nd mbuf cache line):    (32*64)/64=32   /   (32*64)/64=32
mempool_put:                              (32*8)/64=4     /   (32*4)/64=2
total:                                    120             /   116

So, if my calculations are correct, the maximum estimated gain in cache
utilization would be:
(120-116)*100/120 = 3.33%
Note that these numbers are for an over-simplistic usage scenario.
In more realistic ones, when we have to touch more cache-lines per packet,
that difference would be even less noticeable.
So I really doubt we will see any noticeable improvement in cache utilization
with that patch.
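
As a sanity check, the estimate above can be reproduced with a few lines of C
(the constants are just the assumptions stated earlier: 32-packet burst,
64B cache line, 16B HW descriptor, 8B pointer vs 4B index):

#include <stdio.h>

#define BURST	32	/* packets per bulk */
#define CL	64	/* cache-line size, bytes */
#define DESC_SZ	16	/* HW descriptor size, bytes */
#define PTR_SZ	8	/* per-core cache entry: pointer */
#define IDX_SZ	4	/* per-core cache entry: index */

int main(void)
{
	unsigned int desc = BURST * DESC_SZ / CL;	/* 8: RX/TX descriptor access */
	unsigned int mbuf = BURST;			/* 32: one full cache line per mbuf */

	/* mempool_get + RX desc + write mbuf + mac addrs + TX desc + free + mempool_put */
	unsigned int with_ptr = BURST * PTR_SZ / CL + desc + mbuf + mbuf +
			desc + mbuf + BURST * PTR_SZ / CL;
	unsigned int with_idx = BURST * IDX_SZ / CL + desc + mbuf + mbuf +
			desc + mbuf + BURST * IDX_SZ / CL;

	printf("pointers: %u, indexes: %u, max gain: %.2f%%\n",
		with_ptr, with_idx,
		(with_ptr - with_idx) * 100.0 / with_ptr);	/* 120, 116, 3.33% */
	return 0;
}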

> >
> >>> - The change introduces a new build time config option with a major 
> >>> limitation:
> >>>  All memzones in a pool have to be within the same 4GB boundary.
> >>>  To address it properly, extra changes will be required in 
> >>> init(/populate) part of the code.
> >>
> >> I agree to the above mentioned challenges and I am currently working on 
> >> resolving these issues.
> >
> > I still think that to justify such changes some really noticeable 
> > performance
> > improvement needs to be demonstrated: double-digit speedup for 
> > l3fwd/ipsec-secgw/...
> > Otherwise it's just not worth the hassle.
> >
> 
> Like I mentioned earlier, the app might not be bottlenecked at the mempool 
> per core cache.
> That could be the reason the numbers with l3fwd don’t fully show the 
> advantage of the patch.

As I said above, I don’t think we'll see any real advantage here.
But feel free to pick a different app and prove me wrong.
After all, we have plenty of sample apps that do put enough
pressure on the cache: l3fwd-acl, ipsec-secgw.
Or you can even apply these patches from Sean:
https://patches.dpdk.org/project/dpdk/list/?series=20999
to run l3fwd with configurable routes.
That should help you to make it cache-bound.

> I’m seeing double-digit improvement with mempool_perf_autotest which should 
> not be ignored.

And for others we are seeing a double-digit degradation.
So far the whole idea doesn't look promising at all, at least to me.
Konstantin
