Hi Konstantin,

> On Jan 13, 2022, at 4:37 AM, Ananyev, Konstantin 
> <konstantin.anan...@intel.com> wrote:
> 
> 
> Hi Dharmik,
> 
>>> 
>>>> The current mempool per-core cache implementation stores pointers to mbufs.
>>>> On 64-bit architectures, each pointer consumes 8B.
>>>> This patch replaces it with an index-based implementation,
>>>> wherein each buffer is addressed by (pool base address + index).
>>>> It reduces the amount of memory/cache required for the per-core cache.
>>>> 
>>>> L3Fwd performance testing reveals minor improvements in the cache
>>>> performance (L1 and L2 misses reduced by 0.60%)
>>>> with no change in throughput
>>> 
>>> I feel really sceptical about that patch and the whole idea in general:
>>> - From what I read above there is no real performance improvement observed.
>>> (In fact on my IA boxes mempool_perf_autotest reports ~20% slowdown,
>>> see below for more details).
>> 
>> Currently, the optimizations (loop unroll and vectorization) are only
>> implemented for ARM64.
>> Similar optimizations can be implemented for x86 platforms, which should
>> close the performance gap and, in my understanding, give better performance
>> for a bulk size of 32.
> 
> Might be, but I still don't see the reason for such an effort.
> As you mentioned, there is no performance improvement in 'real' apps: l3fwd,
> etc. on ARM64, even with the vectorized version of the code.
> 

IMO, even without a throughput improvement, the patch is advantageous because
it achieves the same performance with less memory and cache: on 64-bit
platforms each cache entry shrinks from an 8B pointer to a 4B index, e.g. a
512-entry per-core cache drops from 4KB to 2KB.
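
To make the claim concrete, below is a minimal sketch (hypothetical names, not
the actual patch code) of the index-to-pointer conversion; the plain scalar
loop is what the ARM64 path unrolls and vectorizes:

    #include <stdint.h>

    /*
     * Hypothetical sketch: convert the 4-byte offsets held in the per-core
     * cache back into object pointers. 'base' is the pool's base address.
     * The loop is trivially unrollable/vectorizable, which is what the
     * ARM64-specific path does explicitly.
     */
    static inline void
    idx_to_obj(void *base, const uint32_t *idx, void **objs, unsigned int n)
    {
            unsigned int i;

            for (i = 0; i < n; i++)
                    objs[i] = (void *)((uintptr_t)base + idx[i]);
    }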

>>> - Space utilization difference looks negligible too.
>> 
>> Sorry, I did not understand this point.
> 
> As I understand it, one of the expectations from that patch was:
> reduced memory/cache requirements, which should improve cache utilization
> (fewer misses, etc.).
> Though I think such improvements would be negligible and wouldn't
> cause any real performance gain.

The cache utilization numbers are for the l3fwd app, which might not be
bottlenecked at the mempool per-core cache.
Theoretically, this patch enables storing twice the number of objects in the
cache compared to the original implementation.
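
A sketch of what that doubling looks like (hypothetical structures for
illustration, not the patch itself):

    #include <stdint.h>

    /*
     * The same 4KB of per-core cache memory holds 512 8-byte pointers
     * but 1024 4-byte offsets.
     */
    struct ptr_cache {
            void *objs[512];        /* 512 * 8B = 4096B */
    };

    struct idx_cache {
            uint32_t len;
            uint32_t objs[1024];    /* 1024 * 4B = 4096B */
    };

    /*
     * Put/get become base-relative. Bounds checks omitted for brevity;
     * the caller must guarantee the object lies within 4GB of 'base'.
     */
    static inline void
    idx_cache_put(struct idx_cache *c, void *base, void *obj)
    {
            c->objs[c->len++] = (uint32_t)((uintptr_t)obj - (uintptr_t)base);
    }

    static inline void *
    idx_cache_get(struct idx_cache *c, void *base)
    {
            return (void *)((uintptr_t)base + c->objs[--c->len]);
    }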

> 
>>> - The change introduces a new build-time config option with a major
>>> limitation:
>>>  All memzones in a pool have to be within the same 4GB boundary.
>>>  To address it properly, extra changes will be required in the
>>> init/populate part of the code.
>> 
>> I agree with the above-mentioned challenges and am currently working on
>> resolving them.
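
For context on why the 4GB boundary matters: a 32-bit offset can only reach
4GB from the pool base, so populate() would need a check along these lines
(a hypothetical sketch, not the final code):

    #include <stdint.h>

    /*
     * Hypothetical populate-time check: with 32-bit offsets, an object is
     * only addressable if it lies within 4GB of the pool base address.
     */
    static inline int
    obj_within_4gb(const void *base, const void *obj)
    {
            uintptr_t off = (uintptr_t)obj - (uintptr_t)base;

            return off <= UINT32_MAX;
    }
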
> 
> I still think that to justify such changes some really noticeable performance
> improvement needs to be demonstrated: double-digit speedup for
> l3fwd/ipsec-secgw/...
> Otherwise it's just not worth the hassle.
> 

As I mentioned earlier, the app might not be bottlenecked at the mempool
per-core cache.
That could be why the l3fwd numbers don't fully show the advantage of the
patch.
I'm seeing a double-digit improvement with mempool_perf_autotest, which should
not be ignored.
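
For reference, those numbers come from the standard unit test binary (the
build path below is an assumption and may differ per setup):

    $ ./build/app/dpdk-test
    RTE>>mempool_perf_autotest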

>>>  All that will complicate mempool code, will make it more error prone
>>>  and harder to maintain.
>>> But, as there is no real gain in return - no point to add such extra 
>>> complexity at all.
>>> 
>>> Konstantin
>>> 
