Hi,

Thank you for your valuable review comments and suggestions!

I will be sending out a v2 in which the maximum mempool size is increased to
32 GB by dividing the offset by sizeof(uintptr_t).
However, with this change I am seeing ~5% performance degradation with
mempool_perf_autotest (for a bulk size of 32) when compared to the base
performance.
Earlier, without this change, I was seeing an improvement of ~13% over the
base performance, so this is a significant degradation.
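
To make the change concrete, the v2 index handling works roughly along the
lines of the sketch below (simplified, with illustrative names only; this is
not the actual patch code):

/* Rough sketch of a per-core cache entry scaled by sizeof(uintptr_t).
 * Assumes every object offset from the pool base is a multiple of 8B. */
#include <stdint.h>

#define OBJ_INDEX_SHIFT 3	/* log2(sizeof(uintptr_t)) on 64-bit targets */

/* Convert an object pointer into a 4-byte cache entry. */
static inline uint32_t
obj_to_index(void *obj, uintptr_t pool_base)
{
	return (uint32_t)(((uintptr_t)obj - pool_base) >> OBJ_INDEX_SHIFT);
}

/* Convert a 4-byte cache entry back into an object pointer. */
static inline void *
index_to_obj(uint32_t idx, uintptr_t pool_base)
{
	return (void *)(pool_base + ((uintptr_t)idx << OBJ_INDEX_SHIFT));
}

With a 32-bit index at 8-byte granularity, the addressable range becomes
2^32 * 8 B = 32 GB. Presumably the extra shift on the get/put fast path
accounts for the ~5% degradation, though I have not isolated that yet.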
I would appreciate your review comments on v2.

Thank you!

> On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjac...@gmail.com> wrote:
> 
> On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <m...@smartsharesystems.com> 
> wrote:
>> 
>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>>> Sent: Friday, 7 January 2022 14.51
>>> 
>>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
>>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>>>>> Sent: Friday, 7 January 2022 12.16
>>>>> 
>>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
>>>>>>> From: Dharmik Thakkar [mailto:dharmik.thak...@arm.com]
>>>>>>> Sent: Friday, 24 December 2021 23.59
>>>>>>> 
>>>>>>> Current mempool per core cache implementation stores pointers to mbufs
>>>>>>> On 64b architectures, each pointer consumes 8B
>>>>>>> This patch replaces it with index-based implementation, where in each
>>>>>>> buffer is addressed by (pool base address + index)
>>>>>>> It reduces the amount of memory/cache required for per core cache
>>>>>>> 
>>>>>>> L3Fwd performance testing reveals minor improvements in the cache
>>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change in
>>>>>>> throughput
>>>>>>> 
>>>>>>> Micro-benchmarking the patch using mempool_perf_test shows significant
>>>>>>> improvement with majority of the test cases
>>>>>>> 
>>>>>> 
>>>>>> I still think this is very interesting. And your performance numbers
>>>>>> are looking good.
>>>>>> 
>>>>>> However, it limits the size of a mempool to 4 GB. As previously
>>>>>> discussed, the max mempool size can be increased by multiplying the
>>>>>> index with a constant.
>>>>>> 
>>>>>> I would suggest using sizeof(uintptr_t) as the constant multiplier, so
>>>>>> the mempool can hold objects of any size divisible by sizeof(uintptr_t).
>>>>>> And it would be silly to use a mempool to hold objects smaller than
>>>>>> sizeof(uintptr_t).
>>>>>> 
>>>>>> How does the performance look if you multiply the index by
>>>>>> sizeof(uintptr_t)?
>>>>>> 
>>>>> 
>>>>> Each mempool entry is cache aligned, so we can use that if we want a
>>>>> bigger multiplier.
>>>> 
>>>> Thanks for chiming in, Bruce.
>>>> 
>>>> Please also read this discussion about the multiplier:
>>>> http://inbox.dpdk.org/dev/calbae1prqyyog96f6ecew1vpf3toh1h7mzzuliy95z9xjbr...@mail.gmail.com/
>>>> 
>>> 
>>> I actually wondered after I had sent the email whether we had indeed an
>>> option to disable the cache alignment or not! Thanks for pointing out that
>>> we do. This brings a couple of additional thoughts:
>>> 
>>> * Using indexes for the cache should probably be a runtime flag rather
>>>   than a build-time one.
>>> * It would seem reasonable to me to disallow use of the indexed-cache flag
>>>   and the non-cache-aligned flag simultaneously.
>>> * On the off chance that that restriction is unacceptable, then we can
>>>   make things a little more complicated by doing a runtime computation of
>>>   the "index-shiftwidth" to use.
>>> 
>>> Overall, I think defaulting to cacheline shiftwidth and disallowing
>>> index-based addressing when using unaligned buffers is simplest and
>>> easiest, unless we can come up with a valid use case for needing more than
>>> that.
>>> 
>>> /Bruce
>> 
>> This feature is a performance optimization.
>> 
>> With that in mind, it should not introduce function pointers or similar
>> run-time checks in the fast path to determine what kind of cache to use per
>> mempool. And if an index multiplier is implemented, it should be a compile
>> time constant, probably somewhere between sizeof(uintptr_t) and
>> RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
>> 
>> The patch comes with a tradeoff between better performance and limited
>> mempool size, and possibly some limitations regarding very small objects
>> that are left non-cache-line-aligned in order to avoid wasting memory
>> (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
>> 
>> With no multiplier, the only tradeoff is that the mempool size is limited to 
>> 4 GB.
>> 
>> If the multiplier is small (e.g. 8 bytes), the only tradeoff is that the
>> mempool size is limited to 32 GB. (And a waste of memory for objects
>> smaller than 8 bytes; but I don't think anyone would use a mempool to hold
>> objects smaller than 8 bytes.)
>> 
>> If the multiplier is larger (e.g. the 64-byte cache line size), the mempool
>> size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has
>> no effect.
>> 
>> Note: 32-bit platforms get no benefit from this patch: the pointer already
>> only uses 4 bytes, so replacing the pointer with a 4-byte index makes no
>> difference.
>> 
>> 
>> Since this feature is a performance optimization only, and doesn't provide 
>> any new features, I don't mind it being a compile time option.
>> 
>> If this feature is a compile time option, and the mempool library is 
>> compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ 
>> could be made undefined in the public header file, so compilation of 
>> applications using the flag will fail. And rte_mempool_create() could 
>> RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags 
>> parameter, or emit a warning about the flag being ignored. Obviously, 
>> rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger 
>> than the library supports, possibly emitting a message that the mempool 
>> library should be built without this feature to support the larger mempool.
>> 
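
For illustration, the guards suggested above could look roughly like the
sketch below. The names and placement follow the description in the quoted
mail; the actual flag plumbing in v2 may differ, and the 32 GB limit assumes
the 8-byte multiplier.

/* Sketch only: checks along the lines suggested above. */
#include <stdint.h>
#include <rte_debug.h>
#include <rte_mempool.h>

static void
check_indexed_cache_limits(uint32_t n, uint32_t total_elt_sz,
			   unsigned int flags)
{
	/* Reject the object-alignment flag when it would be ignored. */
	RTE_ASSERT(!(flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ));

	/* The pool must fit in the range a 32-bit index can address. */
	RTE_ASSERT((uint64_t)n * total_elt_sz <= ((uint64_t)1 << 35));
}
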
>> Here is another thought: If only exotic applications use mempools larger 
>> than 32 GB, this would be a generally acceptable limit, and DPDK should use 
>> index-based cache as default, making the opposite (i.e. pointer-based cache) 
>> a compile time option instead. A similar decision was recently made for 
>> limiting the RTE_MAX_LCORE default.
>> 
>> 
>> Although DPDK is moving away from compile time options in order to better 
>> support Linux distros, there should be a general exception for performance 
>> and memory optimizations. Otherwise, network appliance vendors will inherit 
>> the increasing amount of DPDK bloat, and we (network appliance vendors) will 
>> eventually be forced to fork DPDK to get rid of the bloat and achieve the 
>> goals originally intended by DPDK.
> 
> Agree with Morten's view on this.
> 
>> If anyone disagrees with the principle about a general exception for 
>> performance and memory optimizations, I would like to pass on the decision 
>> to the Techboard!
>> 
