Hi,

Thank you for your valuable review comments and suggestions!
I will be sending out a v2 in which the maximum mempool size is increased to 32 GB by scaling the index by sizeof(uintptr_t) (i.e. dividing the object's offset from the pool base address by sizeof(uintptr_t) to form the index). However, I am seeing ~5% performance degradation with mempool_perf_autotest (for a bulk size of 32) with this change, compared to the base performance. Earlier, without this change, I was seeing an improvement of ~13% over the base performance, so this is a significant degradation. I would appreciate your review comments on v2. Thank you!
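For reference, the conversion in v2 is along the following lines. This is only a rough sketch, not the exact patch code, and the helper names are made up for illustration:

#include <stdint.h>

/*
 * Sketch of the index-based cache conversion: a 4 byte index scaled by
 * sizeof(uintptr_t) (8 B on 64-bit) extends the addressable pool size
 * from 4 GB to 32 GB. Objects must start at offsets from the pool base
 * address that are divisible by sizeof(uintptr_t).
 */

/* Object pointer -> 4 byte index, used when filling the per-core cache. */
static inline uint32_t
obj_to_index(const void *pool_base, const void *obj)
{
	return (uint32_t)(((uintptr_t)obj - (uintptr_t)pool_base) /
			sizeof(uintptr_t));
}

/* 4 byte index -> object pointer, used when draining the per-core cache. */
static inline void *
index_to_obj(const void *pool_base, uint32_t index)
{
	return (void *)((uintptr_t)pool_base +
			(uintptr_t)index * sizeof(uintptr_t));
}

Since sizeof(uintptr_t) is a power of two, the division and multiplication compile down to 3-bit shifts on 64-bit targets.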
> On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjac...@gmail.com> wrote:
>
> On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <m...@smartsharesystems.com> wrote:
>>
>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>>> Sent: Friday, 7 January 2022 14.51
>>>
>>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
>>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>>>>> Sent: Friday, 7 January 2022 12.16
>>>>>
>>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
>>>>>>> From: Dharmik Thakkar [mailto:dharmik.thak...@arm.com]
>>>>>>> Sent: Friday, 24 December 2021 23.59
>>>>>>>
>>>>>>> The current mempool per-core cache implementation stores pointers
>>>>>>> to mbufs. On 64b architectures, each pointer consumes 8B. This
>>>>>>> patch replaces it with an index-based implementation, wherein each
>>>>>>> buffer is addressed by (pool base address + index). It reduces the
>>>>>>> amount of memory/cache required for the per-core cache.
>>>>>>>
>>>>>>> L3Fwd performance testing reveals minor improvements in the cache
>>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change in
>>>>>>> throughput.
>>>>>>>
>>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
>>>>>>> significant improvement with the majority of the test cases.
>>>>>>
>>>>>> I still think this is very interesting. And your performance
>>>>>> numbers are looking good.
>>>>>>
>>>>>> However, it limits the size of a mempool to 4 GB. As previously
>>>>>> discussed, the max mempool size can be increased by multiplying the
>>>>>> index with a constant.
>>>>>>
>>>>>> I would suggest using sizeof(uintptr_t) as the constant multiplier,
>>>>>> so the mempool can hold objects of any size divisible by
>>>>>> sizeof(uintptr_t). And it would be silly to use a mempool to hold
>>>>>> objects smaller than sizeof(uintptr_t).
>>>>>>
>>>>>> How does the performance look if you multiply the index by
>>>>>> sizeof(uintptr_t)?
>>>>>
>>>>> Each mempool entry is cache aligned, so we can use that if we want a
>>>>> bigger multiplier.
>>>>
>>>> Thanks for chiming in, Bruce.
>>>>
>>>> Please also read this discussion about the multiplier:
>>>> http://inbox.dpdk.org/dev/calbae1prqyyog96f6ecew1vpf3toh1h7mzzuliy95z9xjbr...@mail.gmail.com/
>>>
>>> I actually wondered after I had sent the email whether we had indeed
>>> an option to disable the cache alignment or not! Thanks for pointing
>>> out that we do. This brings a couple of additional thoughts:
>>>
>>> * Using indexes for the cache should probably be a runtime flag rather
>>>   than a build-time one.
>>> * It would seem reasonable to me to disallow use of the indexed-cache
>>>   flag and the non-cache-aligned flag simultaneously.
>>> * On the off chance that that restriction is unacceptable, we can make
>>>   things a little more complicated by doing a runtime computation of
>>>   the "index-shiftwidth" to use.
>>>
>>> Overall, I think defaulting to the cacheline shiftwidth and
>>> disallowing index-based addressing when using unaligned buffers is
>>> simplest and easiest, unless we can come up with a valid use case for
>>> needing more than that.
>>>
>>> /Bruce
>>
>> This feature is a performance optimization.
>>
>> With that in mind, it should not introduce function pointers or similar
>> run-time checks in the fast path to determine what kind of cache to use
>> per mempool. And if an index multiplier is implemented, it should be a
>> compile-time constant, probably something between sizeof(uintptr_t) and
>> RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
>>
>> The patch comes with a tradeoff between better performance and limited
>> mempool size, and possibly some limitations regarding very small
>> objects that are not cache line aligned to avoid wasting memory
>> (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
>>
>> With no multiplier, the only tradeoff is that the mempool size is
>> limited to 4 GB.
>>
>> If the multiplier is small (i.e. 8 bytes), the only tradeoff is that
>> the mempool size is limited to 32 GB. (And a waste of memory for
>> objects smaller than 8 bytes; but I don't think anyone would use a
>> mempool to hold objects smaller than 8 bytes.)
>>
>> If the multiplier is larger (i.e. the 64 byte cache line size), the
>> mempool size is instead limited to 256 GB, but
>> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
>>
>> Note: 32-bit platforms gain no benefit from this patch: the pointer
>> already only uses 4 bytes, so replacing the pointer with a 4-byte index
>> makes no difference.
>>
>> Since this feature is a performance optimization only, and doesn't
>> provide any new features, I don't mind it being a compile-time option.
>>
>> If this feature is a compile-time option, and the mempool library is
>> compiled with the large multiplier, then
>> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public
>> header file, so compilation of applications using the flag will fail.
>> And rte_mempool_create() could RTE_ASSERT() that
>> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or
>> emit a warning about the flag being ignored. Obviously,
>> rte_mempool_create() should also RTE_ASSERT() that the mempool is not
>> larger than the library supports, possibly emitting a message that the
>> mempool library should be built without this feature to support the
>> larger mempool.
>>
>> Here is another thought: If only exotic applications use mempools
>> larger than 32 GB, this would be a generally acceptable limit, and DPDK
>> should use the index-based cache as the default, making the opposite
>> (i.e. the pointer-based cache) a compile-time option instead. A similar
>> decision was recently made for limiting the RTE_MAX_LCORE default.
>>
>> Although DPDK is moving away from compile-time options in order to
>> better support Linux distros, there should be a general exception for
>> performance and memory optimizations. Otherwise, network appliance
>> vendors will inherit the increasing amount of DPDK bloat, and we
>> (network appliance vendors) will eventually be forced to fork DPDK to
>> get rid of the bloat and achieve the goals originally intended by DPDK.
>
> Agree with Morten's view on this.
>
>> If anyone disagrees with the principle about a general exception for
>> performance and memory optimizations, I would like to pass the decision
>> on to the Techboard!
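To confirm my understanding of the creation-time guards suggested above, I would add something along these lines. Again, this is only a hypothetical sketch, not actual patch or DPDK code; the helper name and its parameters are made up for illustration:

#include <errno.h>
#include <stdint.h>
#include <rte_debug.h>
#include <rte_log.h>
#include <rte_mempool.h>

/*
 * Hypothetical creation-time guard: reject configurations that the
 * index-based cache cannot represent. The limit below assumes a 4 byte
 * index multiplied by sizeof(uintptr_t) on a 64-bit platform, i.e. 32 GB.
 */
static int
check_index_cache_limits(uint32_t n, uint32_t total_obj_size,
		unsigned int flags)
{
	const uint64_t max_pool_size =
		(uint64_t)UINT32_MAX * sizeof(uintptr_t);

	/* The object-alignment populate flag has no effect in this mode. */
	RTE_ASSERT(!(flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ));

	/* Refuse pools that a 4 byte index cannot address. */
	if ((uint64_t)n * total_obj_size > max_pool_size) {
		RTE_LOG(ERR, MEMPOOL,
			"mempool too large for index-based cache; build the mempool library without it\n");
		return -EINVAL;
	}
	return 0;
}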