On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <m...@smartsharesystems.com> wrote: > > > From: Bruce Richardson [mailto:bruce.richard...@intel.com] > > Sent: Friday, 7 January 2022 14.51 > > > > On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote: > > > > From: Bruce Richardson [mailto:bruce.richard...@intel.com] > > > > Sent: Friday, 7 January 2022 12.16 > > > > > > > > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote: > > > > > > From: Dharmik Thakkar [mailto:dharmik.thak...@arm.com] Sent: > > > > Friday, 24 > > > > > > December 2021 23.59 > > > > > > > > > > > > Current mempool per core cache implementation stores pointers > > to > > > > mbufs > > > > > > On 64b architectures, each pointer consumes 8B This patch > > replaces > > > > it > > > > > > with index-based implementation, where in each buffer is > > addressed > > > > by > > > > > > (pool base address + index) It reduces the amount of > > memory/cache > > > > > > required for per core cache > > > > > > > > > > > > L3Fwd performance testing reveals minor improvements in the > > cache > > > > > > performance (L1 and L2 misses reduced by 0.60%) with no change > > in > > > > > > throughput > > > > > > > > > > > > Micro-benchmarking the patch using mempool_perf_test shows > > > > significant > > > > > > improvement with majority of the test cases > > > > > > > > > > > > > > > > I still think this is very interesting. And your performance > > numbers > > > > are > > > > > looking good. > > > > > > > > > > However, it limits the size of a mempool to 4 GB. As previously > > > > > discussed, the max mempool size can be increased by multiplying > > the > > > > index > > > > > with a constant. > > > > > > > > > > I would suggest using sizeof(uintptr_t) as the constant > > multiplier, > > > > so > > > > > the mempool can hold objects of any size divisible by > > > > sizeof(uintptr_t). > > > > > And it would be silly to use a mempool to hold objects smaller > > than > > > > > sizeof(uintptr_t). > > > > > > > > > > How does the performance look if you multiply the index by > > > > > sizeof(uintptr_t)? > > > > > > > > > > > > > Each mempool entry is cache aligned, so we can use that if we want > > a > > > > bigger > > > > multiplier. > > > > > > Thanks for chiming in, Bruce. > > > > > > Please also read this discussion about the multiplier: > > > http://inbox.dpdk.org/dev/calbae1prqyyog96f6ecew1vpf3toh1h7mzzuliy95z9xjbr...@mail.gmail.com/ > > > > > > > I actually wondered after I had sent the email whether we had indeed an > > option to disable the cache alignment or not! Thanks for pointing out > > that > > we do. This brings a couple additional thoughts: > > > > * Using indexes for the cache should probably be a runtime flag rather > > than > > a build-time one. > > * It would seem reasonable to me to disallow use of the indexed-cache > > flag > > and the non-cache aligned flag simultaneously. > > * On the offchance that that restriction is unacceptable, then we can > > make things a little more complicated by doing a runtime computation > > of > > the "index-shiftwidth" to use. > > > > Overall, I think defaulting to cacheline shiftwidth and disallowing > > index-based addressing when using unaligned buffers is simplest and > > easiest > > unless we can come up with a valid usecase for needing more than that. > > > > /Bruce > > This feature is a performance optimization. > > With that in mind, it should not introduce function pointers or similar > run-time checks or in the fast path, to determine what kind of cache to use > per mempool. And if an index multiplier is implemented, it should be a > compile time constant, probably something between sizeof(uintptr_t) or > RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE). > > The patch comes with a tradeoff between better performance and limited > mempool size, and possibly some limitations regarding very small objects that > are not cache line aligned to avoid wasting memory > (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ). > > With no multiplier, the only tradeoff is that the mempool size is limited to > 4 GB. > > If the multiplier is small (i.e. 8 bytes) the only tradeoff is that the > mempool size is limited to 32 GB. (And a waste of memory for objects smaller > than 8 byte; but I don't think anyone would use a mempool to hold objects > smaller than 8 byte.) > > If the multiplier is larger (i.e. 64 bytes cache line size), the mempool size > is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no > effect. > > Note: 32 bit platforms have no benefit from this patch: The pointer already > only uses 4 bytes, so replacing the pointer with a 4 byte index makes no > difference. > > > Since this feature is a performance optimization only, and doesn't provide > any new features, I don't mind it being a compile time option. > > If this feature is a compile time option, and the mempool library is compiled > with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be > made undefined in the public header file, so compilation of applications > using the flag will fail. And rte_mempool_create() could RTE_ASSERT() that > RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a > warning about the flag being ignored. Obviously, rte_mempool_create() should > also RTE_ASSERT() that the mempool is not larger than the library supports, > possibly emitting a message that the mempool library should be built without > this feature to support the larger mempool. > > Here is another thought: If only exotic applications use mempools larger than > 32 GB, this would be a generally acceptable limit, and DPDK should use > index-based cache as default, making the opposite (i.e. pointer-based cache) > a compile time option instead. A similar decision was recently made for > limiting the RTE_MAX_LCORE default. > > > Although DPDK is moving away from compile time options in order to better > support Linux distros, there should be a general exception for performance > and memory optimizations. Otherwise, network appliance vendors will inherit > the increasing amount of DPDK bloat, and we (network appliance vendors) will > eventually be forced to fork DPDK to get rid of the bloat and achieve the > goals originally intended by DPDK.
Agree with Morten's view on this. >If anyone disagrees with the principle about a general exception for >performance and memory optimizations, I would like to pass on the decision to >the Techboard! >