On Thu, 9 Nov 2023 11:45:46 +0100 Morten Brørup <m...@smartsharesystems.com> wrote:
> +TO: Andrew, mempool maintainer
>
> > From: Morten Brørup [mailto:m...@smartsharesystems.com]
> > Sent: Monday, 6 November 2023 11.29
> >
> > > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > > Sent: Monday, 6 November 2023 10.45
> > >
> > > On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > > > I tried a little experiment, which gave a 25 % improvement in
> > > > mempool perf tests for long bursts (n_get_bulk=32 n_put_bulk=32
> > > > n_keep=512 constant_n=0) on a Xeon E5-2620 v4 based system.
> > > >
> > > > This is the concept:
> > > >
> > > > If all accesses to the mempool driver go through the mempool
> > > > cache, we can ensure that these bulk loads/stores are always CPU
> > > > cache aligned, by using cache->size when loading/storing to the
> > > > mempool driver.
> > > >
> > > > Furthermore, it is rumored that most applications use the default
> > > > mempool cache size, so if the driver tests for that specific
> > > > value, it can use rte_memcpy(src,dst,N) with N known at build
> > > > time, allowing optimal performance for copying the array of
> > > > objects.
> > > >
> > > > Unfortunately, I need to change the flush threshold from 1.5 to 2
> > > > to be able to always use cache->size when loading/storing to the
> > > > mempool driver.
> > > >
> > > > What do you think?
>
> It's the concept of accessing the underlying mempool in entire cache
> lines that I am seeking feedback on.
>
> The provided code is just an example, mainly for testing the performance
> of the concept.
>
> > > > PS: If we can't get rid of the mempool cache size threshold
> > > > factor, we really need to expose it through public APIs. A job
> > > > for another day.
>
> The concept that a mempool per-lcore cache can hold more objects than
> its size is extremely weird, and certainly unexpected by any normal
> developer. Thus it is likely to cause runtime errors for applications
> designing tightly sized mempools.
>
> So, if we move forward with this RFC, I propose eliminating the
> threshold factor, so the mempool per-lcore caches cannot hold more
> objects than their size. When doing this, we might also choose to double
> RTE_MEMPOOL_CACHE_MAX_SIZE, to prevent any performance degradation.
>
> > > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > > ---
> > >
> > > Interesting, thanks.
> > >
> > > Out of interest, is there any difference in performance you observe
> > > if using regular libc memcpy vs rte_memcpy for the ring copies?
> > > Since the copy amount is constant, a regular memcpy call should be
> > > expanded by the compiler itself, and so should be pretty efficient.
> >
> > I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
> > introducing the constant-size copy loop. I got the majority of the
> > performance gain at this point.
> >
> > At this point, both pointers are CPU cache aligned when refilling the
> > mempool cache, and the destination pointer is CPU cache aligned when
> > draining the mempool cache.
> >
> > In other words: when refilling the mempool cache, it is both loading
> > and storing entire CPU cache lines. And when draining, it is storing
> > entire CPU cache lines.
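
To illustrate the drain side of the concept under discussion, here is a
rough sketch (not the code from the patch; the helper name is made up,
and always flushing the oldest cache->size objects is a simplification):

/* Sketch only: flush exactly cache->size objects to the backend whenever
 * the cache reaches its flush threshold. The objs[] array in struct
 * rte_mempool_cache is cache-line aligned, and cache->size objects span
 * whole cache lines (assuming cache->size * sizeof(void *) is a multiple
 * of RTE_CACHE_LINE_SIZE), so the bulk store to the ring always covers
 * entire CPU cache lines.
 */
#include <string.h>
#include <rte_mempool.h>

static inline void
example_cache_flush_in_cache_lines(struct rte_mempool *mp,
                                   struct rte_mempool_cache *cache)
{
        if (cache->len < cache->flushthresh)
                return;

        /* Bulk store of exactly cache->size objects, starting at the
         * cache-line aligned beginning of objs[].
         * Error handling omitted in this sketch. */
        (void)rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0],
                                           cache->size);

        /* Keep the remaining objects in the cache; a plain memmove is
         * good enough for this sketch. */
        cache->len -= cache->size;
        memmove(&cache->objs[0], &cache->objs[cache->size],
                cache->len * sizeof(void *));
}

The refill path would be the mirror image: dequeue exactly cache->size
objects from the backend into the cache, so both the load from the ring
and the store into the cache cover whole cache lines.
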
> > Adding the fixed-size copy loop provided an additional performance
> > gain. I didn't test constant-size copy methods other than rte_memcpy.
> >
> > rte_memcpy should have optimal conditions in this patch, because N is
> > known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
> > are CPU cache aligned when refilling the mempool cache, and the
> > destination pointer is CPU cache aligned when draining the mempool
> > cache. I don't recall whether pointer alignment matters for
> > rte_memcpy, though.
> >
> > The memcpy in libc (or, more correctly, the compiler intrinsic) will
> > do non-temporal copying for large sizes, and I don't know what that
> > threshold is, so I think rte_memcpy is the safe bet here. Especially
> > if someone builds DPDK with a mempool cache size larger than 512
> > objects.
> >
> > On the other hand, non-temporal access to the objects in the ring
> > might be beneficial if the ring is so large that they go cold before
> > the application loads them from the ring again.

The patch makes sense. I would prefer use of memcpy(); rte_memcpy usage
should decline, and it makes no sense here, since the compiler inlining
will be the same. Also, the patch no longer applies to main, so it needs
a rebase.
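
For reference, the constant-size copy question boils down to something
like the sketch below (a hypothetical helper, not code from the patch; it
assumes the build-time constant in question is RTE_MEMPOOL_CACHE_MAX_SIZE,
i.e. 512 objects by default):

/* Sketch only: when the length is a compile-time constant, a plain
 * memcpy() is expanded inline by the compiler, so it should perform on
 * par with rte_memcpy() for this case.
 */
#include <string.h>
#include <rte_memcpy.h>
#include <rte_mempool.h>

static inline void
example_copy_objs(void **dst, void * const *src, unsigned int n)
{
        if (n == RTE_MEMPOOL_CACHE_MAX_SIZE) {
                /* N is known at build time (512 * 8 = 4 KiB on 64-bit),
                 * so the compiler can emit an optimal fixed-size copy. */
                memcpy(dst, src,
                       RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(void *));
        } else {
                /* Variable length: fall back to rte_memcpy (or memcpy). */
                rte_memcpy(dst, src, n * sizeof(void *));
        }
}

With the constant branch taken, both memcpy and rte_memcpy can be expanded
at build time; the open question in the thread is only whether the
libc/compiler version switches to non-temporal stores above some size
threshold.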