> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> Sent: Monday, 6 November 2023 10.45
> 
> On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > I tried a little experiment, which gave a 25 % improvement in mempool
> > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > constant_n=0) on a Xeon E5-2620 v4 based system.
> >
> > This is the concept:
> >
> > If all accesses to the mempool driver goes through the mempool cache,
> > we can ensure that these bulk load/stores are always CPU cache
> aligned,
> > by using cache->size when loading/storing to the mempool driver.
> >
> > Furthermore, it is rumored that most applications use the default
> > mempool cache size, so if the driver tests for that specific value,
> > it can use rte_memcpy(src,dst,N) with N known at build time, allowing
> > optimal performance for copying the array of objects.
> >
> > Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> > be able to always use cache->size when loading/storing to the mempool
> > driver.
> >
> > What do you think?
> >
> > PS: If we can't get rid of the mempool cache size threshold factor,
> > we really need to expose it through public APIs. A job for another
> day.
> >
> > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > ---
> Interesting, thanks.
> 
> Out of interest, is there any different in performance you observe if
> using
> regular libc memcpy vs rte_memcpy for the ring copies? Since the copy
> amount is constant, a regular memcpy call should be expanded by the
> compiler itself, and so should be pretty efficient.

I ran some tests without patching rte_ring_elem_pvt.h, i.e. without introducing 
the constant-size copy loop. I got the majority of the performance gain at this 
point.

At this point, both pointers are CPU cache aligned when refilling the mempool 
cache, and the destination pointer is CPU cache aligned when draining the 
mempool cache.

In other words: When refilling the mempool cache, it is both loading and 
storing entire CPU cache lines. And when draining, it is storing entire CPU 
cache lines.


Adding the fixed-size copy loop provided an additional performance gain. I 
didn't test other constant-size copy methods than rte_memcpy.

rte_memcpy should have optimal conditions in this patch, because N is known to 
be 512 * 8 = 4 KiB at build time. Furthermore, both pointers are CPU cache 
aligned when refilling the mempool cache, and the destination pointer is CPU 
cache aligned when draining the mempool cache. I don't recall if pointer 
alignment matters for rte_memcpy, though.

The memcpy in libc (or more correctly: intrinsic to the compiler) will do 
non-temporal copying for large sizes, and I don't know what that threshold is, 
so I think rte_memcpy is the safe bet here. Especially if someone builds DPDK 
with a larger mempool cache size than 512 objects.

On the other hand, non-temporal access to the objects in the ring might be 
beneficial if the ring is so large that they go cold before the application 
loads them from the ring again.

Reply via email to