On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> I tried a little experiment, which gave a 25 % improvement in mempool
> perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> constant_n=0) on a Xeon E5-2620 v4 based system.
>
> This is the concept:
>
> If all accesses to the mempool driver go through the mempool cache,
> we can ensure that these bulk loads/stores are always CPU cache aligned,
> by using cache->size when loading/storing to the mempool driver.
>
> Furthermore, it is rumored that most applications use the default
> mempool cache size, so if the driver tests for that specific value,
> it can use rte_memcpy(src,dst,N) with N known at build time, allowing
> optimal performance for copying the array of objects.
>
> Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> be able to always use cache->size when loading/storing to the mempool
> driver.
>
> What do you think?
>
> PS: If we can't get rid of the mempool cache size threshold factor,
> we really need to expose it through public APIs. A job for another day.
>
> Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> ---

Interesting, thanks.
Out of interest, is there any difference in performance if you use the
regular libc memcpy instead of rte_memcpy for the ring copies? Since the
copy amount is constant, a regular memcpy call should be expanded inline
by the compiler itself, and so should be pretty efficient.

/Bruce