27/10/2021 08:34, Aman Kumar:
> On Tue, Oct 26, 2021 at 9:44 PM Thomas Monjalon <tho...@monjalon.net> wrote:
>
> > 26/10/2021 17:56, Aman Kumar:
> > > This patch provides a rte_memcpy* call with temporal stores.
> > > Use -Dcpu_instruction_set=znverX with build to enable this API.
> > >
> > > Signed-off-by: Aman Kumar <aman.ku...@vvdntech.in>
> > > ---
> > > config/x86/meson.build | 2 +
> > > lib/eal/x86/include/rte_memcpy.h | 114 +++++++++++++++++++++++++++++++
> >
> > It looks better as C code.
> > Do you achieve the same performance as the asm version?
> >
>
> In a few corner cases assembly performed better, but overall we have very
> similar perf observations.
>
> > > +#if defined RTE_MEMCPY_AMDEPYC
> > [...]
> > > +static __rte_always_inline void *
> > > +rte_memcpy_aligned_tstore16_generic(void *dst, void *src, int len)
> >
> > So to be clear, an application will benefit from this optimization if
> > 1/ DPDK is specifically compiled for AMD
> > 2/ the application is compiled with above DPDK build (because of
> > inlining)
> >
> > I guess there is no good way to benefit from the optimization
> > without specific compilation, because of inlining constraint.
> > Another design, with less constraint but less performance,
> > would be to have a function pointer assigned at runtime based on the CPU.
> >
>
> You're right. We need to build DPDK and apps with this flag enabled to get
> the benefit.

So the x86 packages, as in Linux distributions, won't have this optimization.

> In future versions, we will try to adapt in a more dynamic way. Thanks.

No, I was trying to say that unfortunately there is probably no solution.
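
For illustration only, the "function pointer assigned at runtime based on the CPU" design mentioned above could look roughly like the sketch below. It is not code from the patch: copy_generic, copy_stream16, copy_select and copy_dispatch are hypothetical names, the streaming variant uses plain SSE2 intrinsics rather than the patch's implementation, and the vendor check assumes a GCC- or clang-compatible compiler on x86-64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical signature shared by all copy variants. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t len);

/* Portable fallback: the C library memcpy. */
static void *copy_generic(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

#if defined(__x86_64__)
/*
 * Illustrative copy using non-temporal (streaming) 16-byte stores.
 * _mm_stream_si128 requires a 16-byte aligned destination, so an
 * unaligned destination falls back to memcpy.
 */
static void *copy_stream16(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	if (((uintptr_t)d & 15) != 0)
		return memcpy(dst, src, len);

	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}
	_mm_sfence();           /* order the streaming stores */
	if (len > 0)
		memcpy(d, s, len);  /* copy the tail of fewer than 16 bytes */
	return dst;
}
#endif

/* Selected once at startup; defaults to the portable variant. */
static copy_fn_t copy_impl = copy_generic;

/* Pick a variant based on the CPU the process is running on. */
static void copy_select(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	__builtin_cpu_init();
	if (__builtin_cpu_is("amd"))
		copy_impl = copy_stream16;
#endif
}

/* Every call goes through the pointer, so the copy cannot be inlined. */
static inline void *copy_dispatch(void *dst, const void *src, size_t len)
{
	return copy_impl(dst, src, len);
}
```

An application would call copy_select() once during initialization and copy_dispatch() afterwards. A distribution binary built this way would pick the streaming variant on AMD CPUs without a specific build, but the indirect call prevents inlining, which is the "less constraint but less performance" trade-off described above.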