26/10/2021 17:56, Aman Kumar:
> This patch provides a rte_memcpy* call with temporal stores.
> Use -Dcpu_instruction_set=znverX with build to enable this API.
>
> Signed-off-by: Aman Kumar <aman.ku...@vvdntech.in>
> ---
>  config/x86/meson.build           |   2 +
>  lib/eal/x86/include/rte_memcpy.h | 114 +++++++++++++++++++++++++++++++

It looks better as C code.
Do you achieve the same performance as the asm version?

> +#if defined RTE_MEMCPY_AMDEPYC
[...]
> +static __rte_always_inline void *
> +rte_memcpy_aligned_tstore16_generic(void *dst, void *src, int len)

So to be clear, an application will benefit from this optimization if
1/ DPDK is specifically compiled for AMD
2/ the application is compiled against that DPDK build (because of inlining)

I guess there is no good way to benefit from the optimization
without specific compilation, because of the inlining constraint.

Another design, with fewer constraints but lower performance,
would be to have a function pointer assigned at runtime based on the CPU.
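
To make the runtime-dispatch alternative concrete, here is a rough sketch,
not a proposal for the actual patch. All names in it (copy_fn_t,
copy_generic, copy_tuned, cpu_prefers_tuned_copy, copy_select_impl,
copy_dispatch) are hypothetical, both copy paths are plain memcpy
placeholders, and the CPU check is a stub that always returns 0; a real
version would presumably query the CPU flags once at init.

```c
#include <stddef.h>
#include <string.h>

/* Common signature shared by every copy implementation. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t len);

/* Generic path: plain memcpy, valid on any CPU. */
static void *copy_generic(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/*
 * Placeholder for the CPU-specific path. In a real build this is where
 * the tuned routine would go; memcpy keeps the sketch portable.
 */
static void *copy_tuned(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/*
 * Hypothetical CPU check (assumption): always reports "not supported"
 * here. A real implementation would inspect CPU features at init.
 */
static int cpu_prefers_tuned_copy(void)
{
	return 0;
}

/* Defaults to the generic path until selection has run. */
static copy_fn_t copy_impl = copy_generic;

/* Call once at startup, before any data-path copy. */
static void copy_select_impl(void)
{
	copy_impl = cpu_prefers_tuned_copy() ? copy_tuned : copy_generic;
}

/* Every call pays one indirect branch; nothing is inlined per call site. */
static inline void *copy_dispatch(void *dst, const void *src, size_t len)
{
	return copy_impl(dst, src, len);
}
```

The application would call copy_select_impl() once during init and then
use copy_dispatch() everywhere. The cost is the indirect call on each
copy, which is the performance loss mentioned above; the gain is that a
single generic build can still pick the tuned path on a matching CPU.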