On Fri, Sep 20, 2024 at 12:36 PM Mattias Rönnblom <mattias.ronnb...@ericsson.com> wrote: > > Provide build option to have functions in <rte_memcpy.h> delegate to > the standard compiler/libc memcpy(), instead of using the various > custom DPDK, handcrafted, per-architecture rte_memcpy() > implementations. > > A new meson build option 'use_cc_memcpy' is added. By default, the > traditional, custom DPDK rte_memcpy() implementation is used. > > The performance benefits of the custom DPDK rte_memcpy() > implementations have been diminishing with every compiler release, and > with current toolchains the use of a custom memcpy() implementation > may even be a liability. > > An additional benefit of this change is that compilers and static > analysis tools have an easier time detecting incorrect usage of > rte_memcpy() (e.g., buffer overruns, or overlapping source and > destination buffers). > > Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com> > Acked-by: Morten Brørup <m...@smartsharesystems.com>
I like this patch and the direction we are taking: stop reinvent memcpy and rely on compiler to optimize it. I have some comments on the implementation. - When I splitted headers in the early days of dpdk, the intention with arch-specific headers in EAL was to have them include the generic one, in all cases. It seems that, over time, x86 rte_memcpy.h (at least) deviated from this and stopped including generic/rte_memcpy.h... So in this current patch, I expect every arch specific headers first include generic/rte_memcpy.h, regardless of any arch-specific define coming from the configuration. An additional note on this, ARM32 and ARM64 have their own implementation in rte_memcpy_32.h resp. rte_memcpy_64.h, and I would check RTE_USE_CC_MEMCPY in each of them rather than in the top as ARM32 and ARM64 are like two different arches. - Now, looking at what was available for arches so far in DPDK: * ARM was relying by default on compiler implementation, with specific implementations for ARM32 and ARM64 available (see for more details below) => possible values (default first) RTE_USE_CC_MEMCPY = true / false * loongarch was relying on compiler implementation, with no specific implementations, => RTE_USE_CC_MEMCPY = true * ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false * risc was relying on compiler implementation, with no specific implementations, => RTE_USE_CC_MEMCPY = true * x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false We can't get a unified default value for a meson option and keep compat for all arches (except maybe introduce a "auto" value). Plus, disabling RTE_USE_CC_MEMCPY on loongarch and risc makes no sense, as there was never a specific implementation. My suggestion is to drop the meson option and instead just set RTE_USE_CC_MEMCPY in config/$arch/meson.build. Testers / interested users may edit config/$arch/meson.build on their own. - Additionnally, ARM people have introduced arch-specific implementation config options for memcpy in ARM32 resp. ARM64: RTE_ARCH_ARM_NEON_MEMCPY resp. RTE_ARCH_ARM64_MEMCPY. RTE_USE_CC_MEMCPY can replace those two options (we may keep some compat in case someone relied on those defines for arm). That removes the need for a RTE_CC_MEMCPY define. More comments below: [snip] > diff --git a/doc/guides/rel_notes/release_24_11.rst > b/doc/guides/rel_notes/release_24_11.rst > index 0ff70d9057..8be000294d 100644 > --- a/doc/guides/rel_notes/release_24_11.rst > +++ b/doc/guides/rel_notes/release_24_11.rst > @@ -55,6 +55,26 @@ New Features > Also, make sure to start the actual text at the margin. > ======================================================= > > +* **Compiler memcpy replaces custom DPDK implementation.** > + > + The memory copy functions of ``<rte_memcpy.h>`` now optionally > + delegates to the standard memcpy() function, implemented by the > + compiler and the C runtime (e.g., libc). > + > + In this release of DPDK, the handcrafted, per-architecture memory > + copy implementations are still the default. Compiler memcpy is > + enabled by setting the new ``use_cc_memcpy`` build option to true. > + > + The performance benefits of the custom DPDK rte_memcpy() > + implementations have been diminishing with every new compiler > + release, and with current toolchains the use of a custom memcpy() > + implementation may even result in worse performance than the > + standard memcpy(). > + > + An additional benefit of using compiler memcpy is that compilers and > + static analysis tools have an easier time detecting incorrect usage > + of rte_memcpy() (e.g., buffer overruns, or overlapping source and > + destination buffers). As explained in the RN comments, an entry should use the form: * **Add a title in the past tense with a full stop.** Add a short 1-2 sentence description in the past tense. The description should be enough to allow someone scanning the release notes to understand the new feature. It seems this note is a copy/paste of the commit log, please adjust the title and make the description shorter. > > Removed Items > ------------- [snip] > diff --git a/lib/eal/include/generic/rte_memcpy.h > b/lib/eal/include/generic/rte_memcpy.h > index e7f0f8eaa9..cfb0175bd2 100644 > --- a/lib/eal/include/generic/rte_memcpy.h > +++ b/lib/eal/include/generic/rte_memcpy.h > @@ -5,12 +5,19 @@ > #ifndef _RTE_MEMCPY_H_ > #define _RTE_MEMCPY_H_ > > +#ifdef __cplusplus > +extern "C" { > +#endif > + > /** > * @file > * > * Functions for vectorised implementation of memcpy(). > */ > > +#include <stdint.h> > +#include <string.h> I don't think those includes should go in a extern "C" { block. > + > /** > * Copy 16 bytes from one location to another using optimised > * instructions. The locations should not overlap. > @@ -35,8 +42,6 @@ rte_mov16(uint8_t *dst, const uint8_t *src); > static inline void > rte_mov32(uint8_t *dst, const uint8_t *src); > > -#ifdef __DOXYGEN__ > - This strange check was added as not all architectures provide rte_mov48 (/me slaps Adrien and Thomas). I think the CI reported no issue because of a problem in the next patch where all that is tested is RTE_USE_CC_MEMCPY = true combination. Still, the overall goal of this work is to drop the whole rte_memcpy thing in the future, so I think we can live with this #ifdef __DOXYGEN__ non sense hiding the absence of rte_mov48 in x86... > /** > * Copy 48 bytes from one location to another using optimised > * instructions. The locations should not overlap. > @@ -49,8 +54,6 @@ rte_mov32(uint8_t *dst, const uint8_t *src); > static inline void > rte_mov48(uint8_t *dst, const uint8_t *src); > > -#endif /* __DOXYGEN__ */ > - > /** > * Copy 64 bytes from one location to another using optimised > * instructions. The locations should not overlap. > @@ -87,8 +90,6 @@ rte_mov128(uint8_t *dst, const uint8_t *src); > static inline void > rte_mov256(uint8_t *dst, const uint8_t *src); > > -#ifdef __DOXYGEN__ > - > /** > * Copy bytes from one location to another. The locations must not overlap. > * > @@ -111,6 +112,52 @@ rte_mov256(uint8_t *dst, const uint8_t *src); > static void * > rte_memcpy(void *dst, const void *src, size_t n); > > -#endif /* __DOXYGEN__ */ Removing this DOXYGEN here should be ok. CI will tell us. > diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build > index 52d2f8e969..09c2fe2485 100644 > --- a/lib/eal/x86/include/meson.build > +++ b/lib/eal/x86/include/meson.build > @@ -16,6 +16,7 @@ arch_headers = files( > 'rte_spinlock.h', > 'rte_vect.h', > ) > + Unrelated change. > arch_indirect_headers = files( > 'rte_atomic_32.h', > 'rte_atomic_64.h', -- David Marchand