Re: [dpdk-dev] [PATCH v4] arch/arm: optimization for memcpy on AArch64

Herbert Guan Thu, 04 Jan 2018 02:24:06 -0800

Thanks for the review and comments, Jerin.  A new version has been sent out for 
review with your comments applied and your Acked-by added.


Best regards,
Herbert

> -----Original Message-----
> From: Jerin Jacob [mailto:[email protected]]
> Sent: Wednesday, January 3, 2018 21:35
> To: Herbert Guan <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH v4] arch/arm: optimization for memcpy on AArch64
> 
> -----Original Message-----
> > Date: Thu, 21 Dec 2017 13:33:47 +0800
> > From: Herbert Guan <[email protected]>
> > To: [email protected], [email protected]
> > CC: Herbert Guan <[email protected]>
> > Subject: [PATCH v4] arch/arm: optimization for memcpy on AArch64
> > X-Mailer: git-send-email 1.8.3.1
> >
> > This patch provides an option to do rte_memcpy() using 'restrict'
> > qualifier, which can induce GCC to do optimizations by using more
> > efficient instructions, providing some performance gain over memcpy()
> > on some AArch64 platforms/environments.
> >
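To illustrate the 'restrict' point above, here is a minimal sketch. It is not code from the patch, and the name sketch_copy32 is hypothetical. With both pointers qualified as 'restrict', the compiler may assume the two buffers do not overlap, so it is free to schedule all loads ahead of the stores.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy exactly 32 bytes.
 * 'restrict' promises the compiler that dst and src do not overlap,
 * so the loads may be issued before any of the stores.
 */
static inline void
sketch_copy32(uint8_t *restrict dst, const uint8_t *restrict src)
{
	uint64_t w[4];

	memcpy(w, src, sizeof(w));	/* load all 32 bytes into temporaries */
	memcpy(dst, w, sizeof(w));	/* then store them to the destination */
}
```

The fixed-size memcpy() calls are used here only to keep the sketch free of alignment and aliasing assumptions; the patch itself copies through __uint128_t pointers.
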
> > The memory copy performance differs between different AArch64
> > platforms. And a more recent glibc (e.g. 2.23 or later)
> > can provide a better memcpy() performance compared to old glibc
> > versions. It's always suggested to use a more recent glibc if
> > possible, since the entire system benefits from it. If for some
> > reason an old glibc has to be used, this patch is provided as an
> > alternative.
> >
> > This implementation can improve memory copy on some AArch64
> > platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> > It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> > defined to activate. It does not always provide better performance
> > than memcpy(), so users need to run the DPDK unit test
> > "memcpy_perf_autotest" and customize parameters in "customization
> > section" in rte_memcpy_64.h for best performance.
> >
> > Compiler version will also impact the rte_memcpy() performance.
> > It's observed on some platforms and with the same code, GCC 7.2.0
> > compiled binary can provide better performance than GCC 4.8.5. It's
> > suggested to use GCC 5.4.0 or later.
> >
> > Signed-off-by: Herbert Guan <[email protected]>
> 
> Looks good. Find inline request for some minor changes.
> Feel free to add my Acked-by with those changes.
> 
> 
> > ---
> >  config/common_armv8a_linuxapp                      |   6 +
> >  .../common/include/arch/arm/rte_memcpy_64.h        | 287 +++++++++++++++++++++
> >  2 files changed, 293 insertions(+)
> >
> > diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> > index 6732d1e..8f0cbed 100644
> > --- a/config/common_armv8a_linuxapp
> > +++ b/config/common_armv8a_linuxapp
> > @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
> >  # to address minimum DMA alignment across all arm64 implementations.
> >  CONFIG_RTE_CACHE_LINE_SIZE=128
> >
> > +# Accelerate rte_memcpy.  Be sure to run unit test to determine the
> > +# best threshold in code.  Refer to notes in source file
> > +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> > +# info.
> > +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> > +
> >  CONFIG_RTE_LIBRTE_FM10K_PMD=n
> >  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> >  CONFIG_RTE_LIBRTE_AVP_PMD=n
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > index b80d8ba..b269f34 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > @@ -42,6 +42,291 @@
> >
> >  #include "generic/rte_memcpy.h"
> >
> > +#ifdef RTE_ARCH_ARM64_MEMCPY
> > +#include <rte_common.h>
> > +#include <rte_branch_prediction.h>
> > +
> > +/*
> > + * The memory copy performance differs on different AArch64 micro-architectures.
> > + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> > + * performance compared to old glibc versions. It's always suggested to use a
> > + * more recent glibc if possible, from which the entire system can get benefit.
> > + *
> > + * This implementation improves memory copy on some aarch64 micro-architectures,
> > + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> > + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> > + * always providing better performance than memcpy() so users need to run unit
> > + * test "memcpy_perf_autotest" and customize parameters in customization section
> > + * below for best performance.
> > + *
> > + * Compiler version will also impact the rte_memcpy() performance. It's observed
> > + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> > + * provide better performance than GCC 4.8.5 compiled binaries.
> > + */
> > +
> > +/**************************************
> > + * Beginning of customization section
> > + **************************************/
> > +#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F
> > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > +/* Only src unalignment will be treated as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) \
> 
> Better to change this to RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY, as it is
> defined in a public DPDK header file.
> 
> 
> > +   ((uintptr_t)(dst) & RTE_ARM64_MEMCPY_ALIGN_MASK)
> > +#else
> > +/* Both dst and src unalignment will be treated as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) \
> > +   (((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_ARM64_MEMCPY_ALIGN_MASK)
> 
> Same as above
> 
> > +#endif
> > +
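The strict-alignment check above can be sketched as a plain function. This is only an illustration: the sketch_ names are hypothetical, and only the mask value 0x0F is taken from the patch. ORing the two addresses first means a single mask test catches misalignment in either pointer.

```c
#include <stdint.h>

/* Same value as RTE_ARM64_MEMCPY_ALIGN_MASK in the patch (16-byte alignment). */
#define SKETCH_ALIGN_MASK 0x0F

/*
 * Hypothetical helper: returns 1 if either pointer is not 16-byte aligned.
 * A low-order bit survives the OR if it is set in dst, in src, or in both.
 */
static inline int
sketch_is_unaligned_strict(const void *dst, const void *src)
{
	return (((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGN_MASK) != 0;
}
```
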
> > +
> > +/*
> > + * If copy size is larger than threshold, memcpy() will be used.
> > + * Run "memcpy_perf_autotest" to determine the proper threshold.
> > + */
> > +#define RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > +#define RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > +
> > +/*
> > + * The logic of USE_RTE_MEMCPY() can also be modified to best fit platform.
> > + */
> > +#define USE_RTE_MEMCPY(dst, src, n) \
> > +((!IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD) \
> > +|| (IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD))
> > +
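The decision made by USE_RTE_MEMCPY() can be sketched as a function, which may be easier to follow when tuning the thresholds. All sketch_ names are hypothetical, the threshold values (2048 and 512) are assumptions chosen purely for illustration (the patch defaults both to 0xffffffff), and the strict-alignment variant of the check is used.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALIGN_MASK          0x0F
#define SKETCH_ALIGNED_THRESHOLD   ((size_t)2048)	/* assumed value */
#define SKETCH_UNALIGNED_THRESHOLD ((size_t)512)	/* assumed value */

/*
 * Hypothetical helper: returns 1 when the copy is small enough for the
 * inline copy path, 0 when it should fall back to libc memcpy().
 */
static inline int
sketch_use_rte_memcpy(const void *dst, const void *src, size_t n)
{
	int unaligned =
		(((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGN_MASK) != 0;
	size_t limit = unaligned ? SKETCH_UNALIGNED_THRESHOLD
				 : SKETCH_ALIGNED_THRESHOLD;

	return n <= limit;
}
```

Taking n as a function parameter also avoids the operator-precedence surprises that an unparenthesized macro argument can cause.
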
> > +
> > +/**************************************
> > + * End of customization section
> > + **************************************/
> > +#if defined(RTE_TOOLCHAIN_GCC) && !defined(RTE_AARCH64_SKIP_GCC_VERSION_CHECK)
> 
> To maintain consistency
> s/RTE_AARCH64_SKIP_GCC_VERSION_CHECK/RTE_ARM64_MEMCPY_SKIP_GCC_VERSION_CHECK
> 
> > +#if (GCC_VERSION < 50400)
> > +#warning "The GCC version is quite old, which may result in sub-optimal \
> > +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> > +be used."
> > +#endif
> > +#endif
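The constant 50400 in the check above reads as GCC 5.4.0 if the version is encoded as major * 10000 + minor * 100 + patchlevel. That encoding is an assumption about how GCC_VERSION is defined, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```c
/*
 * Hypothetical helper: encode a compiler version the way the
 * "GCC_VERSION < 50400" comparison appears to expect.
 */
static inline int
sketch_encode_version(int major, int minor, int patch)
{
	return major * 10000 + minor * 100 + patch;
}
```

Under this encoding, GCC 4.8.5 falls below the 5.4.0 cut-off and would trigger the warning, while GCC 7.2.0 would not.
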
> > +
> > +static __rte_always_inline void rte_mov16(uint8_t *dst, const uint8_t *src)
> 
> static __rte_always_inline
> void rte_mov16(uint8_t *dst, const uint8_t *src)
> 
> > +{
> > +   __uint128_t *dst128 = (__uint128_t *)dst;
> > +   const __uint128_t *src128 = (const __uint128_t *)src;
> > +   *dst128 = *src128;
> > +}
> > +
> > +static __rte_always_inline void rte_mov32(uint8_t *dst, const uint8_t *src)
> 
> See above
> 
> > +{
> > +   __uint128_t *dst128 = (__uint128_t *)dst;
> > +   const __uint128_t *src128 = (const __uint128_t *)src;
> > +   const __uint128_t x0 = src128[0], x1 = src128[1];
> > +   dst128[0] = x0;
> > +   dst128[1] = x1;
> > +}
> > +
> > +static __rte_always_inline void rte_mov48(uint8_t *dst, const uint8_t *src)
> > +{
> 
> See above
> 
> > +   __uint128_t *dst128 = (__uint128_t *)dst;
> > +   const __uint128_t *src128 = (const __uint128_t *)src;
> > +   const __uint128_t x0 = src128[0], x1 = src128[1], x2 = src128[2];
> > +   dst128[0] = x0;
> > +   dst128[1] = x1;
> > +   dst128[2] = x2;
> > +}
> > +
> > +static __rte_always_inline void rte_mov64(uint8_t *dst, const uint8_t *src)
> > +{
> 
> See above
> 
> > +   __uint128_t *dst128 = (__uint128_t *)dst;
> > +   const __uint128_t *src128 = (const __uint128_t *)src;
> > +   const __uint128_t
> > +           x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> > +   dst128[0] = x0;
> > +   dst128[1] = x1;
> > +   dst128[2] = x2;
> > +   dst128[3] = x3;
> > +}
> > +
> > +static __rte_always_inline void rte_mov128(uint8_t *dst, const uint8_t *src)
> > +{
> 
> See above
> 
> > +   __uint128_t *dst128 = (__uint128_t *)dst;
> > +   const __uint128_t *src128 = (const __uint128_t *)src;
> > +   /* Keep below declaration & copy sequence for optimized instructions */
> > +   const __uint128_t
> > +           x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> > +   dst128[0] = x0;
> > +   __uint128_t x4 = src128[4];
> > +   dst128[1] = x1;
> > +   __uint128_t x5 = src128[5];
> > +   dst128[2] = x2;
> > +   __uint128_t x6 = src128[6];
> > +   dst128[3] = x3;
> > +   __uint128_t x7 = src128[7];
> > +   dst128[4] = x4;
> > +   dst128[5] = x5;
> > +   dst128[6] = x6;
> > +   dst128[7] = x7;
> > +}
> > +
> > +static __rte_always_inline void rte_mov256(uint8_t *dst, const uint8_t *src)
> > +{
> 
> See above
