Hi Konstantin,
> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.anan...@intel.com>
> Sent: Thursday, October 24, 2019 6:21 PM
> To: Gavin Hu (Arm Technology China) <gavin...@arm.com>; dev@dpdk.org
> Cc: nd <n...@arm.com>; tho...@monjalon.net; step...@networkplumber.org;
> hemant.agra...@nxp.com; jer...@marvell.com; pbhagavat...@marvell.com;
> Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Ruifeng Wang
> (Arm Technology China) <ruifeng.w...@arm.com>; Phil Yang (Arm Technology
> China) <phil.y...@arm.com>; Steve Capper <steve.cap...@arm.com>;
> nd <n...@arm.com>; nd <n...@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
>
> Hi Gavin,
>
> > > > > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > > > > 'polling for a memory location to become equal to a given value'.
> > > > > >
> > > > > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > > > > by default. When it is enabled, the above APIs will call the WFE
> > > > > > instruction to save CPU cycles and power.
> > > > > >
> > > > > > Signed-off-by: Gavin Hu <gavin...@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> > > > > > Reviewed-by: Steve Capper <steve.cap...@arm.com>
> > > > > > Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
> > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > > Reviewed-by: Phil Yang <phil.y...@arm.com>
> > > > > > Acked-by: Pavan Nikhilesh <pbhagavat...@marvell.com>
> > > > > > ---
> > > > > >  config/arm/meson.build                             |   1 +
> > > > > >  config/common_base                                 |   5 +
> > > > > >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> > > > > >  lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
> > > > > >  4 files changed, 142 insertions(+)
> > > > > >
> > > > > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > > > > index 979018e..b4b4cac 100644
> > > > > > --- a/config/arm/meson.build
> > > > > > +++ b/config/arm/meson.build
> > > > > > @@ -26,6 +26,7 @@ flags_common_default = [
> > > > > >  	['RTE_LIBRTE_AVP_PMD', false],
> > > > > >
> > > > > >  	['RTE_SCHED_VECTOR', false],
> > > > > > +	['RTE_ARM_USE_WFE', false],
> > > > > >  ]
> > > > > >
> > > > > >  flags_generic = [
> > > > > > diff --git a/config/common_base b/config/common_base
> > > > > > index 8ef75c2..8861713 100644
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > > > > >  CONFIG_RTE_MALLOC_DEBUG=n
> > > > > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > > > > >  CONFIG_RTE_USE_LIBBSD=n
> > > > > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > > > > +# calling these APIs put the cores in low power state while waiting
> > > > > > +# for the memory address to become equal to the expected value.
> > > > > > +# This is supported only by aarch64.
> > > > > > +CONFIG_RTE_ARM_USE_WFE=n
> > > > > >
> > > > > >  #
> > > > > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > > > > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > > index 93895d3..dabde17 100644
> > > > > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > > @@ -1,5 +1,6 @@
> > > > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > > > >   * Copyright(c) 2017 Cavium, Inc
> > > > > > + * Copyright(c) 2019 Arm Limited
> > > > > >   */
> > > > > >
> > > > > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > > > > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> > > > > >  	asm volatile("yield" ::: "memory");
> > > > > >  }
> > > > > >
> > > > > > +#ifdef RTE_ARM_USE_WFE
> > > > > > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > > > > > +static __rte_always_inline void \
> > > > > > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > > > > > +{ \
> > > > > > +	type tmp; \
> > > > > > +	asm volatile( \
> > > > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > > +		"cmp %" #wide "[tmp], %" #wide "[expected]\n" \
> > > > > > +		"b.eq 2f\n" \
> > > > > > +		"sevl\n" \
> > > > > > +		"1: wfe\n" \
> > > > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > > +		"cmp %" #wide "[tmp], %" #wide "[expected]\n" \
> > > > > > +		"bne 1b\n" \
> > > > > > +		"2:\n" \
> > > > > > +		: [tmp] "=&r" (tmp) \
> > > > > > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > > > > > +		: "cc", "memory"); \
> > > > > > +}
> > > >
> > > > One more thought:
> > > > Why do you need to write asm code for the whole procedure?
> > > > Why not do it like the Linux kernel:
> > > > define wfe() and sev() macros and use them inside normal C code?
> > > >
> > > > #define sev()	asm volatile("sev" : : : "memory")
> > > > #define wfe()	asm volatile("wfe" : : : "memory")
> > > >
> > > > Then:
> > > > rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> > > > {
> > > > 	if (__atomic_load_n(addr, memorder) != expected) {
> > > > 		sev();
> > > > 		do {
> > > > 			wfe();
> > > > 		} while (__atomic_load_n(addr, memorder) != expected);
> > > > 	}
> > > > }
> > > >
> > > > ?
> > > A really good suggestion, I made corresponding changes to v8 already, but
> > > it missed an armv8-specific feature after internal discussion.
> > > We call wfe to wait/sleep on the 'monitored' address; it will be woken up
> > > when someone writes to the monitored address, so before wfe we have to
> > > call a load-exclusive instruction to set up the 'monitor'.
> > > __atomic_load_n - disassembled to "ldr" - does not do so. We have to use
> > > "ldxrh" for relaxed memory ordering and "ldaxrh" for acquire ordering, in
> > > the 16-bit example.
> > Didn't realize that, sorry for the confusion caused...
Your comments are really helpful! Although we missed this point, anyway it
helped to make the patches in better shape (I personally like the new v9 more
than v7 😊), really appreciate it, thanks!
/Gavin
> > > Let me re-think, coming back to the full assembly procedure or
> > > implementing a 'load-exclusive' function. What do you think?
> After some thought I am leaning towards the 'load-exclusive' function -
> hopefully it would help you avoid raw asm here and in other places.
> What do you think?
> Konstantin
Yes, I implemented a 'load-exclusive' function in v9, please have a review,
thanks! Currently I did not give it the 'rte_' prefix as it is not used in
other places than the rte_wait_until_equal APIs. Any more comments are welcome!
/Gavin
> >
> > /Gavin
> > Forgot to mention, the kernel uses wfe() without preceding load-exclusive
> > instructions because:
> > 1) it relies on the timer to wake up, i.e.
> > __delay()
> > 2) explicitly calling sev to send wake events, for all kinds of locks
> > 3) IPI instructions.
> >
> > Our patches can't count on these events, due to lack of these events or
> > performance impact.
> > /Gavin
> > >
> > > > > > +/* Wait for *addr to be updated with expected value */
> > > > > > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > > > > > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > > > > > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > > > > > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > > > > > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > > > > > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > > > > > +#endif
> > > > > > +
> > > > > >  #ifdef __cplusplus
> > > > > >  }
> > > > > >  #endif