Hi Gavin,
> > > > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > > > 'polling for a memory location to become equal to a given value'.
> > > > >
> > > > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > > > by default. When it is enabled, the above APIs will execute the WFE
> > > > > instruction to save CPU cycles and power.
> > > > >
> > > > > Signed-off-by: Gavin Hu <gavin...@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> > > > > Reviewed-by: Steve Capper <steve.cap...@arm.com>
> > > > > Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
> > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > Reviewed-by: Phil Yang <phil.y...@arm.com>
> > > > > Acked-by: Pavan Nikhilesh <pbhagavat...@marvell.com>
> > > > > ---
> > > > >  config/arm/meson.build                             |   1 +
> > > > >  config/common_base                                 |   5 +
> > > > >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> > > > >  lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
> > > > >  4 files changed, 142 insertions(+)
> > > > >
> > > > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > > > index 979018e..b4b4cac 100644
> > > > > --- a/config/arm/meson.build
> > > > > +++ b/config/arm/meson.build
> > > > > @@ -26,6 +26,7 @@ flags_common_default = [
> > > > >       ['RTE_LIBRTE_AVP_PMD', false],
> > > > >
> > > > >       ['RTE_SCHED_VECTOR', false],
> > > > > +     ['RTE_ARM_USE_WFE', false],
> > > > >  ]
> > > > >
> > > > >  flags_generic = [
> > > > > diff --git a/config/common_base b/config/common_base
> > > > > index 8ef75c2..8861713 100644
> > > > > --- a/config/common_base
> > > > > +++ b/config/common_base
> > > > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > > > >  CONFIG_RTE_MALLOC_DEBUG=n
> > > > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > > > >  CONFIG_RTE_USE_LIBBSD=n
> > > > > +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
> > > > > +# Calling these APIs puts a core in a low power state while it waits
> > > > > +# for the memory address to become equal to the expected value.
> > > > > +# This is supported only by aarch64.
> > > > > +CONFIG_RTE_ARM_USE_WFE=n
> > > > >
> > > > >  #
> > > > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > > > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > index 93895d3..dabde17 100644
> > > > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > @@ -1,5 +1,6 @@
> > > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > > >   * Copyright(c) 2017 Cavium, Inc
> > > > > + * Copyright(c) 2019 Arm Limited
> > > > >   */
> > > > >
> > > > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > > > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> > > > >       asm volatile("yield" ::: "memory");
> > > > >  }
> > > > >
> > > > > +#ifdef RTE_ARM_USE_WFE
> > > > > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > > > > +static __rte_always_inline void \
> > > > > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > > > > +{ \
> > > > > +     type tmp; \
> > > > > +     asm volatile( \
> > > > > +             #asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > +             "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > > > > +             "b.eq   2f\n" \
> > > > > +             "sevl\n" \
> > > > > +             "1:     wfe\n" \
> > > > > +             #asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > +             "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > > > > +             "bne    1b\n" \
> > > > > +             "2:\n" \
> > > > > +             : [tmp] "=&r" (tmp) \
> > > > > +             : [addr] "Q"(*addr), [expected] "r"(expected) \
> > > > > +             : "cc", "memory"); \
> > > > > +}
> > >
> > > One more thought:
> > > Why do you need to write asm code for the whole procedure?
> > > Why not to do like linux kernel:
> > > define wfe() and sev() macros and use them inside normal C code?
> > >
> > > #define sev()             asm volatile("sev" : : : "memory")
> > > #define wfe()             asm volatile("wfe" : : : "memory")
> > >
> > > Then:
> > > static inline void
> > > rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > >                         int memorder)
> > > {
> > >      if (__atomic_load_n(addr, memorder) != expected) {
> > >          sev();
> > >          do {
> > >              wfe();
> > >          } while (__atomic_load_n(addr, memorder) != expected);
> > >      }
> > > }
> > >
> > > ?
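[For reference, the sketch above can be rendered compilable roughly as below. This is only an illustration: the `sev()`/`wfe()` macros are aarch64-specific, so a no-op fallback is used on other targets purely so the snippet builds, where the loop then degenerates to a plain busy-wait.]

```c
#include <stdint.h>

/* aarch64 event instructions; elsewhere compile to nothing so this
 * sketch stays buildable (the wait then becomes a spin loop). */
#ifdef __aarch64__
#define sev() asm volatile("sev" : : : "memory")
#define wfe() asm volatile("wfe" : : : "memory")
#else
#define sev() do { } while (0)
#define wfe() do { } while (0)
#endif

/* Illustrative rendering of the proposed API: wait (sleeping via WFE
 * on aarch64) until *addr == expected under the given memory order. */
static inline void
rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
			int memorder)
{
	if (__atomic_load_n(addr, memorder) != expected) {
		sev();
		do {
			wfe();
		} while (__atomic_load_n(addr, memorder) != expected);
	}
}
```

[Note that on aarch64 this still behaves differently from the original asm version, because __atomic_load_n compiles to a plain "ldr" and does not arm the exclusive monitor before wfe, as the discussion below explains.]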
> > A really good suggestion. I made corresponding changes to v8 already, but
> > after internal discussion it turned out to miss an armv8-specific detail.
> > We call WFE to wait/sleep on the 'monitored' address; the core is woken up
> > when someone writes to the monitored address, so before WFE we have to
> > execute a load-exclusive instruction to arm the 'monitor'.
> > __atomic_load_n, which disassembles to a plain "ldr", does not do so. We
> > have to use "ldxrh" for relaxed memory ordering and "ldaxrh" for acquire
> > ordering, in the 16-bit example.

Didn't realize that, sorry for the confusion caused...

> >
> > Let me re-think whether to come back to the full assembly procedure or to
> > implement a 'load-exclusive' function. What do you think?

After some thought I am leaning towards the 'load-exclusive' function -
hopefully it would help you avoid raw asm here and in other places.
What do you think?
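Just to illustrate what I mean - a rough sketch of such a 16-bit load-exclusive helper. The name is hypothetical (not from the patch), and on non-aarch64 targets a plain atomic load stands in so the snippet compiles:

```c
#include <stdint.h>

/* Sketch of a 16-bit load-exclusive helper: on aarch64 it arms the
 * exclusive monitor, so a later WFE will be woken by a store to *addr.
 * On other targets it falls back to a plain atomic load; the name
 * __load_ex_16 is illustrative only. */
static inline uint16_t
__load_ex_16(volatile uint16_t *addr, int memorder)
{
	uint16_t tmp;
#ifdef __aarch64__
	if (memorder == __ATOMIC_ACQUIRE)
		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
			     : [tmp] "=&r" (tmp)
			     : [addr] "r" (addr)
			     : "memory");
	else
		asm volatile("ldxrh %w[tmp], [%x[addr]]"
			     : [tmp] "=&r" (tmp)
			     : [addr] "r" (addr)
			     : "memory");
#else
	tmp = __atomic_load_n(addr, memorder);
#endif
	return tmp;
}
```

A C-level wait loop could then call this helper to arm the monitor before each wfe, instead of open-coding the whole procedure in asm.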
Konstantin

> > /Gavin
> Forgot to mention, the kernel uses wfe() without preceding load-exclusive
> instructions because it can rely on other wakeup events:
> 1) the timer interrupt to wake up, i.e. in __delay();
> 2) explicit sev calls to send wakeup events, for all kinds of locks;
> 3) IPI instructions.
> 
> Our patches can't count on these, either because the events are absent or
> because generating them would hurt performance.
> /Gavin
> >
> > > > > +/* Wait for *addr to be updated with expected value */
> > > > > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > > > > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > > > > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > > > > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > > > > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > > > > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > > > > +#endif
> > > > > +
> > > > >  #ifdef __cplusplus
> > > > >  }
> > > > >  #endif
