Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins

Gavin Hu (Arm Technology China) Fri, 11 Jan 2019 05:53:12 -0800


> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jer...@marvell.com>
> Sent: Thursday, December 27, 2018 3:42 PM
> To: Gavin Hu (Arm Technology China) <gavin...@arm.com>;
> dev@dpdk.org
> Cc: david.march...@redhat.com; chao...@linux.vnet.ibm.com; nd
> <n...@arm.com>; bruce.richard...@intel.com; tho...@monjalon.net;
> hemant.agra...@nxp.com; step...@networkplumber.org; Honnappa
> Nagarahalli <honnappa.nagaraha...@arm.com>
> Subject: Re: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-
> way barrier builtins
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory
> > barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way
> > barriers.
> >
> > Here is the assembly code of __sync_compare_and_swap builtin.
> > __sync_bool_compare_and_swap(dst, exp, src);
> >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > <rte_atomic16_cmpset+52>  // b.any
> >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > <rte_atomic16_cmpset+32>
> >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > none
> >
> > The benchmarking results showed 3X performance gain on Cavium
> > ThunderX2 and
> > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > Here is the example test result on TX2:
> >
> > *** spinlock_autotest without this patch ***
> > Core [123] Cost Time = 639822 us
> > Core [124] Cost Time = 633253 us
> > Core [125] Cost Time = 646030 us
> > Core [126] Cost Time = 643189 us
> > Core [127] Cost Time = 647039 us
> > Total Cost Time = 95433298 us
> >
> > *** spinlock_autotest with this patch ***
> > Core [123] Cost Time = 163615 us
> > Core [124] Cost Time = 166471 us
> > Core [125] Cost Time = 189044 us
> > Core [126] Cost Time = 195745 us
> > Core [127] Cost Time = 78423 us
> > Total Cost Time = 27339656 us
> >
> > Signed-off-by: Gavin Hu <gavin...@arm.com>
> > Reviewed-by: Phil Yang <phil.y...@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
> > Reviewed-by: Steve Capper <steve.cap...@arm.com>
> > ---
> >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc31e..87ae7a4f1 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_lock(rte_spinlock_t *sl)
> >  {
> > -   while (__sync_lock_test_and_set(&sl->locked, 1))
> > -           while(sl->locked)
> > +   int exp = 0;
> > +
> > +   while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > +                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
> 
> How about remove explict exp = 0 and change to
> __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);


Yes, __atomic_test_and_set means simpler code and better, but 
__atomic_test_and_set takes the first argument as a pointer to type bool or 
char, in our case, sl->locked is of type uint32. 
We can force it to uint8, or just pass in the 32bit pointer, only one byte/bit 
is really used in this case, is that ok? 

"It should be only used for operands of type bool or char. For other types only 
part of the value may be set."
https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-Builtins.html

From performance perspective, in our testing, the performance was very close, 
compared to __atomic.

> 
> i.e
> while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> 
> 
> 
> > +           while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> >                     rte_pause();
> > +           exp = 0;
> 
> We can remove exp = 0 with above scheme.
> 
> > +   }
> >  }
> >  #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_unlock (rte_spinlock_t *sl)
> >  {
> > -   __sync_lock_release(&sl->locked);
> > +   __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>  }
> >  #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> >  static inline int
> >  rte_spinlock_trylock (rte_spinlock_t *sl)
> >  {
> > -   return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > +   int exp = 0;
> > +   return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > +                           0, /* disallow spurious failure */
> > +                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> Here to remove explicit exp.
> 
> return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> 
> 
> >  }
> >  #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> >   */
> >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> >  {
> > -   return sl->locked;
> > +   return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> 
> __ATOMIC_RELAXED would be enough here. Right ?
> 
> 
> >  }
> >
> >  /**

Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins

Reply via email to