> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, January 23, 2019 11:22 PM
> To: Eads, Gage <gage.e...@intel.com>; dev@dpdk.org
> Cc: olivier.m...@6wind.com; arybche...@solarflare.com; Richardson, Bruce
> <bruce.richard...@intel.com>; Ananyev, Konstantin
> <konstantin.anan...@intel.com>; nd <n...@arm.com>;
> chao...@linux.vnet.ibm.com; jer...@marvell.com; hemant.agra...@nxp.com;
> nd <n...@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
>
> > >
> > > Added other platform owners.
> > >
> > > > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > > > *v) } #endif
> > > > > > > >
> > > > > > > > +static inline int
> > > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t
> > > > > > > > +*exp, uint64_t
> > > > > > > > +*src) {
> > > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and
> > > > > > > 'src'
> > > > > > > should be pointers to 128b (__int128)? Or we could define
> > > > > > > our own data
> > > > > > type.
> > > > > >
> > > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > > __int128 originally because it fails to compile with
> > > > > > -pedantic, but on second thought (and with your suggestion of
> > > > > > a separate data type), we can resolve that with this typedef:
> > > > > >
> > > > > > typedef struct {
> > > > > > RTE_STD_C11 __int128 val; } rte_int128_t;
> > > > > ok
> > > > >
> > > > > >
> > > > > > > Since, it is a new API, can we define it with memory
> > > > > > > orderings which will be more conducive to relaxed memory
> > > > > > > ordering based
> > > > architectures?
> > > > > > > You can refer to [1] and [2] for guidance.
> > > > > >
> > > > > > I certainly see the value in controlling the operation's
> > > > > > memory ordering, like in the __atomic intrinsics, but I'm not
> > > > > > sure this patchset is the right place to address that. I see
> > > > > > that work going a couple
> > > > > ways:
> > > > > > 1. Expand the existing rte_atomicN_* interfaces with
> > > > > > additional arguments. In that case, I'd prefer this be done in
> > > > > > a separate patchset that addresses all the atomic operations,
> > > > > > not just cmpset, so the interface changes are chosen according
> > > > > > to the needs of the full set of atomic operations. If this
> > > > > > approach is taken then there's no need to solve this while
> > > > > > rte_atomic128_cmpset is experimental, since all the
> > > > > other functions are non-experimental anyway.
> > > > > >
> > > > > > - Or -
> > > > > >
> > > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or
> > > > > > their strongly ordered behavior), and instead create new
> > > > > > versions of them that take additional arguments. In this case,
> > > > > > we can implement
> > > > > > rte_atomic128_cmpset() as is and create a more flexible
> > > > > > version in a later
> > > > > patchset.
> > > > > >
> > > > > > Either way, I think the current interface (w.r.t. memory
> > > > > > ordering
> > > > > > options) can work and still leaves us in a good position for
> > > > > > future
> > > > > changes/improvements.
> > > > > >
> > > > > I do not see the need to modify/extend the existing
> > > > > rte_atomicN_* APIs as the corresponding __atomic intrinsics serve as
> replacements.
> > > > > I expect that at some point, DPDK code base will not be using
> > > > rte_atomicN_* APIs.
> > > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > > Hence
> > > >
> > > > I don't think that's correct. From the GCC docs:
> > > >
> > > > "16-byte integral types are also allowed if `__int128' (see
> > > > __int128) is supported by the architecture."
> > > >
> > > > This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.
> > > >
> > > > Source:
> > > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > > > Builtins.html
> > > (following is based on my reading, not based on experiments) My
> > > understanding is that the __atomic_load_n/store_n were implemented
> > > using a compare-and- swap HW instruction (due to lack of HW 128b
> > > atomic load and store instructions). This introduced the side effect
> > > or store/load respectively. Where as the user does not expect such
> > > side effects. As suggested in [1], it looks like GCC delegated the
> > > implementation to libatomic which 'it seems' uses locks to implement
> > > 128b __atomic intrinsics (needs to be verified)
> > >
> > > If __atomic intrinsics, for any of the supported platforms, do not
> > > have an optimal implementation, I prefer to add a DPDK API as an
> > > abstraction. A given platform can choose to implement such an API
> > > using __atomic intrinsics if it wants. The DPDK API can be similar
> > > to
> > __atomic_compare_exchange_n.
> > >
> >
> > Certainly. From the linked discussions, I see how this would affect
> > the design of (hypothetical functions) rte_atomic128_read() and
> > rte_atomic128_set(), but I don't see anything that suggests (for the
> > architectures being discussed) that __atomic_compare_exchange_n is
> suboptimal.
> I wrote some code and generated assembly to verify what is happening. On
> aarch64, this call is delegated to libatomic and libatomic needs to be
> linked. In
> the generated assembly, it is clear that it uses locks (pthread mutex lock) to
> provide atomicity for. For 32b and 64b the compiler generates the expected
> inline assembly. Both, ' __atomic_always_lock_free' and '
> __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are
> not
> lock free. (gcc - 8.2)
>
> Out of curiosity, I also did similar experiments on x86 (CPU E5-2660 v4). Even
> after using -mcx16 flag the call is delegated to libatomic. I see the 'lock
> cmpxchg16b' in the generated assembly. But, ' __atomic_always_lock_free' and
> ' __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics
> are
> NOT lock free with gcc version 7.3.0. However with gcc version 5.4.0, '
> __atomic_is_lock_free' returns 1. I found more discussion at [3]. However, I
> am
> not an expert on x86.
>
> [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
>
> These problems do not exist with 32b and 64b.
>
Thanks for investigating this. If GCC doesn't compile optimal code for aarch64
(i.e. LDXP+STXP or CASP) I don't think we have a choice but to use our own
implementation for 128-bit atomics, and it makes sense to model the interface
after the __atomic instrinsics as you suggested.
> >
> > > [1] https://patchwork.ozlabs.org/patch/721686/
> > > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> > >
> > > >
> > > > > DPDK needs to write its own. Since this is the first API in that
> > > > > regard, I prefer that we start with a signature that resembles
> > > > > __atomic intrinsics which have been proven to provide best
> > > > > flexibility for
> > > > all the platforms supported by DPDK.