Hi Diogo,


Thanks for your explanation.



As documented in the Arm ARM <https://developer.arm.com/documentation/ddi0487/fc>,
section B2.9.5 "Load-Exclusive and Store-Exclusive instruction usage restrictions":

" Between the Load-Exclusive and the Store-Exclusive, there are no explicit 
memory accesses, preloads,

direct or indirect System register writes, address translation instructions, 
cache or TLB maintenance

instructions, exception generating instructions, exception returns, or indirect 
branches."



So it is not allowed to insert (1) and (4) between (2) and (3). The cmpxchg
operation is atomic.
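
For reference, here is a simplified sketch of rte_mcslock_lock() with those
operations labeled. This is not the verbatim DPDK source, only the shape of
the code under discussion; the ACQ_REL exchange on the tail pointer is what
expands to the LDAXR/STLXR pair on armv8.0-a (or SWPAL with LSE):

static inline void
rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
{
        rte_mcslock_t *prev;

        __atomic_store_n(&me->locked, 1, __ATOMIC_RELAXED);   /* (1) init node */
        __atomic_store_n(&me->next, NULL, __ATOMIC_RELAXED);

        /* (2) + (3): the exclusive pair lives inside this single RMW */
        prev = __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL);
        if (likely(prev == NULL))
                return;                 /* no predecessor, lock taken */

        __atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);  /* (4) announce */

        /* wait for the predecessor to hand the lock over */
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
                rte_pause();
}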



Thanks,

Phil Yang



> -----Original Message-----

> From: Diogo Behrens <diogo.behr...@huawei.com>

> Sent: Thursday, August 27, 2020 4:57 PM

> To: Phil Yang <phil.y...@arm.com>

> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli

> <honnappa.nagaraha...@arm.com>

> Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory

>

> Hi Phil,

>

> thanks for taking a look at this. Indeed, the cmpxchg will have release

> semantics, but implicit barriers do not provide the same ordering guarantees

> as DMB ISH, i.e., atomic_thread_fence(__ATOMIC_ACQ_REL).

>

> Let me try to explain the scenario. Here is a pseudo-ARM assembly with the

> instructions involved:

>

> STR me->locked  = 1   // 1: initialize the node

> LDAXR pred = tail  // 2: cmpxchg

> STLXR tail = me    // 3: cmpxchg

> STR pred->next = me // 4: the problematic store

>

> Now, ARM allows instruction 1 and 2 to be reordered, and also instructions 3

> and 4 to be reordered:

>

> LDAXR pred = tail  // 2: cmpxchg  (acquires can be moved up)

> STR me->locked = 1   // 1: initialize the node

> STR pred->next = me // 4: the problematic store

> STLXR tail = me    // 3: cmpxchg (releases can be moved down)

>

> And since stores 1 and 4 can be reordered too, the following execution is

> possible:

>

> LDAXR pred = tail  // 2: cmpxchg

> STR pred->next = me // 4: the problematic store

> STR me->locked = 1   // 1: initialize the node

> STLXR tail = me    // 3: cmpxchg

>

> In this subtle situation, the locking thread can overwrite the store to

> next->locked of the unlocking thread (if it occurs between 4 and 1), and

> subsequently hang waiting for me->locked to become 0.

>

> We couldn't reproduce this with our available hardware, but we

> "reproduced" it with the RMEM model checker, which precisely models the

> ARM memory model (https://github.com/rems-project/rmem). The GenMC

> model checker (https://github.com/MPI-SWS/genmc) also finds the issue,

> but its intermediate memory model is slightly more general than the ARM

> model.
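
> For illustration, here is the kind of minimal litmus-style program such a
> checker can explore. The names are hypothetical stand-ins for the node
> fields, the tail exchange is elided (its acquire/release does not order (1)
> against (4) anyway), and this is only a sketch of the window, not the test
> we actually ran:
>
>   #include <pthread.h>
>
>   static int locked;      /* stands in for the successor's me->locked          */
>   static int pred_next;   /* stands in for the predecessor's ->next (0 = NULL) */
>
>   static void *successor(void *arg)       /* thread acquiring the lock */
>   {
>           (void)arg;
>           __atomic_store_n(&locked, 1, __ATOMIC_RELAXED);     /* (1) init node */
>           __atomic_store_n(&pred_next, 1, __ATOMIC_RELAXED);  /* (4) announce  */
>           /* if (1) becomes visible after (4), the hand-over below is lost */
>           while (__atomic_load_n(&locked, __ATOMIC_ACQUIRE))
>                   ;
>           return NULL;
>   }
>
>   static void *holder(void *arg)          /* thread releasing the lock */
>   {
>           (void)arg;
>           while (!__atomic_load_n(&pred_next, __ATOMIC_ACQUIRE))
>                   ;                       /* wait for successor to announce */
>           __atomic_store_n(&locked, 0, __ATOMIC_RELEASE);     /* next->locked = 0 */
>           return NULL;
>   }
>
>   int main(void)
>   {
>           pthread_t t0, t1;
>           pthread_create(&t0, NULL, holder, NULL);
>           pthread_create(&t1, NULL, successor, NULL);
>           pthread_join(t1, NULL);         /* never returns in the bad execution */
>           pthread_join(t0, NULL);
>           return 0;
>   }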

>

> The release attached to store (4) prevents (1) from being reordered past it.
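
> Concretely, that is the store change in the patch below:
>
>   -       __atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
>   +       __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
>
> (A standalone __atomic_thread_fence(__ATOMIC_RELEASE) placed right before
> the relaxed store would also order (1) before (4); the patch instead
> attaches the release to the store itself.)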

>

> Please, let me know if that makes sense or if I am missing something.

>

> Thanks,

> -Diogo

>

> -----Original Message-----

> From: Phil Yang [mailto:phil.y...@arm.com]

> Sent: Wednesday, August 26, 2020 12:17 PM

> To: Diogo Behrens <diogo.behr...@huawei.com>

> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli

> <honnappa.nagaraha...@arm.com>; nd <n...@arm.com>

> Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory

>

> Diogo Behrens <diogo.behr...@huawei.com> writes:

>

> > Subject: [PATCH] librte_eal: fix mcslock hang on weak memory

> >

> >     The initialization me->locked=1 in lock() must happen before

> >     next->locked=0 in unlock(), otherwise a thread may hang forever,

> >     waiting for me->locked to become 0. On weak memory systems (such as ARMv8),

> >     the current implementation allows me->locked=1 to be reordered with

> >     announcing the node (pred->next=me) and, consequently, to be

> >     reordered with next->locked=0 in unlock().

> >

> >     This fix adds a release barrier to pred->next=me, forcing

> >     me->locked=1 to happen before this operation.

> >

> > Signed-off-by: Diogo Behrens <diogo.behr...@huawei.com>

> > ---

> >  lib/librte_eal/include/generic/rte_mcslock.h | 9 ++++++++-

> >  1 file changed, 8 insertions(+), 1 deletion(-)

> >

> > diff --git a/lib/librte_eal/include/generic/rte_mcslock.h

> > b/lib/librte_eal/include/generic/rte_mcslock.h

> > index 2bef28351..ce553f547 100644

> > --- a/lib/librte_eal/include/generic/rte_mcslock.h

> > +++ b/lib/librte_eal/include/generic/rte_mcslock.h

> > @@ -68,7 +68,14 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)

> >                           */

> >                          return;

> >          }

> > -       __atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);

> > +      /* The store to me->next above should also complete before the node

> > +      * is visible to the predecessor thread releasing the lock. Hence,

> > +      * the store to prev->next also requires release semantics. Note

> > +      * that, for example, on ARM, the release semantics in the exchange

> > +      * operation is not as strong as a release fence and is not

> > +      * sufficient to enforce the desired order here.

>

>

> Hi Diogo,

>

> I didn't quite understand why the exchange operation with ACQ_REL

> memory ordering is not sufficient.

> It emits 'stlxr' on armv8.0-a and 'swpal' on armv8.1-a (with LSE support).

> Both of these instructions have release semantics.

>

> Please check the URL for details.

> https://gcc.godbolt.org/z/Mc4373

>

> BTW, if you captured a deadlock issue on your testbed, could you please

> share your test case here?

>

> Thanks,

> Phil

>

>

> > +      */

> > +      __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);

> >

> >          /* The while-load of me->locked should not move above the previous

> >           * store to prev->next. Otherwise it will cause a deadlock. Need a

> > --

> > 2.28.0

> >

