> > Subject: RE: [PATCH v3 6/8] stack: add C11 atomic implementation
> >
> > [snip]
> >
> > > > +static __rte_always_inline void
> > > > +__rte_stack_lf_push(struct rte_stack_lf_list *list,
> > > > +                   struct rte_stack_lf_elem *first,
> > > > +                   struct rte_stack_lf_elem *last,
> > > > +                   unsigned int num)
> > > > +{
> > > > +#ifndef RTE_ARCH_X86_64
> > > > +       RTE_SET_USED(first);
> > > > +       RTE_SET_USED(last);
> > > > +       RTE_SET_USED(list);
> > > > +       RTE_SET_USED(num);
> > > > +#else
> > > > +       struct rte_stack_lf_head old_head;
> > > > +       int success;
> > > > +
> > > > +       old_head = list->head;
> > > This can be a torn read (same as you have mentioned in
> > > __rte_stack_lf_pop). I suggest we use acquire thread fence here as
> > > well (please see the comments in __rte_stack_lf_pop).
> >
> > Agreed. I'll add the acquire fence.
> >
> 
> On second thought, an acquire fence isn't necessary. The acquire fence in
> __rte_stack_lf_pop() ensures the list->head is ordered before the list element
> reads. That isn't necessary here; we need to ensure that the last->next write
> occurs (and is observed) before the list->head write, which the CAS's RELEASE
> success memorder accomplishes.
> 
> If a torn read occurs, the CAS will fail and will atomically re-load 
> &old_head.

Following is my understanding:
The general guideline is there should be a load-acquire for every 
store-release. In both xxx_lf_pop and xxx_lf_push, the head is store-released, 
hence the load of the head should be load-acquire.
>From the code (for ex: in function _xxx_lf_push), you can notice that there is 
>dependency from 'old_head to new_head to list->head(in compare_exchange)'. 
>When such a dependency exists, if the memory orderings have to be avoided, one 
>needs to use __ATOMIC_CONSUME. Currently, the compilers will use a stronger 
>memory order (which is __ATOMIC_ACQUIRE) as __ATOMIC_CONSUME is not well 
>defined. Please refer to [1] and [2] for more info.

IMO, since, for 128b, we do not have a pure load-acquire, I suggest we use 
thread_fence with acquire semantics. It is a heavier barrier, but I think it is 
a safer code which will adhere to C11 memory model.

[1] https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
[2] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html

Reply via email to