> > > > > > > >>> In weak memory models, like arm64, reading the
> > > > > > > >>> {prod,cons}.tail may get reordered after reading or
> > > > > > > >>> writing the ring slots, which corrupts the ring so that
> > > > > > > >>> stale data is observed.
> > > > > > > >>>
> > > > > > > >>> This issue was reported by NXP on an 8-A72 DPAA2 board.
> > > > > > > >>> The problem is most likely caused by missing acquire
> > > > > > > >>> semantics when reading cons.tail (in SP enqueue) or
> > > > > > > >>> prod.tail (in SC dequeue), which makes it possible to
> > > > > > > >>> read a stale value from the ring slots.
> > > > > > > >>>
> > > > > > > >>> For the MP (and MC) case, rte_atomic32_cmpset() already
> > > > > > > >>> provides the required ordering. This patch prevents
> > > > > > > >>> reads and writes of the ring slots from being reordered
> > > > > > > >>> before the read of {prod,cons}.tail in the SP (and SC)
> > > > > > > >>> case.
> > > > > > > >>
> > > > > > > >> A read barrier, rte_smp_rmb(), is OK to prevent reads of
> > > > > > > >> the ring from being reordered before the read of the
> > > > > > > >> tail. However, to prevent *writes* to the ring from being
> > > > > > > >> reordered *before the read* of the tail you need a full
> > > > > > > >> memory barrier, i.e. rte_smp_mb().
> > > > > > > >
> > > > > > > > ISHLD (rte_smp_rmb() is DMB(ishld)) orders LD/LD and LD/ST,
> > > > > > > > while WMB (ST option) orders ST/ST.
> > > > > > > > For more details, please refer to Table B2-1, "Encoding of
> > > > > > > > the DMB and DSB <option> parameter", in
> > > > > > > > https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> > > > > > >
> > > > > > > I see. But you have to change the rte_smp_rmb() function
> > > > > > > definition in
> > > > > > > lib/librte_eal/common/include/generic/rte_atomic.h and ensure
> > > > > > > that all other architectures follow the same rules.
> > > > > > > Otherwise, this change is logically wrong, because a read
> > > > > > > barrier as currently defined cannot be used to order a load
> > > > > > > with a store.
> > > > > > >
> > > > > >
> > > > > > Good points, let me re-think how to handle other architectures.
> > > > > > A full MB is required for other architectures (x86? PPC?), but
> > > > > > for arm, a read barrier (load/store and load/load) is enough.
> > > > >
> > > > > For x86, I don't think you need any barrier here, as with the IA
> > > > > memory model:
> > > > > - Reads are not reordered with other reads.
> > > > > - Writes are not reordered with older reads.
> > > > Agree
> > >
> > > I understand herein no instruction level barriers are required for IA,
> > > but how about the compiler barrier: rte_compiler_barrier?
> > >
> > > >
> > > > >
> > > > > BTW, could you explain a bit more why a barrier is necessary
> > > > > even on arm here?
> > > > > As far as I can see, there is a data dependency between the tail
> > > > > value and subsequent address calculations for ring writes/reads.
> > > > > Isn't that sufficient to prevent re-ordering even for a weak
> > > > > memory model?
> > > > The tail value affects 'n'. But, the value of 'n' can be speculated
> > > > because of the following 'if' statement:
> > > >
> > > > if (unlikely(n > *free_entries))
> > > >         n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *free_entries;
> > > >
> > > > The address calculations for actual ring writes/reads do not depend
> > > > on the tail value.
> >
> > Ok, agreed, I formulated it wrongly: only the limit value depends on
> > cons.tail. The address does not.
> >
> > > > Since 'n' can be speculated, the writes/reads can be moved up
> > > > before the load of the tail value.
> >
> > Out of curiosity: ok, I understand that the 'n' value can be speculated,
> > and speculative stores could start before n is calculated properly...
> > But are you saying that the results of such speculative stores might be
> > visible to other observers (a different CPU)?
> >
> You are correct. The speculative stores will NOT be visible to other
> observers until the value of 'n' is fixed. Speculative stores might have
> to be discarded depending on the value of 'n' (which will affect cache
> performance). There is also a control dependency between the load of
> cons.tail and the stores to the ring. That should also keep the load and
> stores from getting reordered (though I am not sure if it still allows
> for speculative stores).
> So, IMO, the barrier in enqueue is not needed. Is this what you were
> driving at?

Yes, that was my thought.
Thanks
Konstantin
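
For reference, the ordering discussed in this thread can be sketched with
C11 atomics. This is only an illustrative single-producer/single-consumer
ring, not the rte_ring code: struct sp_sc_ring, ring_init(),
ring_sp_enqueue(), ring_sc_dequeue() and RING_SIZE are hypothetical names
made up for the example. The acquire load of prod_tail in dequeue is the
load/load ordering that the patch is about. The acquire load of cons_tail
in enqueue is the conservative, portable choice, because C11 has no way to
express the control dependency mentioned above; whether the hardware needs
it on arm64 is exactly what is being debated here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u                  /* hypothetical; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

struct sp_sc_ring {
    _Atomic uint32_t prod_tail;       /* written by producer, read by consumer */
    _Atomic uint32_t cons_tail;       /* written by consumer, read by producer */
    void *slots[RING_SIZE];
};

static void ring_init(struct sp_sc_ring *r)
{
    assert((RING_SIZE & RING_MASK) == 0);
    atomic_init(&r->prod_tail, 0);
    atomic_init(&r->cons_tail, 0);
}

/* Single-producer enqueue. Returns 1 on success, 0 if the ring is full. */
static int ring_sp_enqueue(struct sp_sc_ring *r, void *obj)
{
    /* Only the producer writes prod_tail, so a relaxed load is enough. */
    uint32_t head = atomic_load_explicit(&r->prod_tail, memory_order_relaxed);
    /* Acquire: the slot store below cannot be hoisted above this load. */
    uint32_t cons = atomic_load_explicit(&r->cons_tail, memory_order_acquire);

    if (head - cons >= RING_SIZE)
        return 0;
    r->slots[head & RING_MASK] = obj;
    /* Release: the slot store is visible before the new tail is. */
    atomic_store_explicit(&r->prod_tail, head + 1, memory_order_release);
    return 1;
}

/* Single-consumer dequeue. Returns 1 on success, 0 if the ring is empty. */
static int ring_sc_dequeue(struct sp_sc_ring *r, void **obj)
{
    uint32_t head = atomic_load_explicit(&r->cons_tail, memory_order_relaxed);
    /* Acquire: the slot load below cannot be hoisted above this load. */
    uint32_t prod = atomic_load_explicit(&r->prod_tail, memory_order_acquire);

    if (prod == head)
        return 0;
    *obj = r->slots[head & RING_MASK];
    /* Release: the slot load completes before the slot is handed back. */
    atomic_store_explicit(&r->cons_tail, head + 1, memory_order_release);
    return 1;
}
```

On x86 the acquire loads compile to plain loads, which matches the point
made above that no instruction-level barrier is needed for IA; they still
act as compiler barriers.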
