On Thu, 12 Oct 2017 17:53:51 +0200
Olivier MATZ <[email protected]> wrote:
> Hi,
>
> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > Before this patch:
> > In __rte_ring_move_cons_head()
> > ...
> > do {
> >         /* Restore n as it may change every loop */
> >         n = max;
> >
> >         *old_head = r->cons.head;                 //1st load
> >         const uint32_t prod_tail = r->prod.tail;  //2nd load
> >
> > On architectures with a weak memory ordering model (powerpc, arm), the 2nd
> > load might be reordered before the 1st load, which can make *entries bigger
> > than it should be. This nasty reordering messes up enqueue/dequeue:
> >
> > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> >                                                 load r->prod.tail
> > in enqueue:
> > load r->cons.tail
> > load r->prod.head
> >
> > store r->prod.tail
> >
> >                         load r->cons.head
> >                         load r->prod.tail
> >                         ...
> >                         store r->cons.{head,tail}
> >                                                 load r->cons.head
> >
> > THEN r->cons.head (read by cpu3 after cpu2 has advanced it) is bigger than
> > the stale prod_tail, and the unsigned subtraction makes *entries huge.
> >
> > After this patch, the old cons.head will be recalculated after a failed
> > rte_atomic32_cmpset.
> >
> > There is no such issue on x86, because x86 has a strong memory ordering
> > model (a load is never reordered with an earlier load).
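> >
> > To illustrate the intended ordering, here is a minimal sketch of the
> > consumer side (simplified, not the actual rte_ring source; it assumes
> > the ring layout above and rte_smp_rmb() from rte_atomic.h):
> >
> >     /* 1st load: snapshot of the consumer head */
> >     uint32_t old_head = r->cons.head;
> >
> >     /* forbid hoisting the next load above the previous one */
> >     rte_smp_rmb();
> >
> >     /* 2nd load: producer tail, read strictly after cons.head, so it
> >      * can only be newer; cons.head never passes prod.tail, hence the
> >      * unsigned subtraction below cannot underflow */
> >     const uint32_t prod_tail = r->prod.tail;
> >     uint32_t entries = prod_tail - old_head;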
> >
> > Signed-off-by: Jia He <[email protected]>
> > Signed-off-by: [email protected]
> > Signed-off-by: [email protected]
> > Signed-off-by: [email protected]
> >
> > ---
> > lib/librte_ring/rte_ring.h | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index 5e9b3b7..15c72e2 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> > n = max;
> >
> > *old_head = r->prod.head;
> > +
> > + /* load of prod.tail can't be reordered before cons.head */
> > + rte_smp_rmb();
> > +
> > const uint32_t cons_tail = r->cons.tail;
> > /*
> > * The subtraction is done between two unsigned 32bits value
> > @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > n = max;
> >
> > *old_head = r->cons.head;
> > +
> > + /* load of prod.tail can't be reordered before cons.head */
> > + rte_smp_rmb();
> > +
> > const uint32_t prod_tail = r->prod.tail;
> > /* The subtraction is done between two unsigned 32bits value
> > * (the result is always modulo 32 bits even if we have
> > --
> > 2.7.4
> >
>
> The explanation convinces me.
>
> However, since it's in a critical path, it would be good to have other
> opinions. This patch reminds me of a discussion that was also related to a
> memory barrier, but at another place:
> http://dpdk.org/ml/archives/dev/2016-July/043765.html
> It led to this patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> But that was finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
>
> Konstantin, Jerin, do you have any comment?
On strongly ordered architectures like x86, rte_smp_rmb() turns into a
compiler barrier.
Compiler barriers generate no instructions; they just tell the compiler not
to move data accesses across them. Since the pointers involved are mostly
volatile, it shouldn't even change the generated code.
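
For reference, this is roughly how those macros are defined (paraphrased
from the per-arch rte_atomic headers; check the actual DPDK sources for
the authoritative versions):

    /* pure compiler barrier: no instruction is emitted */
    #define rte_compiler_barrier() do {             \
            asm volatile ("" : : : "memory");       \
    } while (0)

    #ifdef RTE_ARCH_X86
    /* x86 does not reorder a load with an earlier load, so constraining
     * the compiler is enough */
    #define rte_smp_rmb() rte_compiler_barrier()
    #else
    /* weakly ordered CPUs (e.g. armv8) need a real load-load barrier */
    #define rte_smp_rmb() asm volatile ("dmb ishld" : : : "memory")
    #endif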