<snip> > > Hi, everyone > > This patch can be closed with the following reasons. > > > -----邮件原件----- > > 发件人: dev <dev-boun...@dpdk.org> 代表 Honnappa Nagarahalli > > 发送时间: 2021年3月28日 9:00 > > 收件人: tho...@monjalon.net; Takeshi Yoshimura > > <t.yoshimura8...@gmail.com> > > 抄送: sta...@dpdk.org; dev@dpdk.org; olivier.m...@6wind.com; > > chao...@linux.vnet.ibm.com; konstantin.anan...@intel.com; Jerin Jacob > > <jerin.ja...@caviumnetworks.com>; nd <n...@arm.com>; nd > <n...@arm.com> > > 主题: Re: [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy > > dequeue/enqueue in ppc64 > > > > <snip> > > > > > Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] rte_ring: fix racy > > > dequeue/enqueue in ppc64 > > > > > > No reply after more than 2 years. > > > Unfortunately it is probably outdated now. > > > Classified as "Changes Requested". > > Looking at the code, I think this patch in fact fixes a bug. > > Appreciate rebasing this patch. > > > > The problem is already fixed in '__rte_ring_move_cons_head' but needs > > to be fixed in '__rte_ring_move_prod_head'. > > This problem is fixed for C11 version due to acquire load of cons.tail > > and prod.tail. > > First, for consumer in dequeue: > the reason for that adding a rmb in move_cons_head of “generic” is based on > this patch: > http://patches.dpdk.org/project/dpdk/patch/1552409933-45684-2-git-send- > email-gavin...@arm.com/ > > Slot Consumer > Producer > 1 dequeue elements > 2 > update prod_tail > 3 load new prod_tail > 4 check room is enough(n < entries) > > Dequeue elements maybe before load updated prod_tail, so consumer can > load incorrect elements value. > For dequeue multiple consumers case, ‘rte_atomic32_cmpset’ with acquire > and release order can prevent dequeue before load prod_tail, no extra rmb is > needed. > > Second, for single producer in enqueue: > > Slot Producer > Consumer > 1 enqueue elements(not commited) > 2 > update > consumer_tail > 3 load new consumer_tail > 4 check room is enough(n < entries) > 5 enqueued elements is committed > > Though enqueue elements maybe reorder before load consumer_tail, these > elements will not be committed until ‘check’ has finished. So from load to > write control dependency is reliable and rmb is not needed here. > [1] https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (page:15) > > As a result, it is unnecessary to add a rmb for enqueue single producer due to > control dependency. And this patch can be closed. Thanks Feifei, I did not consider the control dependency from load to store which is reliable in my comments below. Agree, we can reject this patch.
> > Best Regards > Feifei > > > > > > > > > > > 17/07/2018 05:34, Jerin Jacob: > > > > From: Takeshi Yoshimura <t.yoshimura8...@gmail.com> > > > > > > > > Cc: olivier.m...@6wind.com > > > > Cc: chao...@linux.vnet.ibm.com > > > > Cc: konstantin.anan...@intel.com > > > > > > > > > > > > > > > Adding rte_smp_rmb() cause performance regression on non x86 > > > platforms. > > > > > > Having said that, load-load barrier can be expressed very > > > > > > well with C11 memory model. I guess ppc64 supports C11 memory > model. > > > > > > If so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y > > for > > > > > > ppc64 and check original issue? > > > > > > > > > > Yes, the performance regression happens on non-x86 with single > > > > > producer/consumer. > > > > > The average latency of an enqueue was increased from 21 nsec to > > > > > 24 nsec in my simple experiment. But, I think it is worth it. > > > > > > > > That varies to machine to machine. What is the burst size etc. > > > > > > > > > > > > > > > > > > > I also tested C11 rte_ring, however, it caused the same race > > > > > condition in > > > ppc64. > > > > > I tried to fix the C11 problem as well, but I also found the C11 > > > > > rte_ring had other potential incorrect choices of memory orders, > > > > > which caused another race condition in ppc64. > > > > > > > > Does it happens on all ppc64 machines? Or on a specific machine? > > > > Is following tests are passing on your system without the patch? > > > > test/test/test_ring_perf.c > > > > test/test/test_ring.c > > > > > > > > > > > > > > For example, > > > > > __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), > > but I > > > > > am not sure why the load-acquire is used for the compare exchange. > > > > > > > > It correct as per C11 acquire and release semantics. > > > > > > > > > Also in update_tail, the pause can be called before the data > > > > > copy because of ht->tail load without atomic_load_n. > > > > > > > > > > The memory order is simply difficult, so it might take a bit > > > > > longer time to check if the code is correct. I think I can fix > > > > > the > > > > > C11 rte_ring as another patch. > > > > > > > > > > >> > > > > > >> SPDK blobfs encountered a crash around rte_ring dequeues in > ppc64. > > > > > >> It uses a single consumer and multiple producers for a rte_ring. > > > > > >> The problem was a load-load reorder in > > rte_ring_sc_dequeue_bulk(). > > > > > > > > > > > > Adding rte_smp_rmb() cause performance regression on non x86 > > > platforms. > > > > > > Having said that, load-load barrier can be expressed very > > > > > > well with C11 memory model. I guess ppc64 supports C11 memory > model. > > > > > > If so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y > > for > > > > > > ppc64 and check original issue? > > > > > > > > > > > >> > > > > > >> The reordered loads happened on r->prod.tail in > > > > > > > > There is rte_smp_rmb() just before reading r->prod.tail in > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > _rte_ring_move_cons_head(). Would that not suffice the requirement? > > > > > > > > Can you check adding compiler barrier and see is compiler is > > > > reordering the stuff? > > > > > > > > DPDK's ring implementation is based freebsd's ring implementation, > > > > I don't see need for such barrier > > > > > > > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h > > > > > > > > If it is something specific to ppc64 or a specific ppc64 machine, > > > > we could add a compile option as it is arch specific to avoid > > > > performance impact on other architectures. > > > > > > > > > >> __rte_ring_move_cons_head() (rte_ring_generic.h) and > > > > > >> ring[idx] in > > > > > >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control > > > > > >> dependency, but the code does not satisfy it. Note that they > > > > > >> are not reordered if __rte_ring_move_cons_head() with is_sc > > > > > >> != > > > > > >> 1 because cmpset invokes a read barrier. > > > > > >> > > > > > >> The paired stores on these loads are in ENQUEUE_PTRS() and > > > > > >> update_tail(). Simplified code around the reorder is the following. > > > > > >> > > > > > >> Consumer Producer > > > > > >> load idx[ring] > > > > > >> store idx[ring] > > > > > >> store r->prod.tail load r->prod.tail > > > > > >> > > > > > >> In this case, the consumer loads old idx[ring] and confirms > > > > > >> the load is valid with the new r->prod.tail. > > > > > >> > > > > > >> I added a read barrier in the case where __IS_SC is passed to > > > > > >> __rte_ring_move_cons_head(). I also fixed > > > > > >> __rte_ring_move_prod_head() to avoid similar problems with a > > > > > >> single > > > producer. > > > > > >> > > > > > >> Cc: sta...@dpdk.org > > > > > >> > > > > > >> Signed-off-by: Takeshi Yoshimura <t...@jp.ibm.com> > > > > > >> --- > > > > > >> lib/librte_ring/rte_ring_generic.h | 10 ++++++---- > > > > > >> 1 file changed, 6 insertions(+), 4 deletions(-) > > > > > >> > > > > > >> diff --git a/lib/librte_ring/rte_ring_generic.h > > > > > >> b/lib/librte_ring/rte_ring_generic.h > > > > > >> index ea7dbe5b9..477326180 100644 > > > > > >> --- a/lib/librte_ring/rte_ring_generic.h > > > > > >> +++ b/lib/librte_ring/rte_ring_generic.h > > > > > >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring > > > > > >> *r, > > > unsigned int is_sp, > > > > > >> return 0; > > > > > >> > > > > > >> *new_head = *old_head + n; > > > > > >> - if (is_sp) > > > > > >> + if (is_sp) { > > > > > >> + rte_smp_rmb(); > > > > > >> r->prod.head = *new_head, success = 1; > > > > > >> - else > > > > > >> + } else > > > > > >> success = > > > > > >> rte_atomic32_cmpset(&r->prod.head, > > > > > >> *old_head, *new_head); > > > > > >> } while (unlikely(success == 0)); @@ -158,9 +159,10 > > > > > >> @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int > is_sc, > > > > > >> return 0; > > > > > >> > > > > > >> *new_head = *old_head + n; > > > > > >> - if (is_sc) > > > > > >> + if (is_sc) { > > > > > >> + rte_smp_rmb(); > > > > > >> r->cons.head = *new_head, success = 1; > > > > > >> - else > > > > > >> + } else > > > > > >> success = > > > > > >> rte_atomic32_cmpset(&r->cons.head, > > *old_head, > > > > > >> *new_head); > > > > > >> } while (unlikely(success == 0)); > > > > > >> -- > > > > > >> 2.17.1 > > > > > >