> -----Original Message-----
> From: Jia He [mailto:hejia...@gmail.com]
> Sent: Thursday, November 2, 2017 3:43 PM
> To: Ananyev, Konstantin <konstantin.anan...@intel.com>;
> jerin.ja...@caviumnetworks.com; dev@dpdk.org; olivier.m...@6wind.com
> Cc: Richardson, Bruce <bruce.richard...@intel.com>; jianbo....@arm.com;
> hemant.agra...@nxp.com; jie2....@hxt-semitech.com;
> bing.z...@hxt-semitech.com; jia...@hxt-semitech.com
> Subject: Re: [PATCH v2] ring: guarantee ordering of cons/prod loading when
> doing
>
> Hi Ananyev
>
> On 11/2/2017 9:26 PM, Ananyev, Konstantin wrote:
> > Hi Jia,
> >
> >> -----Original Message-----
> >> From: Jia He [mailto:hejia...@gmail.com]
> >> Sent: Thursday, November 2, 2017 8:44 AM
> >> To: jerin.ja...@caviumnetworks.com; dev@dpdk.org; olivier.m...@6wind.com
> >> Cc: Ananyev, Konstantin <konstantin.anan...@intel.com>; Richardson, Bruce
> >> <bruce.richard...@intel.com>; jianbo....@arm.com;
> >> hemant.agra...@nxp.com; Jia He <hejia...@gmail.com>;
> >> jie2....@hxt-semitech.com; bing.z...@hxt-semitech.com;
> >> jia...@hxt-semitech.com
> >> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when
> >> doing
> >>
> >> We observed an rte panic of mbuf_autotest on our Qualcomm arm64 server.
> >> As for the possible race condition, please refer to [1].
> >>
> >> Furthermore, there are 2 options as suggested by Jerin:
> >> 1. use rte_smp_rmb
> >> 2. use load_acquire/store_release (refer to [2]).
> >> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
> >> default it is n.
> >>
> >> The reason for providing 2 options is the difference in performance
> >> benchmark results on different arm machines, please refer to [3].
> >>
> >> Already functionally tested on the following machines:
> >> on x86 (passed compilation)
> >> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
> >> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n
> >>
> >> [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
> >> [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
> >> [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html
> >>
> >> ---
> >> Changelog:
> >> V2: let users choose whether to use load_acquire/store_release
> >> V1: rte_smp_rmb() between 2 loads
> >>
> >> Signed-off-by: Jia He <hejia...@gmail.com>
> >> Signed-off-by: jie2....@hxt-semitech.com
> >> Signed-off-by: bing.z...@hxt-semitech.com
> >> Signed-off-by: jia...@hxt-semitech.com
> >> Suggested-by: jerin.ja...@caviumnetworks.com
> >> ---
> >>  lib/librte_ring/Makefile           |  4 +++-
> >>  lib/librte_ring/rte_ring.h         | 38 ++++++++++++++++++++++++------
> >>  lib/librte_ring/rte_ring_arm64.h   | 48 ++++++++++++++++++++++++++++++++++++++
> >>  lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
> >>  4 files changed, 127 insertions(+), 8 deletions(-)
> >>  create mode 100644 lib/librte_ring/rte_ring_arm64.h
> >>  create mode 100644 lib/librte_ring/rte_ring_generic.h
> >>
> >> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> >> index e34d9d9..fa57a86 100644
> >> --- a/lib/librte_ring/Makefile
> >> +++ b/lib/librte_ring/Makefile
> >> @@ -45,6 +45,8 @@ LIBABIVER := 1
> >>  SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> >>
> >>  # install includes
> >> -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
> >> +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> >> +					rte_ring_arm64.h \
> >> +					rte_ring_generic.h
> >>
> >>  include $(RTE_SDK)/mk/rte.lib.mk
> >> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> >> index 5e9b3b7..943b1f9 100644
> >> --- a/lib/librte_ring/rte_ring.h
> >> +++ b/lib/librte_ring/rte_ring.h
> >> @@ -103,6 +103,18 @@ extern "C" {
> >>  #include <rte_memzone.h>
> >>  #include <rte_pause.h>
> >>
> >> +/* On strong memory models (e.g. x86), there is no need for an rmb()
> >> + * between two loads.
> >> + * On weak models (powerpc/arm), the user has 2 choices:
> >> + * 1. use an rmb() memory barrier
> >> + * 2. use one-direction load_acquire/store_release barriers
> >> + * Which is better depends on performance test results.
> >> + */
> >> +#ifdef RTE_ARCH_ARM64
> >> +#include "rte_ring_arm64.h"
> >> +#else
> >> +#include "rte_ring_generic.h"
> >> +#endif
> >> +
> >>  #define RTE_TAILQ_RING_NAME "RTE_RING"
> >>
> >>  enum rte_ring_queue_behavior {
> >> @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t
> >> old_val, uint32_t new_val,
> >>  	while (unlikely(ht->tail != old_val))
> >>  		rte_pause();
> >>
> >> -	ht->tail = new_val;
> >> +	arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
> >>  }
> >>
> >>  /**
> >> @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int
> >> is_sp,
> >>  	/* Reset n to the initial burst count */
> >>  	n = max;
> >>
> >> -	*old_head = r->prod.head;
> >> +	*old_head = arch_rte_atomic_load(&r->prod.head,
> >> +					__ATOMIC_ACQUIRE);
> >>  	const uint32_t cons_tail = r->cons.tail;
> > The code starts to look a bit messy with all these arch-specific macros...
> > So I wonder, wouldn't it be cleaner to:
> >
> > 1. Move the existing
> >    __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> >    into rte_ring_generic.h
> > 2. Add rte_smp_rmb into the generic
> >    __rte_ring_move_prod_head/__rte_ring_move_cons_head
> >    (as was in v1 of your patch).
> > 3. Introduce ARM-specific versions of
> >    __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> >    in rte_ring_arm64.h
> >
> > That way we keep the generic code simple and clean, while still allowing
> > arch-specific optimizations.
> Thanks for your review.
> But as per your suggestion, there will be at least 2 copies of
> __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail.
> Thus, if there are any bugs in the future, both copies have to be
> changed, right?
Yes, there would be some code duplication.
Though the generic code would be much cleaner and simpler to read in that
case, which I think is worth some duplication.

> >
> >>  	/*
> >>  	 * The subtraction is done between two unsigned 32-bit values
> >> @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int
> >> is_sp,
> >>  		if (is_sp)
> >>  			r->prod.head = *new_head, success = 1;
> >>  		else
> >> -			success = rte_atomic32_cmpset(&r->prod.head,
> >> -					*old_head, *new_head);
> >> +			success = arch_rte_atomic32_cmpset(&r->prod.head,
> >> +					old_head, *new_head,
> >> +					0, __ATOMIC_ACQUIRE,
> >> +					__ATOMIC_RELAXED);
> >>  	} while (unlikely(success == 0));
> >>  	return n;
> >>  }
> >> @@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void *
> >> const *obj_table,
> >>  		goto end;
> >>
> >>  	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
> >> +
> >> +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
> > I wonder why we need that macro?
> > Would there be situations when smp_wmb() is not needed here?
> If the dpdk user chooses the acquire/release config, the store_release
> barrier in update_tail, together with the load_acquire barrier pair in
> __rte_ring_move_{prod,cons}_head, guarantees the order, so smp_wmb() is
> not required here. Please refer to the freebsd ring implementation and
> Jerin's debug patch:
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

Ok, so because you are doing
arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
inside update_tail(), you don't need mb() anymore here...
In that case, can we perhaps hide all that logic inside update_tail()?
I.e.,
move smp_rmb/smp_wmb into update_tail() and then for arm64 you'll
have your special one with atomic_store().
Something like this for the generic version:

 static __rte_always_inline void
 update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-		uint32_t single)
+		uint32_t single, uint32_t enqueue)
 {
+	if (enqueue)
+		rte_smp_wmb();
+	else
+		rte_smp_rmb();
 	/*
 	 * If there are other enqueues/dequeues in progress that preceded us,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
 		while (unlikely(ht->tail != old_val))
 			rte_pause();

 	ht->tail = new_val;
 }

....

 static __rte_always_inline unsigned int
 __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 int is_sp, unsigned int *free_space)
 {
 	.....
 	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);

-	rte_smp_wmb();
-	update_tail(&r->prod, prod_head, prod_next, is_sp);
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 	....

 static __rte_always_inline unsigned int
 __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 int is_sc, unsigned int *available)
 {
 	....
 	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);

-	rte_smp_rmb();
-	update_tail(&r->cons, cons_head, cons_next, is_sc);
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);

Konstantin

> ---
> Cheers,
> Jia
>
> > Konstantin
> >
>
> --
> Cheers,
> Jia