> From: Wathsala Vithanage [mailto:wathsala.vithan...@arm.com]
> Sent: Friday, 21 April 2023 21.17
>
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
> can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED. This is
> because __rte_ring_dequeue_elems calculates the addresses of the
> dequeued elements from the old_head updated by the
> __atomic_compare_exchange_n intrinsic in __rte_ring_move_cons_head.
> This results in an address dependency between the two operations, so
> __rte_ring_dequeue_elems cannot happen before
> __rte_ring_move_cons_head.
> Similarly, __rte_ring_enqueue_elems and __rte_ring_move_prod_head
> won't be reordered either.
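To make the address dependency argument concrete for other readers,
here is a simplified, self-contained sketch of the consumer path. The
names are shortened, C11 stdatomic is used instead of the GCC builtins,
and most bookkeeping is omitted; this is NOT the verbatim rte_ring code:

    #include <stdatomic.h>
    #include <stdint.h>

    struct ring {
        _Atomic uint32_t cons_head;
        _Atomic uint32_t prod_tail;
        uint32_t mask;          /* size - 1; size is a power of two */
        void *slots[1024];      /* ring storage */
    };

    /* Move the consumer head; return the old head for addressing slots. */
    static uint32_t
    move_cons_head(struct ring *r, uint32_t n)
    {
        uint32_t old_head, new_head, prod_tail;

        old_head = atomic_load_explicit(&r->cons_head, memory_order_relaxed);
        do {
            /* Ensure the head is read before the tail. */
            atomic_thread_fence(memory_order_acquire);

            /*
             * The patched line: relaxed instead of acquire. The slot
             * addresses in dequeue() below are computed from old_head,
             * which the CAS ties to this iteration, so the slot loads
             * carry an address dependency and cannot be hoisted above
             * this point by the CPU.
             */
            prod_tail = atomic_load_explicit(&r->prod_tail,
                                             memory_order_relaxed);
            if (prod_tail - old_head < n)
                return UINT32_MAX;      /* not enough entries */
            new_head = old_head + n;
        } while (!atomic_compare_exchange_strong_explicit(
                &r->cons_head, &old_head, new_head,
                memory_order_relaxed, memory_order_relaxed));

        return old_head;
    }

    static void
    dequeue(struct ring *r, void **objs, uint32_t n)
    {
        uint32_t head = move_cons_head(r, n);

        if (head == UINT32_MAX)
            return;
        for (uint32_t i = 0; i != n; i++)
            objs[i] = r->slots[(head + i) & r->mask]; /* depends on head */
    }

Note that, strictly per the C11 memory model, a relaxed load provides
no ordering by itself; the guarantee relied on here is the hardware
preserving address dependencies, which is what the commit message
argues.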
These preconditions should be added as comments in the source code.

>
> Performance on Arm N1
> Gain relative to generic implementation
> +--------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                           |
> +--------------------------------------------------------------------+
> | Generic              | C11 atomics          | C11 atomics improved |
> +--------------------------------------------------------------------+
> | Total count: 766730  | Total count: 651686  | Total count: 812125  |
> |                      | Gain: -15%           | Gain: 6%             |
> +--------------------------------------------------------------------+
> +--------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                          |
> +--------------------------------------------------------------------+
> | Generic              | C11 atomics          | C11 atomics improved |
> +--------------------------------------------------------------------+
> | Total count: 816745  | Total count: 646385  | Total count: 830935  |
> |                      | Gain: -21%           | Gain: 2%             |
> +--------------------------------------------------------------------+

Big performance gain compared to the pre-improved C11 atomics!
Excellent.

>
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +--------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                    |
> +--------------------------------------------------------------------+
> | Generic              | C11 atomics          | C11 atomics improved |
> +--------------------------------------------------------------------+
> | Total count: 181640  | Total count: 181995  | Total count: 182791  |
> |                      | Gain: 0.2%           | Gain: 0.6%           |
> +--------------------------------------------------------------------+
> +--------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                   |
> +--------------------------------------------------------------------+
> | Generic              | C11 atomics          | C11 atomics improved |
> +--------------------------------------------------------------------+
> | Total count: 167495  | Total count: 161536  | Total count: 163190  |
> |                      | Gain: -3.5%          | Gain: -2.6%          |
> +--------------------------------------------------------------------+

I noticed that the larger size (32 objects) has a larger relative
performance drop than the smaller size (8 objects), so I am wondering:
what are the performance numbers for size 512, the default
RTE_MEMPOOL_CACHE_MAX_SIZE? It's probably not going to change anything
regarding the patch acceptance, but I'm curious about the numbers.
>
> Signed-off-by: Wathsala Vithanage <wathsala.vithan...@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Reviewed-by: Feifei Wang <feifei.wa...@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heym...@corigine.com>
>  Wang Sheng-Hui <shh...@gmail.com>
>  Wangyu (Eric) <seven.wan...@huawei.com>
>  Waterman Cao <waterman....@intel.com>
> +Wathsala Vithanage <wathsala.vithan...@arm.com>
>  Weichun Chen <weichunx.c...@intel.com>
>  Wei Dai <wei....@intel.com>
>  Weifeng Li <liweifen...@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
>
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */

I think this comment should be clarified as:

    Updating of ht->tail SHOULD NOT happen before elements are added to
    or removed from the ring, as it WOULD result in data races between
    producer and consumer threads. Therefore ht->tail MUST be updated
    with release semantics to prevent the ring data copy phase from
    sinking below it.

Don't use capitals; I only used them to highlight my modifications.

NB: English is not my native language, so please ignore me if I am
mistaken!

>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>  }
>
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);

The __ATOMIC_ACQUIRE load had a comment describing its purpose. Please
don't just remove that comment; replace it with a new comment
describing why __ATOMIC_RELAXED is valid here.

>
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> -
> +					__ATOMIC_RELAXED);

Same comment as above: don't just remove the old comment, please
replace it with a new one.

>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1

With comments regarding __ATOMIC_RELAXED added to the source code,

Acked-by: Morten Brørup <m...@smartsharesystems.com>
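PS: For the replacement comment, something along these lines would
capture the reasoning from the commit message (my wording; just a
suggestion, not part of the patch):

		/*
		 * A relaxed load is sufficient here: the addresses of the
		 * dequeued elements are calculated from the old_head produced
		 * by the __atomic_compare_exchange_n below, and that address
		 * dependency prevents __rte_ring_dequeue_elems from being
		 * reordered before this point.
		 */
		prod_tail = __atomic_load_n(&r->prod.tail,
				__ATOMIC_RELAXED);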