The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED. To calculate
the addresses of the dequeued elements, __rte_ring_dequeue_elems uses
the old_head value updated by the __atomic_compare_exchange_n intrinsic
in __rte_ring_move_cons_head. This creates an address dependency between
the two operations, so __rte_ring_dequeue_elems cannot be reordered
before __rte_ring_move_cons_head. By the same argument,
__rte_ring_enqueue_elems cannot be reordered before
__rte_ring_move_prod_head.
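For readers less familiar with dependency ordering, below is a minimal
standalone sketch (hypothetical names, not the DPDK code; the ring's
compare-exchange is simplified to a plain relaxed load) of the address
dependency described above. The slot address is computed from old_head,
so hardware that preserves address dependencies, such as Arm, cannot
reorder the slot load before the head load. Note that the C11 memory
model only guarantees this ordering via memory_order_consume; the ring
code relies on the ISA-level behaviour.

#include <stdint.h>

#define SLOT_COUNT 256u			/* hypothetical size, power of two */

static uint32_t head;			/* consumer head index */
static uint32_t slots[SLOT_COUNT];	/* ring storage */

static inline uint32_t
dequeue_one(void)
{
	uint32_t old_head = __atomic_load_n(&head, __ATOMIC_RELAXED);

	/*
	 * The slot address depends on old_head, so this load carries an
	 * address dependency on the load of head above and cannot be
	 * hoisted ahead of it by the hardware.
	 */
	return slots[old_head & (SLOT_COUNT - 1u)];
}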
Performance on Arm N1
Gain relative to generic implementation
+------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1)                         |
+------------------------------------------------------------------+
| Generic             | C11 atomics         | C11 atomics improved |
+------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686 | Total count: 812125  |
|                     | Gain: -15%          | Gain: 6%             |
+------------------------------------------------------------------+
+------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1)                        |
+------------------------------------------------------------------+
| Generic             | C11 atomics         | C11 atomics improved |
+------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385 | Total count: 830935  |
|                     | Gain: -21%          | Gain: 2%             |
+------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8                                  |
+------------------------------------------------------------------+
| Generic             | C11 atomics         | C11 atomics improved |
+------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995 | Total count: 182791  |
|                     | Gain: 0.2%          | Gain: 0.6%           |
+------------------------------------------------------------------+
+------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32                                 |
+------------------------------------------------------------------+
| Generic             | C11 atomics         | C11 atomics improved |
+------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536 | Total count: 163190  |
|                     | Gain: -3.5%         | Gain: -2.6%          |
+------------------------------------------------------------------+

Signed-off-by: Wathsala Vithanage <wathsala.vithan...@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Reviewed-by: Feifei Wang <feifei.wa...@arm.com>
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans <walter.heym...@corigine.com>
 Wang Sheng-Hui <shh...@gmail.com>
 Wangyu (Eric) <seven.wan...@huawei.com>
 Waterman Cao <waterman....@intel.com>
+Wathsala Vithanage <wathsala.vithan...@arm.com>
 Weichun Chen <weichunx.c...@intel.com>
 Wei Dai <wei....@intel.com>
 Weifeng Li <liweifen...@126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore ht->tail should be updated
+	 * with release semantics to prevent the ring data copy phase from
+	 * sinking below it.
+	 */
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
@@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
-
+					__ATOMIC_RELAXED);
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
-- 
2.25.1
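As a companion to the comment added in __rte_ring_update_tail, the
sketch below (again standalone, with hypothetical names, not the DPDK
implementation) illustrates why the tail store keeps release semantics
even though the tail loads above are relaxed: the release store prevents
the data copy phase from sinking below the tail update, so any reader
that synchronizes with this store also observes the copied elements.

#include <stdint.h>

#define SLOT_COUNT 256u			/* hypothetical size, power of two */

static uint32_t prod_tail;		/* producer tail index */
static uint32_t slots[SLOT_COUNT];	/* ring storage */

static inline void
enqueue_one(uint32_t old_head, uint32_t val)
{
	/* Copy phase: write the element into its slot. */
	slots[old_head & (SLOT_COUNT - 1u)] = val;

	/*
	 * Release store: the write above may not be reordered past this
	 * tail update, so a consumer ordered against this store (as
	 * described in the commit message) also sees the element.
	 */
	__atomic_store_n(&prod_tail, old_head + 1, __ATOMIC_RELEASE);
}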