The solution presented in this RFC is not C11 compliant.
C11 __atomic_compare_exchange_n updates "expected" only when the CAS operation
fails.
Therefore, the assumption that the CAS in the producer/consumer head update
creates an address dependency to the subsequent ring element accesses does not
hold under the semantics of the C11 memory model.
Thus, this solution may corrupt ring elements when executed on CPUs with weak
memory models.
Therefore, this RFC will be retracted.

> -----Original Message-----
> From: Wathsala Vithanage <wathsala.vithan...@arm.com>
> Sent: Friday, April 21, 2023 3:17 PM
> To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>;
> konstantin.v.anan...@yandex.ru; Feifei Wang <feifei.wa...@arm.com>
> Cc: dev@dpdk.org; nd <n...@arm.com>; Wathsala Wathawana Vithanage
> <wathsala.vithan...@arm.com>
> Subject: [RFC] ring: improve ring performance with C11 atomics
> 
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head can
> be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED, because
> __rte_ring_dequeue_elems calculates the addresses of the dequeue elements
> from the old_head updated by the __atomic_compare_exchange_n intrinsic in
> __rte_ring_move_cons_head.
> This results in an address dependency between the two operations, so
> __rte_ring_dequeue_elems cannot be reordered before
> __rte_ring_move_cons_head.
> Similarly, __rte_ring_enqueue_elems and __rte_ring_move_prod_head won't be
> reordered either.
> 
> Performance on Arm N1
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                         |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686  | Total count: 812125  |
> |                     |        Gain: -15%    |        Gain: 6%      |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                        |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385  | Total count: 830935  |
> |                     |        Gain: -21%    |        Gain: 2%      |
> +-------------------------------------------------------------------+
> 
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                  |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995  | Total count: 182791  |
> |                     |        Gain: 0.2%    |        Gain: 0.6%    |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                 |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536  | Total count: 163190  |
> |                     |        Gain: -3.5%   |        Gain: -2.6%   |
> +-------------------------------------------------------------------+
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithan...@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> Reviewed-by: Feifei Wang <feifei.wa...@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heym...@corigine.com>
>  Wang Sheng-Hui <shh...@gmail.com>
>  Wangyu (Eric) <seven.wan...@huawei.com>
>  Waterman Cao <waterman....@intel.com>
> +Wathsala Vithanage <wathsala.vithan...@arm.com>
>  Weichun Chen <weichunx.c...@intel.com>
>  Wei Dai <wei....@intel.com>
>  Weifeng Li <liweifen...@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>       if (!single)
>               rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> 
> +     /*
> +      * Updating of ht->tail cannot happen before elements are added to or
> +      * removed from the ring, as it could result in data races between
> +      * producer and consumer threads. Therefore ht->tail should be updated
> +      * with release semantics to prevent the ring data copy phase from
> +      * sinking below it.
> +      */
>       __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> }
> 
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>               /* Ensure the head is read before tail */
>               __atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -             /* load-acquire synchronize with store-release of ht->tail
> -              * in update_tail.
> -              */
>               cons_tail = __atomic_load_n(&r->cons.tail,
> -                                     __ATOMIC_ACQUIRE);
> +                                     __ATOMIC_RELAXED);
> 
>               /* The subtraction is done between two unsigned 32bits value
>                * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>               /* Ensure the head is read before tail */
>               __atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -             /* this load-acquire synchronize with store-release of ht->tail
> -              * in update_tail.
> -              */
>               prod_tail = __atomic_load_n(&r->prod.tail,
> -                                     __ATOMIC_ACQUIRE);
> -
> +                                     __ATOMIC_RELAXED);
>               /* The subtraction is done between two unsigned 32bits value
>                * (the result is always modulo 32 bits even if we have
>                * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1
