> > > > > For improved performance over the current C11 based ring
> > > > > implementation following changes were made.
> > > > > (1) Replace tail store with RELEASE semantics in
> > > > > __rte_ring_update_tail with a RELEASE fence. Replace load of the
> > > > > tail with ACQUIRE semantics in __rte_ring_move_prod_head and
> > > > > __rte_ring_move_cons_head with ACQUIRE fences.
> > > > > (2) Remove ACQUIRE fences between load of the old_head and load of
> > > > > the cons_tail in __rte_ring_move_prod_head and
> > > > > __rte_ring_move_cons_head.
> > > > > These two fences are not required for the safety of the ring library.
> > > >
> > > > Hmm... with these changes, aren't we re-introducing the old bug
> > > > fixed by this commit:
> > >
> > > Cover letter explains why this barrier does not solve what it intends
> > > to solve and why it should not matter.
> > > https://mails.dpdk.org/archives/dev/2023-June/270874.html
> >
> > Ok, let's consider the case similar to yours (i), but when r->prod.head
> > was moved by a distance greater than r->capacity.
> > To be more specific, let's start with the same initial state:
> > capacity = 32
> > r->cons.tail = 5
> > r->cons.head = 5
> > r->prod.head = 10
> > r->prod.tail = 10
> >
> > time 0, thread1:
> > /* re-ordered load */
> > cons_tail = r->cons.tail; // = 5
> >
> > Now, thread1 was stalled for a bit, meanwhile there were few
> What exactly do you mean by 'stalled'?
I mean: the CPU pipeline the thread was running on got stalled for whatever
reason - memory load latency, TLB miss, etc.

> If you are meaning, thread is preempted, then the ring algorithm is not
> designed for it. There are restrictions mentioned in [1].

I am not talking about SW thread preemption here.
With the example I provided, I think it is clear that the problem can happen
even under 'ideal' conditions: each thread runs non-preempted on a separate
core.

> However, we do need to handle the re-ordering case.

To be clear: for me right now this patch is bogus, it has to be either
reworked or abandoned.

> [1]
> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issues
>
> > enqueue/dequeues done by other threads, so current state of the ring:
> > r->cons.tail = 105
> > r->cons.head = 105
> > r->prod.head = 110
> > r->prod.tail = 110
> >
> > time 1, thread1:
> > old_head = r->prod.head; // 110
> > *free_entries = (capacity + cons_tail - old_head);
> >     // = (uint32_t)(32 + 5 - 110) == (uint32_t)-73 == 4294967223
> >
> > So, free_entries value is way too big, and that comparison:
> >
> > if (unlikely(n > *free_entries))
> >
> > might provide a wrong result.
> >
> > So I still think we do need some sort of _read_fence_ between these two
> > loads.
> > As I said before, that looks exactly like the old bug, fixed a while ago:
> > http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > but now re-introduced for the C11 case.
>
> Agree that the re-ordering case should be handled.

Either handled, or simply not allowed.

> I am thinking a check (*free_entries > capacity) and restarting the loop might
> suffice (without the barrier)?

I thought about the same thing, and at first glance it seems workable in
principle.
Though I am still hesitant to remove ordering completely here:
as both compiler and CPU reordering will be possible, it will probably
introduce the possibility of a sort-of ABA problem:
old cons.tail is read at a very early stage, after that the current cons.tail
value wraps around 2^32 and is now less than the old cons.tail.
Then we read prod.head.
So:
(invalid_free_ent = capacity + cons.tail_old - prod.head) <= capacity
and
(valid_free_ent = capacity + cons.tail_cur - prod.head) <= capacity
are both true, but invalid > valid and we overestimate the number of free
entries.
In the majority of cases the possibility of such a situation is negligible,
but for huge values of 'capacity' it will grow.

Can I ask a bit different question: as I remember, this series started as an
attempt to improve the C11 ring enqueue/dequeue implementation.
Though looking at the non-C11 one, there also exists a read-fence between
these two loads:
https://elixir.bootlin.com/dpdk/v23.07/source/lib/ring/rte_ring_generic_pvt.h#L73
Which makes me wonder - maybe it is not the read-ordering itself that causes
the slowdown of the C11 version.
Did anyone try to compare the generated code for both cases? I wonder what
the difference is.

Konstantin
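To make the arithmetic from the example above easy to reproduce, here is a
standalone sketch (plain C with an illustrative free_entries() helper, not
DPDK code) of the unsigned 32-bit computation and of the
"*free_entries > capacity" retry check suggested above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same modulo-2^32 subtraction as the ring's free-entry computation. */
static uint32_t
free_entries(uint32_t capacity, uint32_t cons_tail, uint32_t prod_head)
{
	return capacity + cons_tail - prod_head;
}

int main(void)
{
	const uint32_t capacity = 32;

	/* cons.tail as loaded early (the re-ordered load at "time 0"). */
	uint32_t stale_cons_tail = 5;

	/* Ring state at "time 1": head/tail advanced by more than capacity. */
	uint32_t cur_cons_tail = 105;
	uint32_t prod_head = 110;

	uint32_t bogus = free_entries(capacity, stale_cons_tail, prod_head);
	uint32_t valid = free_entries(capacity, cur_cons_tail, prod_head);

	/* Prints bogus=4294967223 valid=27. */
	printf("bogus=%u valid=%u\n", bogus, valid);

	/*
	 * Retry check discussed above: a correct result can never exceed
	 * capacity, so the stale tail is detectable in this particular case.
	 */
	bool need_retry = bogus > capacity;
	printf("retry needed: %s\n", need_retry ? "yes" : "no");
	return 0;
}

The check catches this case because the bogus count exceeds capacity; the
2^32 wrap-around case described above is the one it would miss (see also the
numeric sketch at the end of this mail).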
> > > >
> > > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > > > Author: Jia He <justin...@arm.com>
> > > > Date:   Fri Nov 10 03:30:42 2017 +0000
> > > >
> > > >     ring: guarantee load/load order in enqueue and dequeue
> > > >
> > > >     We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
> > > >     (Amberwing).
> > > >
> > > >     Root cause:
> > > >     In __rte_ring_move_cons_head()
> > > >     ...
> > > >             do {
> > > >                     /* Restore n as it may change every loop */
> > > >                     n = max;
> > > >
> > > >                     *old_head = r->cons.head;                //1st load
> > > >                     const uint32_t prod_tail = r->prod.tail; //2nd load
> > > >
> > > >     In weak memory order architectures (powerpc, arm), the 2nd load might be
> > > >     reordered before the 1st load, which makes *entries bigger than we wanted.
> > > >     This nasty reordering messed enqueue/dequeue up.
> > > >     ....
> > > > ?
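For reference, here is a minimal, self-contained sketch (illustrative demo_*
names and values, not the actual rte_ring code) of the load/load ordering the
commit quoted above talks about: the head is loaded first, then an acquire
fence, then the opposing tail - the C11 counterpart of the read fence in the
generic implementation linked above.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for the real rte_ring structures. */
struct demo_headtail {
	uint32_t head;
	uint32_t tail;
};

struct demo_ring {
	uint32_t capacity;
	struct demo_headtail prod;
	struct demo_headtail cons;
};

/*
 * Count free entries the way the producer path does: load prod.head first,
 * then cons.tail. The acquire fence between the two loads is what keeps a
 * weakly ordered CPU from observing cons.tail "before" prod.head.
 */
static uint32_t
demo_count_free(struct demo_ring *r)
{
	uint32_t old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);

	/* Load/load ordering point discussed in the commit above. */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);

	uint32_t cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_RELAXED);

	/* Unsigned modulo-2^32 arithmetic, same shape as the real code. */
	return r->capacity + cons_tail - old_head;
}

int main(void)
{
	struct demo_ring r = {
		.capacity = 32,
		.prod = { .head = 10, .tail = 10 },
		.cons = { .head = 5,  .tail = 5 },
	};

	printf("free entries: %u\n", demo_count_free(&r)); /* prints 27 */
	return 0;
}

Removing that fence, as the patch below does, allows the cons.tail load to be
satisfied with a stale value, which is the scenario computed earlier in this
mail.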
> > > > >
> > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithan...@arm.com>
> > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> > > > > ---
> > > > >  .mailmap                    |  1 +
> > > > >  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
> > > > >  2 files changed, 21 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/.mailmap b/.mailmap
> > > > > index 4018f0fc47..367115d134 100644
> > > > > --- a/.mailmap
> > > > > +++ b/.mailmap
> > > > > @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heym...@corigine.com>
> > > > >  Wang Sheng-Hui <shh...@gmail.com>
> > > > >  Wangyu (Eric) <seven.wan...@huawei.com>
> > > > >  Waterman Cao <waterman....@intel.com>
> > > > > +Wathsala Vithanage <wathsala.vithan...@arm.com>
> > > > >  Weichun Chen <weichunx.c...@intel.com>
> > > > >  Wei Dai <wei....@intel.com>
> > > > >  Weifeng Li <liweifen...@126.com>
> > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > > > > index f895950df4..63fe58ce9e 100644
> > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
> > > > >  		uint32_t new_val, uint32_t single, uint32_t enqueue)
> > > > >  {
> > > > >  	RTE_SET_USED(enqueue);
> > > > > +	/*
> > > > > +	 * Updating of ht->tail cannot happen before elements are added to or
> > > > > +	 * removed from the ring, as it could result in data races between
> > > > > +	 * producer and consumer threads. Therefore we need a release
> > > > > +	 * barrier here.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> > > > >
> > > > >  	/*
> > > > >  	 * If there are other enqueues/dequeues in progress that preceded us,
> > > > > @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
> > > > >  	if (!single)
> > > > >  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> > > > >
> > > > > -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
> > > > >  }
> > > > >
> > > > >  /**
> > > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> > > > >  		/* Reset n to the initial burst count */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > > -
> > > > > -		/* load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > -		 */
> > > > >  		cons_tail = __atomic_load_n(&r->cons.tail,
> > > > > -					__ATOMIC_ACQUIRE);
> > > > > +					__ATOMIC_RELAXED);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have
> > > > > @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> > > > >  				0, __ATOMIC_RELAXED,
> > > > >  				__ATOMIC_RELAXED);
> > > > >  	} while (unlikely(success == 0));
> > > > > +	/*
> > > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > > +	 * load of the new_head in SP and MP cases.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > >  	return n;
> > > > >  }
> > > > >
> > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > > > >  		/* Restore n as it may change every loop */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > > -
> > > > > -		/* this load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > -		 */
> > > > >  		prod_tail = __atomic_load_n(&r->prod.tail,
> > > > > -					__ATOMIC_ACQUIRE);
> > > > > +					__ATOMIC_RELAXED);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have
> > > > > @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > > > >  				0, __ATOMIC_RELAXED,
> > > > >  				__ATOMIC_RELAXED);
> > > > >  	} while (unlikely(success == 0));
> > > > > +	/*
> > > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > > +	 * load of the new_head in SP and MP cases.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > >  	return n;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.25.1
> > > > >
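As a closing numeric illustration of the 2^32 wrap-around concern raised
earlier in this mail, here is a plain C sketch with made-up values (not DPDK
code): the count computed from the stale cons.tail stays within capacity, so
a capacity check alone would not flag it, yet it overestimates the real free
space.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint32_t capacity = 64;

	uint32_t stale_cons_tail = 100;	/* cons.tail loaded very early */
	/* cons.tail then advances by (2^32 - 50) entries and wraps to 50 */
	uint32_t cur_cons_tail = 50;
	uint32_t prod_head = 109;	/* prod.head loaded much later */

	uint32_t invalid = capacity + stale_cons_tail - prod_head;	/* 55 */
	uint32_t valid = capacity + cur_cons_tail - prod_head;		/* 5 */

	/* Both results are <= capacity, yet the stale one is 50 too large. */
	printf("invalid=%u valid=%u capacity=%u\n", invalid, valid, capacity);
	printf("capacity check flags stale value: %s\n",
	       invalid > capacity ? "yes" : "no");
	return 0;
}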