Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load

2018-09-27 Thread Justin He



> -Original Message-
> From: Gavin Hu (Arm Technology China)
> Sent: September 26, 2018 17:29
> To: Gavin Hu (Arm Technology China) ; dev@dpdk.org
> Cc: Honnappa Nagarahalli ; Steve Capper
> ; Ola Liljedahl ;
> jerin.ja...@caviumnetworks.com; nd ; sta...@dpdk.org; Justin
> He 
> Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>
> +Justin He
>
> > -Original Message-
> > From: Gavin Hu 
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) ; Honnappa
> > Nagarahalli ; Steve Capper
> > ; Ola Liljedahl ;
> > jerin.ja...@caviumnetworks.com; nd ; sta...@dpdk.org
> > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> >
> > In update_tail, read ht->tail using __atomic_load. Although the
> > compiler currently seems to be doing the right thing even without
> > __atomic_load, we don't want to give the compiler freedom to optimise
> > what should be an atomic load; it should not be arbitrarily moved around.
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Gavin Hu 
> > Reviewed-by: Honnappa Nagarahalli 
> > Reviewed-by: Steve Capper 
> > Reviewed-by: Ola Liljedahl 
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> > index 94df3c4..234fea0 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> >   * we need to wait for them to complete
> >   */
> >  if (!single)
> > -while (unlikely(ht->tail != old_val))
> > +while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > +__ATOMIC_RELAXED)))

I still wonder why an atomic load is needed here.
Cheers,
Justin (Jia He)
> >  rte_pause();
> >
> >  __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > --
> > 2.7.4
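
To Justin's question, a minimal sketch (not the DPDK code itself; the struct
and function names are illustrative) of what can go wrong without the atomic
load: with a plain read, the compiler is free to cache ht->tail in a register
and spin on a stale value, whereas __atomic_load_n forces a fresh load on
every iteration. __ATOMIC_RELAXED suffices because only atomicity, not
ordering, is required at this point.

#include <stdint.h>

struct headtail { uint32_t head; uint32_t tail; };

/* Risky: a plain read lets the compiler hoist ht->tail into a
 * register and spin on the stale value forever, or move the read
 * around; nothing marks it as a shared-memory access.
 */
static void wait_plain(const struct headtail *ht, uint32_t old_val)
{
    while (ht->tail != old_val)
        ; /* spin; the real code calls rte_pause() */
}

/* Safe: __atomic_load_n performs a real load on every iteration.
 * __ATOMIC_RELAXED is enough here; the release store of the new
 * tail provides the ordering.
 */
static void wait_atomic(const struct headtail *ht, uint32_t old_val)
{
    while (__atomic_load_n(&ht->tail, __ATOMIC_RELAXED) != old_val)
        ; /* spin; the real code calls rte_pause() */
}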



Re: [dpdk-dev] [PATCH v3 3/3] ring: move the atomic load of head above the loop

2018-09-27 Thread Justin He



> -Original Message-
> From: Gavin Hu (Arm Technology China)
> Sent: September 26, 2018 17:30
> To: Gavin Hu (Arm Technology China) ; dev@dpdk.org
> Cc: Honnappa Nagarahalli ; Steve Capper
> ; Ola Liljedahl ;
> jerin.ja...@caviumnetworks.com; nd ; sta...@dpdk.org; Justin
> He 
> Subject: RE: [PATCH v3 3/3] ring: move the atomic load of head above the loop
>
> +Justin He for review
>
> > -Original Message-
> > From: Gavin Hu 
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) ; Honnappa
> > Nagarahalli ; Steve Capper
> > ; Ola Liljedahl ;
> > jerin.ja...@caviumnetworks.com; nd ; sta...@dpdk.org
> > Subject: [PATCH v3 3/3] ring: move the atomic load of head above the loop
> >
> > In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> > the do {} while loop: upon failure the old_head is updated by the
> > compare-exchange, so another explicit load is costly and not necessary.
> >
> > This helps latency a little, by about 1~5%.
> >
> >  Test result with the patch (two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.64
> >  MP/MC bulk enq/dequeue (size: 8): 9.58
> >  SP/SC bulk enq/dequeue (size: 32): 1.98
> >  MP/MC bulk enq/dequeue (size: 32): 2.30
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Gavin Hu 
> > Reviewed-by: Honnappa Nagarahalli 
> > Reviewed-by: Steve Capper 
> > Reviewed-by: Ola Liljedahl 
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> > index 0eae3b3..95cc508 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  unsigned int max = n;
> >  int success;
> >
> > +*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
> >  do {
> >  /* Reset n to the initial burst count */
> >  n = max;
> >
> > -*old_head = __atomic_load_n(&r->prod.head,
> > -__ATOMIC_ACQUIRE);
> > -
> >  /* load-acquire synchronize with store-release of ht->tail
> >   * in update_tail.
> >   */
> > @@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  if (is_sp)
> >  r->prod.head = *new_head, success = 1;
> >  else
> > +/* on failure, *old_head is updated */
> > success = __atomic_compare_exchange_n(&r->prod.head,
> >  old_head, *new_head,
> >  0, __ATOMIC_ACQUIRE,
> > @@ -134,13 +133,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  int success;
> >
> >  /* move cons.head atomically */
> > +*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
> >  do {
> >  /* Restore n as it may change every loop */
> >  n = max;
> >
> > -*old_head = __atomic_load_n(&r->cons.head,
> > -__ATOMIC_ACQUIRE);
> > -
> >  /* this load-acquire synchronize with store-release of ht->tail
> >   * in update_tail.
> >   */
> > @@ -165,6 +162,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  if (is_sc)
> >  r->cons.head = *new_head, success = 1;
> >  else
> > +/* on failure, *old_head will be updated */
> > success = __atomic_compare_exchange_n(&r->cons.head,
> > old_head, *new_head,
> > 0, __ATOMIC_ACQUIRE,
> > --
> > 2.7.4
Reviewed-by: Jia He 

Cheers,
Justin (Jia He)
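
For readers following the thread, a minimal sketch (simplified from the ring
code; the names are illustrative) of the property this patch relies on: on
failure, __atomic_compare_exchange_n writes the value it observed back into
its 'expected' argument, so the explicit load only has to happen once, before
the retry loop.

#include <stdint.h>

/* Advance a producer head by n using CAS; returns the old head. */
static uint32_t move_head(uint32_t *head, uint32_t n)
{
    uint32_t old_head, new_head;
    int success;

    /* One explicit load before the loop is enough ... */
    old_head = __atomic_load_n(head, __ATOMIC_ACQUIRE);
    do {
        new_head = old_head + n;
        /* ... because on failure the CAS stores the current
         * value of *head into old_head before we retry.
         */
        success = __atomic_compare_exchange_n(head,
                &old_head, new_head,
                0 /* strong */, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    } while (!success);
    return old_head;
}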


Re: [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail

2018-09-27 Thread Justin He
Hi Gavin

> -Original Message-
> From: Gavin Hu (Arm Technology China)
> Sent: September 26, 2018 17:30
> To: Gavin Hu (Arm Technology China) ; dev@dpdk.org
> Cc: Honnappa Nagarahalli ; Steve Capper
> ; Ola Liljedahl ;
> jerin.ja...@caviumnetworks.com; nd ; sta...@dpdk.org; Justin
> He 
> Subject: RE: [PATCH v3 2/3] ring: synchronize the load and store of the tail
>
> +Justin He for review.
>
> > -Original Message-
> > From: Gavin Hu 
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) ; Honnappa
> > Nagarahalli ; Steve Capper
> > ; Ola Liljedahl ;
> > jerin.ja...@caviumnetworks.com; nd ; sta...@dpdk.org
> > Subject: [PATCH v3 2/3] ring: synchronize the load and store of the
> > tail
> >
> > Synchronize the load-acquire of the tail with the store-release in
> > update_tail: the store-release ensures all ring operations, enqueue
> > or dequeue, are seen by observers on the other side as soon as they
> > see the updated tail. The load-acquire is needed because a data
> > dependency is not a reliable way to enforce ordering; the compiler
> > might break it by saving values to temporaries to boost performance.
> > When computing free_entries and avail_entries, load the heads and
> > tails with atomic semantics instead.
> >
> > The patch was benchmarked with test/ring_perf_autotest; it decreases
> > the enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real
> > gains depend on the number of lcores, the depth of the ring, and
> > whether it is SP/SC or MP/MC.
> > For 1 lcore, it also improves a little, about 3 ~ 4%.
> > It is a big improvement: in the MP/MC case, with two lcores and a
> > ring size of 32, it cuts latency by up to (3.26-2.36)/3.26 = 27.6%.
> >
> > This patch is a bug fix, while the improvement is a bonus. In our
> > analysis the improvement comes from the cacheline pre-filling after
> > hoisting the load-acquire up above __atomic_compare_exchange_n.
> >
> > The test command:
> > $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
> > --socket-mem=1024 -- -i
> >
> > Test result with this patch (two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.86
> >  MP/MC bulk enq/dequeue (size: 8): 10.15
> >  SP/SC bulk enq/dequeue (size: 32): 1.94
> >  MP/MC bulk enq/dequeue (size: 32): 2.36
> >
> > Test result without this patch, for comparison:
> >  SP/SC bulk enq/dequeue (size: 8): 6.67
> >  MP/MC bulk enq/dequeue (size: 8): 13.12
> >  SP/SC bulk enq/dequeue (size: 32): 2.04
> >  MP/MC bulk enq/dequeue (size: 32): 3.26
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Gavin Hu 
> > Reviewed-by: Honnappa Nagarahalli 
> > Reviewed-by: Steve Capper 
> > Reviewed-by: Ola Liljedahl 
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> > index 234fea0..0eae3b3 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  *old_head = __atomic_load_n(&r->prod.head,
> >  __ATOMIC_ACQUIRE);
> >
> > -/*
> > - *  The subtraction is done between two unsigned 32bits value
> > +/* load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> > +__ATOMIC_ACQUIRE);
I noticed earlier that FreeBSD also uses the double __atomic_load_n. As we
discussed several months ago [1], the second load-acquire seems
unnecessary. But since you have verified it, it looks good to me.
[1] http://mails.dpdk.org/archives/dev/2017-November/080983.html
So,
Reviewed-by: Jia He 

Cheers,
Justin (Jia He)
> > +
> > +/* The subtraction is done between two unsigned 32bits value
> >   * (the result is always modulo 32 bits even if we have
> >   * *old_head > cons_tail). So 'free_entries' is always between 0
> >   * and capacity (which is < size).
> >   */
> > -*free_entries = (capacity + r->cons.tail - *old_head);
> > +*free_entries = (capacity + cons_tail - *old_head);
> >
> >  /* check that we have enough room in ring */
> >  if (unlikely(n > *free_entries))
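
As a side note, the modulo-32-bit subtraction in the comment above can be
checked with a small example (the values below are illustrative): unsigned
arithmetic makes the computation correct even when the head index has wrapped
past 2^32 while the tail has not.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    const uint32_t capacity = 1024;
    uint32_t cons_tail = 0xFFFFFFF0u; /* consumer tail, pre-wrap */
    uint32_t old_head  = 0x00000010u; /* producer head, wrapped */

    /* 32 entries are in flight: old_head - cons_tail == 0x20 mod 2^32 */
    uint32_t free_entries = capacity + cons_tail - old_head;
    assert(free_entries == 1024 - 32); /* the wraparound cancels out */
    return 0;
}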
> > @@ -132,15 +137,22 @@
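
Stepping back, a minimal sketch of the acquire/release pairing the commit
message describes, using a toy fixed-size buffer rather than rte_ring itself
(all names are illustrative, and index wraparound is ignored for brevity):

#include <stdint.h>

#define SLOTS 64
static uint32_t ring_data[SLOTS];
static uint32_t prod_tail; /* index published by the producer */

/* Producer: write the payload first, then publish the new tail with
 * a store-release, so the payload is visible before the tail update.
 */
static void producer_publish(uint32_t idx, uint32_t value)
{
    ring_data[idx % SLOTS] = value;
    __atomic_store_n(&prod_tail, idx + 1, __ATOMIC_RELEASE);
}

/* Consumer: the load-acquire of the tail pairs with the producer's
 * store-release; once the new tail is seen, everything stored before
 * the release is guaranteed visible, without relying on a data
 * dependency the compiler might optimise away.
 */
static int consumer_poll(uint32_t idx, uint32_t *out)
{
    uint32_t tail = __atomic_load_n(&prod_tail, __ATOMIC_ACQUIRE);
    if (idx >= tail)
        return 0; /* nothing new yet */
    *out = ring_data[idx % SLOTS];
    return 1;
}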