from:"Jia He"

Re: [dpdk-dev] [PATCH V7 2/3] ring: introduce new header file to include common functions

2018-01-15 Thread Jia He




On 1/13/2018 1:09 AM, Thomas Monjalon Wrote:

04/12/2017 02:50, Jia He:

move the common part of rte_ring.h into rte_ring_generic.h.
move the memory barrier part into update_tail().

no functional changes here.

[...]

--- /dev/null
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -0,0 +1,195 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 hxt-semitech. All rights reserved.

As you are moving existing code, I think you should keep
the original copyright.

Ok, thanks for the comments

--
Cheers,
Jia

[dpdk-dev] [PATCH v8 0/3] support c11 memory model barrier in librte_ring

2018-01-16 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n"
on any architectures so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

---
Changelog:
V8: Change the lincense to SPDX tag. Change USE_C11_MEM_MODEL to "n" on
any architectures by default
V7: fix check-git-log warnings which is suggested by Jerin
V6: minor change in subject and log
V5: split it into 2 patchset due to the milestone concerns, this is the 2st
one. Also fix checkpatch.pl warnings
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (3):
  eal/arm64: remove the braces {} for dmb() and dsb()
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_linuxapp |   2 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 173 ++
 lib/librte_ring/rte_ring_c11_mem.h | 193 
 lib/librte_ring/rte_ring_generic.h | 202 +
 7 files changed, 420 insertions(+), 164 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

[dpdk-dev] [PATCH v8 1/3] eal/arm64: remove the braces {} for dmb() and dsb()

2018-01-16 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition")
Signed-off-by: Jia He 
Acked-by: Jerin Jacob 
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index b6bbd0b..10ccf14 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -15,8 +15,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

[dpdk-dev] [PATCH v8 2/3] ring: introduce new header file to include common functions

2018-01-16 Thread Jia He

Move the common part of rte_ring.h into rte_ring_generic.h.
Move the memory barrier part into update_tail().

No functional changes here.

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev Konstantin 
Acked-by: Jerin Jacob 
Acked-by: Olivier Matz 
---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   3 +-
 lib/librte_ring/rte_ring.h   | 161 +---
 lib/librte_ring/rte_ring_generic.h   | 202 +++
 4 files changed, 210 insertions(+), 162 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index 480d1d0..29d4228 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -98,9 +98,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -140,9 +139,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index f5f0d35..281ffb7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -17,6 +17,7 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index e924438..519614c 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
-static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
-{
-   const uint32_t capacity = r->capacity;
-   unsigned int max = n;
-   int success;
-
-   do {
-   /* Reset n to the initial burst count */
-   n = max;
-
-   *old_head = r->prod.head;
-
-   /* add rmb barrier to avoid load/load reorder in weak
-* memory model. It is noop on x86
-*/
-   rte_smp_rmb();
-
-   const uint32_t cons_tail = r->cons.tail;
-   /*
-*  The subtraction is done between two unsigned 32bits value
-* (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
-* and capacity (which is < size).
-*/
-   *free_entries = (capacity + cons_tail - *old_head);
-
-   /* check that we have enough room in ring */
-   if (unlikely(

[dpdk-dev] [PATCH v8 3/3] ring: introduce new header file to support C11 memory model

2018-01-16 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n"
on any architectures so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, refer to [2].

We haven't tested on ppc64. If anyone verifies it, he can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Acked-by: Jerin Jacob 
Acked-by: Olivier Matz 
Acked-by: Jianbo Liu 
---
 config/common_linuxapp |   2 +
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/rte_ring.h |  14 ++-
 lib/librte_ring/rte_ring_c11_mem.h | 193 +
 4 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64..999fbaa 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -50,3 +50,5 @@ CONFIG_RTE_LIBRTE_AVP_PMD=y
 CONFIG_RTE_LIBRTE_NFP_PMD=y
 CONFIG_RTE_LIBRTE_POWER=y
 CONFIG_RTE_VIRTIO_USER=y
+
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 281ffb7..bde8907 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
-   rte_ring_generic.h
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 519614c..3343eba 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-/* Move common functions to generic file */
+/* Between load and load. there might be cpu reorder in weak model
+ * (powerpc/arm).
+ * There are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier,defined by
+ * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_RING_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
 #include "rte_ring_generic.h"
+#endif
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
new file mode 100644
index 000..20b7ff6
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,193 @@
+/*-
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   SPDX-License-Identifier: BSD-3-Clause
+ */
+
+/*
+ * Derived from FreeBSD's bufring.h
+ *
+ **
+ *
+ * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ *this list of conditions and the following disclaimer.
+ *
+ * 2. The name of Kip Macy nor the names of other
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ ***/
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+   uint32_t single, uint32_t enqueue)
+{
+   RTE_SET_U

Re: [dpdk-dev] [PATCH v8 3/3] ring: introduce new header file to support C11 memory model

2018-01-17 Thread Jia He



Hi Thomas

On 1/17/2018 4:24 PM, Thomas Monjalon Wrote:

17/01/2018 05:03, Jia He:

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n"
on any architectures so far.

In previous patches, it was enabled for ARM.
You decided to not enable it at all?

Sorry, maybe I misunderstand your previous mail.
>This config option should be added in the common file (as disabled).
Do you mean  CONFIG_RTE_RING_USE_C11_MEM_MODEL=n in comm_base and
"y" in armv8 config?

Cheers,
Jia



  config/common_linuxapp |   2 +

It should be defined in common_base, not common_linuxapp.


--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,193 @@
+/*-
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   SPDX-License-Identifier: BSD-3-Clause
+ */

It is not complying with the template.
Please check the license/ directory.

Why is it Intel Copyright?
"All rights reserved" is probably not needed.


+/*
+ * Derived from FreeBSD's bufring.h
+ *
+ **
+ *
+ * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ *this list of conditions and the following disclaimer.
+ *
+ * 2. The name of Kip Macy nor the names of other
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ ***/

This double license may be an issue.
Hemant, comment?



--
Cheers,
Jia

Re: [dpdk-dev] [PATCH v8 3/3] ring: introduce new header file to support C11 memory model

2018-01-18 Thread Jia He




On 1/19/2018 7:52 AM, Thomas Monjalon Wrote:

17/01/2018 10:09, Thomas Monjalon:

17/01/2018 09:47, Jia He:

Hi Thomas

On 1/17/2018 4:24 PM, Thomas Monjalon Wrote:

17/01/2018 05:03, Jia He:

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n"
on any architectures so far.

In previous patches, it was enabled for ARM.
You decided to not enable it at all?

Sorry, maybe I misunderstand your previous mail.
  >This config option should be added in the common file (as disabled).
Do you mean  CONFIG_RTE_RING_USE_C11_MEM_MODEL=n in comm_base and
"y" in armv8 config?

Yes, exactly

Please, could you send a v9?
Thanks



This double  license may be an issue.



Hemant,  comment?


Ok， should I remove the double license info before Hemant's confirmation?

--
Cheers,
Jia

Re: [dpdk-dev] [PATCH v8 2/3] ring: introduce new header file to include common functions

2018-01-21 Thread Jia He


Hi Hermant


On 1/20/2018 12:47 AM, Hemant Agrawal Wrote:

Hi Olivier,


On Fri, Jan 19, 2018 at 07:45:30PM +0530, Hemant Agrawal wrote:

Hi Jia,

On 1/17/2018 9:33 AM, Jia He wrote:

Move the common part of rte_ring.h into rte_ring_generic.h.
Move the memory barrier part into update_tail().

No functional changes here.

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev Konstantin 
Acked-by: Jerin Jacob 
Acked-by: Olivier Matz 
---
diff --git a/lib/librte_ring/rte_ring_generic.h
b/lib/librte_ring/rte_ring_generic.h
new file mode 100644
index 000..01f2cae
--- /dev/null
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -0,0 +1,202 @@
+/*-
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   SPDX-License-Identifier: BSD-3-Clause

The SPDX should be first line. See other files for Intel or NXP.

[Hemant] Don't add SPDX to this file.
  This file is not BSD-3 licensed.  Please keep the full text as in the 
original file.



+ */
+
+/*
+ * Derived from FreeBSD's bufring.h
+ *
+


+*
**

+***
+ *
+ * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or
+without
+ * modification, are permitted provided that the following conditions

are met:

+ *
+ * 1. Redistributions of source code must retain the above copyright

notice,

+ *this list of conditions and the following disclaimer.
+ *
+ * 2. The name of Kip Macy nor the names of other
+ *contributors may be used to endorse or promote products derived

from

+ *this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND

CONTRIBUTORS "AS IS"

+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,

EXEMPLARY,

+OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;

OR

+BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF

LIABILITY,

+WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+


+*
**

+/
+

This is BSD-2-freebsd, which is not a approved license for DPDK.
Can you ask Kip Macy, if he/she is ok to re-license it with BSD-3?

Please check with legal, if you can just keep the copyright of Kip
Macy and re license it with BSD-3.

I see the BSD-3 license to be permissive enough to be re-licensed as BSD-3.
But I am not a lawyer.


I agree this is something we should do, as a maintainer of librte_ring, I can
do it.

But here, Jia is just moving code in a new file. I don't think this should block
his patchset from beeing included.

  [Hemant]   I thought of blocking this kind of moves, so that we get the 
license complaint of DPDK faster 😊

Jia, shall keep the original copyrights and headers in this file (i.e. No 
SPDX).   You need to fix it along with rte_ring.h in near future.

Regards,
Hemant


Ok, I will
Besides ,I got the allowance from Kip Macy just now. He/She allowed dpdk 
to license librte_ring.h as BSD-3.


My question:
>Would you mind allowing dpdk librte_ring.h to be licensed as BSD 3 
instead of BSD 2?

His/her reply:

"I think that's fine. If you're using it be careful I think there's a
fix to memory barrier usage needed more relaxed memory models such as
ARM. I'll check reviews to see if it made it in or not."

--
Cheers,
Jia

[dpdk-dev] [PATCH v9 1/3] eal/arm64: remove the braces {} for dmb() and dsb()

2018-01-21 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition")
Signed-off-by: Jia He 
Acked-by: Jerin Jacob 
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index b6bbd0b..10ccf14 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -15,8 +15,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

[dpdk-dev] [PATCH v9 0/3] support c11 memory model barrier in librte_ring

2018-01-21 Thread Jia He

There are 2 model barrier options in librte_ring suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release

CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided for supporting C11 memory
model barrier in librte_ring. By default it is "y" on arm64, "n" on any
other architectures so far.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

---
Changelog:
V9: remove the SPDX tag and refine commit logs
V8: Change the license to SPDX tag. Change USE_C11_MEM_MODEL to "n" on
any architectures by default
V7: fix check-git-log warnings which is suggested by Jerin
V6: minor change in subject and log
V5: split it into 2 patchset due to the milestone concerns, this is the 2st
one. Also fix checkpatch.pl warnings
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (3):
  eal/arm64: remove the braces {} for dmb() and dsb()
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_armv8a_linuxapp  |   2 +
 config/common_base |   5 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 173 ++--
 lib/librte_ring/rte_ring_c11_mem.h | 218 
 lib/librte_ring/rte_ring_generic.h | 228 +
 8 files changed, 476 insertions(+), 164 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

[dpdk-dev] [PATCH v9 3/3] ring: introduce new header file to support C11 memory model

2018-01-21 Thread Jia He

This patch is to support C11 memory model barrier in librte_ring.

There are 2 barrier implementation options in librte_ring (suggested
by Jerin).
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
The reason why providing 2 options is the performance benchmark
difference in different arm machines, refer to [2].

CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n"
on any architectures and only "y" on arm64 so far.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Acked-by: Jerin Jacob 
Acked-by: Olivier Matz 
Acked-by: Jianbo Liu 
---
 config/common_armv8a_linuxapp  |   2 +
 config/common_base |   5 +
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/rte_ring.h |  14 ++-
 lib/librte_ring/rte_ring_c11_mem.h | 218 +
 5 files changed, 240 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h

diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
index 790e716..0a1f370 100644
--- a/config/common_armv8a_linuxapp
+++ b/config/common_armv8a_linuxapp
@@ -34,3 +34,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
 CONFIG_RTE_LIBRTE_AVP_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
diff --git a/config/common_base b/config/common_base
index 170a389..0aadda8 100644
--- a/config/common_base
+++ b/config/common_base
@@ -843,3 +843,8 @@ CONFIG_RTE_APP_CRYPTO_PERF=y
 # Compile the eventdev application
 #
 CONFIG_RTE_APP_EVENTDEV=y
+
+#
+# Use C11 memory model barrier in Librte_ring
+#
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 281ffb7..bde8907 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
-   rte_ring_generic.h
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index bc3756a..cbf2f19 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -357,8 +357,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-/* Move common functions to generic file */
+/* Between load and load. there might be cpu reorder in weak model
+ * (powerpc/arm).
+ * There are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier,defined by
+ * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_RING_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
 #include "rte_ring_generic.h"
+#endif
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
new file mode 100644
index 000..8ccb937
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,218 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 HXT-semitech Corporation.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of HXT-semitech Corporation nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SO

[dpdk-dev] [PATCH v9 2/3] ring: introduce new header file to include common functions

2018-01-21 Thread Jia He

Move the common part of rte_ring.h into rte_ring_generic.h.
Move the memory barrier part into update_tail().

No functional changes here.

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev Konstantin 
Acked-by: Jerin Jacob 
Acked-by: Olivier Matz 
---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   3 +-
 lib/librte_ring/rte_ring.h   | 161 +
 lib/librte_ring/rte_ring_generic.h   | 228 +++
 4 files changed, 236 insertions(+), 162 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index 480d1d0..29d4228 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -98,9 +98,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -140,9 +139,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index f5f0d35..281ffb7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -17,6 +17,7 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 7069d52..bc3756a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -357,91 +357,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
-static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
-{
-   const uint32_t capacity = r->capacity;
-   unsigned int max = n;
-   int success;
-
-   do {
-   /* Reset n to the initial burst count */
-   n = max;
-
-   *old_head = r->prod.head;
-
-   /* add rmb barrier to avoid load/load reorder in weak
-* memory model. It is noop on x86
-*/
-   rte_smp_rmb();
-
-   const uint32_t cons_tail = r->cons.tail;
-   /*
-*  The subtraction is done between two unsigned 32bits value
-* (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
-* and capacity (which is < size).
-*/
-   *free_entries = (capacity + cons_tail - *old_head);
-
-   /* check that we have enough room in ring */
-   if (unlikely(

Re: [dpdk-dev] [PATCH v8 2/3] ring: introduce new header file to include common functions

2018-01-21 Thread Jia He


Hi hermant


On 1/22/2018 1:15 PM, Hemant Agrawal Wrote:

Hi Jia,

On 1/22/2018 7:23 AM, Jia He wrote:

This is BSD-2-freebsd, which is not a approved license for DPDK.
Can you ask Kip Macy, if he/she is ok to re-license it with BSD-3?

Please check with legal, if you can just keep the copyright of Kip
Macy and re license it with BSD-3.

I see the BSD-3 license to be permissive enough to be re-licensed as
BSD-3.
But I am not a lawyer.


I agree this is something we should do, as a maintainer of
librte_ring, I can
do it.

But here, Jia is just moving code in a new file. I don't think this
should block
his patchset from beeing included.

  [Hemant]   I thought of blocking this kind of moves, so that we get
the license complaint of DPDK faster 😊

Jia, shall keep the original copyrights and headers in this file (i.e.
No SPDX).   You need to fix it along with rte_ring.h in near future.

Regards,
Hemant


Ok, I will
Besides ,I got the allowance from Kip Macy just now. He/She allowed dpdk
to license librte_ring.h as BSD-3.

My question:

Would you mind allowing dpdk librte_ring.h to be licensed as BSD 3

instead of BSD 2?
His/her reply:

"I think that's fine. If you're using it be careful I think there's a
fix to memory barrier usage needed more relaxed memory models such as
ARM. I'll check reviews to see if it made it in or not."



That is good. Your Patch v9 looks ok.

Will you please also add another patch for following:
(all files in librte_ring - having BSD-2 license)

/* SPDX-License-Identifier: BSD-3-Clause
 *
 * Copyright . (Intel or your company)
 * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org
 * All rights reserved.
 * Derived from FreeBSD's bufring.h
 * Used as BSD-3 Licensed with permission from Kip Macy.
 */



With pleasure, thanks

--
Cheers,
Jia

[dpdk-dev] [PATCH] ring: convert license headers to SPDX tags

2018-01-21 Thread Jia He

Signed-off-by: Jia He 
---
 lib/librte_ring/rte_ring.c | 66 +++---
 lib/librte_ring/rte_ring.h | 66 +++---
 lib/librte_ring/rte_ring_c11_mem.h | 65 +++--
 lib/librte_ring/rte_ring_generic.h | 66 +++---
 4 files changed, 20 insertions(+), 243 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index 036467f..d215ace 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -1,67 +1,11 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- * * Redistributions of source code must retain the above copyright
- *   notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- *   notice, this list of conditions and the following disclaimer in
- *   the documentation and/or other materials provided with the
- *   distribution.
- * * Neither the name of Intel Corporation nor the names of its
- *   contributors may be used to endorse or promote products derived
- *   from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-/*
- * Derived from FreeBSD's bufring.c
- *
- **
+/* SPDX-License-Identifier: BSD-3-Clause
  *
+ * Copyright (c) 2010-2015 Intel Corporation
  * Copyright (c) 2007,2008 Kip Macy km...@freebsd.org
  * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice,
- *this list of conditions and the following disclaimer.
- *
- * 2. The name of Kip Macy nor the names of other
- *contributors may be used to endorse or promote products derived from
- *this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
- * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
- * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
- * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
- * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
- * POSSIBILITY OF SUCH DAMAGE.
- *
- ***/
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
 
 #include 
 #include 
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index cbf2f19..253cdc9 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,67 +1,11 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- * * Redistributions of source code must retain the above copyright
- *   notice, this list of conditions and the following disclaimer.
- * * Redistributions in binary form must reproduce the above copyright
- *   notice, this list of conditions and the following disclaimer in
- *   the documentation and/or other materials provided with the
- *   distributi

[dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-10 Thread Jia He

Before this patch:
In __rte_ring_move_cons_head()
...
do {
/* Restore n as it may change every loop */
n = max;

*old_head = r->cons.head;//1st load
const uint32_t prod_tail = r->prod.tail; //2nd load

In weak memory order architectures(powerpc,arm), the 2nd load might be
reodered before the 1st load, that makes *entries is bigger than we wanted.
This nasty reording messed enque/deque up.

cpu1(producer)  cpu2(consumer)  cpu3(consumer)
load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

load r->cons.head
load r->prod.tail
...
store r->cons.{head,tail}
load r->cons.head

THEN,r->cons.head will be bigger than prod_tail, then make *entries very big

After this patch, the old cons.head will be recaculated after failure of
rte_atomic32_cmpset

There is no such issue in X86 cpu, because X86 is strong memory order model

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com

---
 lib/librte_ring/rte_ring.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..15c72e2 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
n = max;
 
*old_head = r->prod.head;
+
+   /* load of prod.tail can't be reordered before cons.head */
+   rte_smp_rmb();
+
const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
n = max;
 
*old_head = r->cons.head;
+
+   /* load of prod.tail can't be reordered before cons.head */
+   rte_smp_rmb();
+
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
-- 
2.7.4

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-12 Thread Jia He

Hi Jerin

On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu, 12 Oct 2017 17:05:50 +
From: "Ananyev, Konstantin" 
To: Olivier MATZ , Jia He 
CC: "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "jerin.ja...@caviumnetworks.com"

Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
  doing enqueue/dequeue

Hi guys,

-Original Message-
From: Olivier MATZ [mailto:olivier.m...@6wind.com]
Sent: Thursday, October 12, 2017 4:54 PM
To: Jia He 
Cc: dev@dpdk.org; jia...@hxt-semitech.com; jie2@hxt-semitech.com; 
bing.z...@hxt-semitech.com; Ananyev, Konstantin
; jerin.ja...@caviumnetworks.com
Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing 
enqueue/dequeue

Hi,

On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:

Before this patch:
In __rte_ring_move_cons_head()
...
 do {
 /* Restore n as it may change every loop */
 n = max;

 *old_head = r->cons.head;//1st load
 const uint32_t prod_tail = r->prod.tail; //2nd load

In weak memory order architectures(powerpc,arm), the 2nd load might be
reodered before the 1st load, that makes *entries is bigger than we wanted.
This nasty reording messed enque/deque up.

cpu1(producer)  cpu2(consumer)  cpu3(consumer)
 load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

 load r->cons.head
 load r->prod.tail
 ...
 store r->cons.{head,tail}
 load r->cons.head

THEN,r->cons.head will be bigger than prod_tail, then make *entries very big

After this patch, the old cons.head will be recaculated after failure of
rte_atomic32_cmpset

There is no such issue in X86 cpu, because X86 is strong memory order model

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com

---
  lib/librte_ring/rte_ring.h | 8 
  1 file changed, 8 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..15c72e2 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
n = max;

*old_head = r->prod.head;
+
+   /* load of prod.tail can't be reordered before cons.head */
+   rte_smp_rmb();
+
const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
n = max;

*old_head = r->cons.head;
+
+   /* load of prod.tail can't be reordered before cons.head */
+   rte_smp_rmb();
+
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
--
2.7.4

The explanation convinces me.

However, since it's in a critical path, it would be good to have other
opinions. This patch reminds me this discussion, that was also related to
memory barrier, but at another place:
http://dpdk.org/ml/archives/dev/2016-July/043765.html
Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3

Konstatin, Jerin, do you have any comment?

For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make 
any difference,
but  I can't see how read reordering would screw things up here...
Probably just me and arm or ppc guys could explain what will be the problem
if let say cons.tail will be read before prod.head in 
__rte_ring_move_prod_head().
I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)?
Probably new test-case for rte_ring autotest is needed, or is it possible to 
reproduce
it with existing one?

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any of the
arm/ppc target? arm64 will have performance impact with this change.

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia

Konstantin

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-12 Thread Jia He




On 10/13/2017 9:02 AM, Jia He Wrote:

Hi Jerin


On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu, 12 Oct 2017 17:05:50 +


[...]

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any of 
the

arm/ppc target? arm64 will have performance impact with this change.

sorry, miss one important information
Our platform is an aarch64 server with 46 cpus.
If we reduced the involved cpu numbers, the bug occurred less frequently.

Yes, mb barrier impact the performance, but correctness is more 
important, isn't it ;-)

Maybe we can  find any other lightweight barrier here?

Cheers,
Jia

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia




Konstantin

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-12 Thread Jia He


Hi


On 10/13/2017 9:02 AM, Jia He Wrote:

Hi Jerin


On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu, 12 Oct 2017 17:05:50 +


[...]

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any of 
the

arm/ppc target? arm64 will have performance impact with this change.

sorry, miss one important information
Our platform is an aarch64 server with 46 cpus.
If we reduced the involved cpu numbers, the bug occurred less frequently.

Yes, mb barrier impact the performance, but correctness is more 
important, isn't it ;-)

Maybe we can  find any other lightweight barrier here?

Cheers,
Jia

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia




Konstantin

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-12 Thread Jia He

Hi Jerin

On 10/13/2017 9:49 AM, Jerin Jacob Wrote:

-Original Message-

Date: Fri, 13 Oct 2017 09:16:31 +0800
From: Jia He 
To: Jerin Jacob , "Ananyev, Konstantin"

Cc: Olivier MATZ , "dev@dpdk.org" ,
  "jia...@hxt-semitech.com" ,
  "jie2@hxt-semitech.com" ,
  "bing.z...@hxt-semitech.com" 
Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
  doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.3.0

Hi

On 10/13/2017 9:02 AM, Jia He Wrote:

Hi Jerin

On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu, 12 Oct 2017 17:05:50 +

[...]

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any
of the
arm/ppc target? arm64 will have performance impact with this change.

sorry, miss one important information
Our platform is an aarch64 server with 46 cpus.

Is this an OOO(Out of order execution) aarch64 CPU implementation?

I think so, it is a server cpu (ARMv8-A), but do you know how to confirm it?
cat /proc/cpuinfo
processor   : 0
BogoMIPS    : 40.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0x800
CPU revision    : 0

If we reduced the involved cpu numbers, the bug occurred less frequently.

Yes, mb barrier impact the performance, but correctness is more important,
isn't it ;-)

Yes.

Maybe we can  find any other lightweight barrier here?

Yes, Regarding the lightweight barrier, arm64 has native support for acquire 
and release
semantics, which is exposed through gcc as architecture agnostic
functions.
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
http://preshing.com/20130922/acquire-and-release-fences/

Good to know,
1) How much overhead this patch in your platform? Just relative
numbers are enough

I create a *standalone* test case for test_mbuf
Attached the debug patch
It is hard to believe but the truth is that the performance after adding 
rmb barrier

is better than no adding.

With this patch (4 times running)
time ./test_good --no-huge -l 1-20
real    0m23.311s
user    7m21.870s
sys 0m0.021s

time ./test_bad --no-huge -l 1-20
Without this patch
real    0m38.972s
user    12m35.271s
sys 0m0.030s

Cheers,
Jia

2) As a prototype, Is Changing to acquire and release schematics
reduces the overhead in your platform?

Reference FreeBSD ring/DPDK style ring implementation through acquire
and release schematics
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

I will also spend on cycles on this.

Cheers,
Jia

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia

Konstantin

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..23168e7 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -517,6 +517,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
n = max;

*old_head = r->cons.head;
+   rte_smp_rmb();
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
diff --git a/test/test/test.c b/test/test/test.c
index 9accbd1..6910b06 100644
--- a/test/test/test.c
+++ b/test/test/test.c
@@ -61,6 +61,7 @@ extern cmdline_parse_ctx_t main_ctx[];

 #include "test.h"

+extern int test_mbuf(void);
 #define RTE_LOGTYPE_APP RTE_LOGTYPE_USER1

 const char *prgname; /* to be set to argv[0] */
@@ -135,6 +136,9 @@ main(int argc, char **argv)
RTE_LOG(INFO, APP,
"HPET is not enabled, using TSC as default 
timer\n");

+   test_mbuf();
+   printf("test_mbuf done\n");
+   return 0;

 #ifdef RTE_LIBRTE_CMDLINE
cl = cmdline_stdin_new(main_ctx, "RTE>>");
diff --git a/test/test/test_mbuf.c b/test/test/test_mbuf.c
index 3396b4a..7277260 100644
--- a/test/test/test_mbuf.c
+++ b/test/test/test_mbuf.c
@@ -59,7 +59,6 @@
 #include 

 #include "test.h"
-
 #define MBUF_DATA_SIZE  2048
 #define NB_MBUF 128
 #define MBUF_TEST_DATA_LEN  1464
@@ -85,7 +84,7 @@

 static volatile uint32_t refcnt_stop_slaves;
 static unsigned refcnt_lcore[RTE_MAX_LCORE];
-
+int test_mbuf(void);
 #endif

 /*
@@ -718,7 +717,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
for (i = 0, n = rte_mempool_avail_count(refcnt_pool);
i != n && (m = rte_pktmbuf_alloc(refcnt_pool)) != NULL;
i++) {
-

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-13 Thread Jia He


Hi


On 10/13/2017 3:33 PM, Jianbo Liu Wrote:

The 10/13/2017 07:19, Jerin Jacob wrote:

-Original Message-

Date: Fri, 13 Oct 2017 09:16:31 +0800
From: Jia He 
To: Jerin Jacob , "Ananyev, Konstantin"
  
Cc: Olivier MATZ , "dev@dpdk.org" ,
  "jia...@hxt-semitech.com" ,
  "jie2@hxt-semitech.com" ,
  "bing.z...@hxt-semitech.com" 
Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
  doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.3.0

Hi


On 10/13/2017 9:02 AM, Jia He Wrote:

Hi Jerin


On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu, 12 Oct 2017 17:05:50 +


[...]

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any
of the
arm/ppc target? arm64 will have performance impact with this change.

sorry, miss one important information
Our platform is an aarch64 server with 46 cpus.

Is this an OOO(Out of order execution) aarch64 CPU implementation?


If we reduced the involved cpu numbers, the bug occurred less frequently.

Yes, mb barrier impact the performance, but correctness is more important,
isn't it ;-)

Yes.


Maybe we can  find any other lightweight barrier here?

Yes, Regarding the lightweight barrier, arm64 has native support for acquire 
and release
semantics, which is exposed through gcc as architecture agnostic
functions.
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
http://preshing.com/20130922/acquire-and-release-fences/

Good to know,
1) How much overhead this patch in your platform? Just relative
numbers are enough
2) As a prototype, Is Changing to acquire and release schematics
reduces the overhead in your platform?


+1, can you try what ODP does in the link mentioned below?

Sure, pls see the result:
root@server:~/odp/test/linux-generic/ring# ./ring_main
HW time counter freq: 2000 hz

_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishm.c:880:_odp_ishm_reserve():No huge pages, fall back to normal 
pages. check: /proc/sys/vm/nr_hugepages.

_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
 PKTIO: initialized loop interface.
 PKTIO: initialized ipc interface.
 PKTIO: initialized socket mmap, use export 
ODP_PKTIO_DISABLE_SOCKET_MMAP=1 to disable.
 PKTIO: initialized socket mmsg,use export 
ODP_PKTIO_DISABLE_SOCKET_MMSG=1 to disable.

_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
    ODP API version: 1.15.0
    ODP implementation name:    "odp-linux"
    ODP implementation version: "odp-linux" 1.15.0-0 (v1.15.0) 1.15.0.0


 CUnit - A unit testing framework for C - Version 2.1-3
 http://cunit.sourceforge.net/


Suite: ring basic
  Test: ring_test_basic_create 
...pktio/ring.c:177:_ring_create():Requested size is invalid, must be 
power of 2,and do not exceed the size limit 268435455

_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
passed
  Test: ring_test_basic_burst ...passed
  Test: ring_test_basic_bulk ...passed
  Test: ring_test_basic_watermark 
...passed_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate 
memory


Suite: ring stress
  Test: ring_test_stress_1_1_producer_consumer ...passed
  Test: ring_test_stress_1_N_producer_consumer ...passed
  Test: ring_test_stress_N_1_producer_consumer ...passed
  Test: ring_test_stress_N_M_producer_consumer ...



Cheers,
Jia


Reference FreeBSD ring/DPDK style ring implementation through acquire
and release schematics
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

I will also spend on cycles on this.



Cheers,
Jia

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia



Konstantin

--
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-19 Thread Jia He




On 10/20/2017 4:02 AM, Ananyev, Konstantin Wrote:



-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Ananyev, Konstantin
Sent: Thursday, October 19, 2017 3:15 PM
To: Zhao, Bing ; Jia He ; Jerin Jacob 

Cc: Olivier MATZ ; dev@dpdk.org; 
jia...@hxt-semitech.com; jie2@hxt-semitech.com; bing.zhao@hxt-
semitech.com
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading 
when doing enqueue/dequeue


Hi,

On 2017/10/19 18:02, Ananyev, Konstantin wrote:

Hi Jia,


Hi


On 10/13/2017 9:02 AM, Jia He Wrote:

Hi Jerin


On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu, 12 Oct 2017 17:05:50 +


[...]

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review or do you saw this issue on any of
the
arm/ppc target? arm64 will have performance impact with this change.

sorry, miss one important information
Our platform is an aarch64 server with 46 cpus.
If we reduced the involved cpu numbers, the bug occurred less frequently.

Yes, mb barrier impact the performance, but correctness is more
important, isn't it ;-)
Maybe we can  find any other lightweight barrier here?

Cheers,
Jia

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)


So is it only reproducible with mbuf refcnt test?
Could it be reproduced with some 'pure' ring test
(no mempools/mbufs refcnt, etc.)?
The reason I am asking - in that test we also have mbuf refcnt updates
(that's what for that test was created) and we are doing some optimizations 
here too
to avoid excessive atomic updates.
BTW, if the problem is not reproducible without mbuf refcnt,
can I suggest to extend the test  with:
- add a check that enqueue() operation was successful
- walk through the pool and check/printf refcnt of each mbuf.
Hopefully that would give us some extra information what is going wrong here.
Konstantin



Currently, the issue is only found in this case here on the ARM
platform, not sure how it is going with the X86_64 platform

I understand that it is only reproducible on arm so far.
What I am asking - with dpdk is there any other way to reproduce it (on arm)
except then running mbuf_autotest?
Something really simple that not using mbuf/mempool etc?
Just do dequeue/enqueue from multiple threads and check data integrity at the 
end?
If not  - what makes you think that the problem is precisely in rte_ring code?
Why not in rte_mbuf let say?

Actually I reread your original mail and finally get your point.
If I understand you correctly the problem with read reordering here is that
after we read prot.tail but before we read cons.head
both cons.head and prod.tail might be updated,
but for us prod.tail change might be unnoticed.
As an example:
time 0 (cons.head == 0, prod.tail == 0):
prod_tail = r->prod.tail; /* due read reordering */
/* prod_tail == 0 */

  time 1 (cons.head ==5, prod.tail == 5):
*old_head = r->cons.head;
/* cons.head == 5 */
*entries = (prod_tail - *old_head);
/* *entries == (0 - 5) == 0xFFFB */

And we will move our cons.head forward, even though there are no filled entries 
in the ring.
Is that right?

Yes

As I side notice, I wonder why we have here:
*entries = (prod_tail - *old_head);
instead of:
*entries = r->size + prod_tail - *old_head;
?

Yes, I agree with you at this code line.
But reordering will still mess up things even after this change(I have 
tested, still the same as before)
I thought the *entries is a door to prevent consumer from moving forward 
too fast than the producer.
But in some cases, it is correct that prod_tail is smaller than 
*old_head due to  the cirular queue.

In other cases, it is incorrect because of read/read reordering.

AFAICT, the root cause here is the producer tail and cosumer head are 
dependant on each other.

Thus a memory barrier is neccessary.

Cheers,
Jia



Konstantin


. In another
mail of this thread, we've made a simple test based on this and captured
some information and I pasted there.(I pasted the patch there :-))

Are you talking about that one:
http://dpdk.org/dev/patchwork/patch/30405/
?
It still uses test/test/test_mbuf.c...,
but anyway I don't really understand how mbuf_autotest supposed
to work with these changes:
@@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
rte_ring_enqueue(refcnt_mbuf_ring, m);
  }
  }
-   rte_pktmbuf_free(m);
+   // rte_pktmbuf_free(m);
  }
@@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
  while (!rte_ring_empty(refcnt_mbuf_ring))
  ;

+   if (NULL != m) {
+   if (1 != rte_mbuf_refcnt_read(m))
+   printf("m ref is %u\n", rte_mbuf_re

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-23 Thread Jia He


Hi Jerin


On 10/20/2017 1:43 PM, Jerin Jacob Wrote:

-Original Message-



[...]

dependant on each other.
Thus a memory barrier is neccessary.

Yes. The barrier is necessary.
In fact, upstream freebsd fixed this issue for arm64. DPDK ring
implementation is derived from freebsd's buf_ring.h.
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166

I think, the only outstanding issue is, how to reduce the performance
impact for arm64. I believe using accurate/release semantics instead
of rte_smp_rmb() will reduce the performance overhead like similar ring 
implementations below,
freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
odp: 
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

Jia,
1) Can you verify the use of accurate/release semantics fixes the problem in 
your
platform? like use of atomic_load_acq* in the reference code.
2) If so, What is the overhead between accurate/release and plane smp_smb()
barriers. Based on that we need decide what path to take.
I've tested 3 cases.  The new 3rd case is to use the load_acquire 
barrier (half barrier) you mentioned

at above link.
The patch seems like:
@@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
    /* Reset n to the initial burst count */
    n = max;

-   *old_head = r->prod.head;
-   const uint32_t cons_tail = r->cons.tail;
+   *old_head = atomic_load_acq_32(&r->prod.head);
+   const uint32_t cons_tail = 
atomic_load_acq_32(&r->cons.tail);


@@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
    /* Restore n as it may change every loop */
    n = max;

-   *old_head = r->cons.head;
-   const uint32_t prod_tail = r->prod.tail;
+   *old_head = atomic_load_acq_32(&r->cons.head);
+   const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail)
    /* The subtraction is done between two unsigned 32bits 
value

 * (the result is always modulo 32 bits even if we have
 * cons_head > prod_tail). So 'entries' is always between 0
 * and size(ring)-1. */

The half barrier patch passed the fuctional test.

As for the performance comparision on *arm64*(the debug patch is at
http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see 
the test results

below:

[case 1] old codes, no barrier

 Performance counter stats for './test --no-huge -l 1-10':

 689275.001200  task-clock (msec) #    9.771 CPUs utilized
  6223  context-switches  #    0.009 K/sec
    10  cpu-migrations    #    0.000 K/sec
   653  page-faults   #    0.001 K/sec
 1721190914583  cycles    #    2.497 GHz
 3363238266430  instructions  #    1.95  insn per 
cycle

    branches
  27804740  branch-misses #    0.00% of all 
branches


  70.540618825 seconds time elapsed

[case 2] full barrier with rte_smp_rmb()

 Performance counter stats for './test --no-huge -l 1-10':

 582557.895850  task-clock (msec) #    9.752 CPUs utilized
  5242  context-switches  #    0.009 K/sec
    10  cpu-migrations    #    0.000 K/sec
   665  page-faults   #    0.001 K/sec
 1454360730055  cycles    #    2.497 GHz
  587197839907  instructions  #    0.40  insn per 
cycle

    branches
  27799687  branch-misses #    0.00% of all 
branches


  59.735582356 seconds time elapse

[case 1] half barrier with load_acquire

 Performance counter stats for './test --no-huge -l 1-10':

 660758.877050  task-clock (msec) #    9.764 CPUs utilized
  5982  context-switches  #    0.009 K/sec
    11  cpu-migrations    #    0.000 K/sec
   657  page-faults   #    0.001 K/sec
 1649875318044  cycles    #    2.497 GHz
  591583257765  instructions  #    0.36  insn per 
cycle

    branches
  27994903  branch-misses #    0.00% of all 
branches


  67.672855107 seconds time elapsed

Please see the context-switches in the perf results
test result  sorted by time is:
full barrier < half barrier < no barrier

AFAICT, in this case ,the cpu reordering will add the possibility for 
context switching and

increase the running time.

Any ideas?

Cheers,
Jia



Note:
This issue wont come in all the arm64 implementation. it comes on arm64
implementation with OOO(out of order) implementations.



Cheers,
Jia

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-23 Thread Jia He


Hi Jerin


On 10/23/2017 6:06 PM, Jerin Jacob Wrote:

-Original Message-

Date: Mon, 23 Oct 2017 16:49:01 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
  , Olivier MATZ ,
  "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "Richardson, Bruce"
  
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
  loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0

Hi Jerin


On 10/20/2017 1:43 PM, Jerin Jacob Wrote:

-Original Message-

[...]

dependant on each other.
Thus a memory barrier is neccessary.

Yes. The barrier is necessary.
In fact, upstream freebsd fixed this issue for arm64. DPDK ring
implementation is derived from freebsd's buf_ring.h.
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166

I think, the only outstanding issue is, how to reduce the performance
impact for arm64. I believe using accurate/release semantics instead
of rte_smp_rmb() will reduce the performance overhead like similar ring 
implementations below,
freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
odp: 
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

Jia,
1) Can you verify the use of accurate/release semantics fixes the problem in 
your
platform? like use of atomic_load_acq* in the reference code.
2) If so, What is the overhead between accurate/release and plane smp_smb()
barriers. Based on that we need decide what path to take.

I've tested 3 cases.  The new 3rd case is to use the load_acquire barrier
(half barrier) you mentioned
at above link.
The patch seems like:
@@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
     /* Reset n to the initial burst count */
     n = max;

-   *old_head = r->prod.head;
-   const uint32_t cons_tail = r->cons.tail;
+   *old_head = atomic_load_acq_32(&r->prod.head);
+   const uint32_t cons_tail =
atomic_load_acq_32(&r->cons.tail);

@@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
     /* Restore n as it may change every loop */
     n = max;

-   *old_head = r->cons.head;
-   const uint32_t prod_tail = r->prod.tail;
+   *old_head = atomic_load_acq_32(&r->cons.head);
+   const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail)
     /* The subtraction is done between two unsigned 32bits value
  * (the result is always modulo 32 bits even if we have
  * cons_head > prod_tail). So 'entries' is always between 0
  * and size(ring)-1. */

The half barrier patch passed the fuctional test.

As for the performance comparision on *arm64*(the debug patch is at
http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the
test results
below:

[case 1] old codes, no barrier

  Performance counter stats for './test --no-huge -l 1-10':

  689275.001200  task-clock (msec) #    9.771 CPUs utilized
   6223  context-switches  #    0.009 K/sec
     10  cpu-migrations    #    0.000 K/sec
    653  page-faults   #    0.001 K/sec
  1721190914583  cycles    #    2.497 GHz
  3363238266430  instructions  #    1.95  insn per cycle
     branches
   27804740  branch-misses #    0.00% of all branches

   70.540618825 seconds time elapsed

[case 2] full barrier with rte_smp_rmb()

  Performance counter stats for './test --no-huge -l 1-10':

  582557.895850  task-clock (msec) #    9.752 CPUs utilized
   5242  context-switches  #    0.009 K/sec
     10  cpu-migrations    #    0.000 K/sec
    665  page-faults   #    0.001 K/sec
  1454360730055  cycles    #    2.497 GHz
   587197839907  instructions  #    0.40  insn per cycle
     branches
   27799687  branch-misses #    0.00% of all branches

   59.735582356 seconds time elapse

[case 1] half barrier with load_acquire

  Performance counter stats for './test --no-huge -l 1-10':

  660758.877050  task-clock (msec) #    9.764 CPUs utilized
   5982  context-switches  #    0.009 K/sec
     11  cpu-migrations    #    0.000 K/sec
    657  page-faults

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-25 Thread Jia He


Hi Jerin


On 10/25/2017 9:26 PM, Jerin Jacob Wrote:

-Original Message-

Date: Tue, 24 Oct 2017 10:04:26 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
  , Olivier MATZ ,
  "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "Richardson, Bruce"
  
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
  loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0

Hi Jerin

Hi Jia,



example:
./build/app/test -c 0xff -n 4

ring_perf_autotest

Seem in our arm64 server, the ring_perf_autotest will be finished in a few
seconds:

Yes. It just need a few seconds.


Anything wrong about configuration or environment setup?

By default, arm64+dpdk will be using el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
follow the below scheme to get accurate cycle measurement scheme:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter

Thank you for the suggestions.
But I tried your provided ko module to enable the accurate cycle 
measurement in user space, the

test result is as below:

root@nfv-demo01:~/dpdk/build/build/test/test# lsmod |grep pmu
pmu_el0_cycle_counter   262144  0
[old codes, without any patches]

RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 0
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.00
MC empty dequeue: 0.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00
Test OK

[with full rte_smp_rmb barrier patch]
==
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 0
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.00
MC empty dequeue: 0.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00
Test OK
RTE>>

No difference，all time is 0 ?

If I rmmod pmu_el0_cycle_counter and revise the ./build/.config to 
comment the config line

#CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU=y

Then the time is bigger than 0


root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
EAL: Detected 44 lcore(s)
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>per_lcore_autotest
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 2
SP/SC burst enq/dequeue (size: 8): 0

If you follow the above link, The value '0' will be replaced with more meaning 
full data.


MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.02
MC empty dequeue: 0.04

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.31
SP/SC bulk enq/dequeue (size: 32): 0.05
MP/MC bulk enq/dequeue (size: 32): 0.09

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.39
SP/SC bulk enq/dequeue (size: 32): 0.04
MP/MC bulk enq/dequeue (size: 32): 0.12

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 0.37
MP/MC bulk enq/dequeue (size: 8): 0.92
SP/SC bulk enq/dequeue (size: 32): 0.12
MP/MC bulk enq/dequeue (size: 32): 0.26
Test OK
RTE>>

Cheers,
Jia

By default, arm64+dpdk will be using el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
follow the below scheme to get accurate cycle measurement scheme:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter


--
Cheers,
Jia

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-30 Thread Jia He


Hi Jerin

Do you think  next step whether I need to implement the load_acquire 
half barrier as per freebsd


or find any other performance test case to compare the performance impact?
Thanks for any suggestions.

Cheers,
Jia

On 10/25/2017 9:26 PM, Jerin Jacob Wrote:

-Original Message-

Date: Tue, 24 Oct 2017 10:04:26 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
  , Olivier MATZ ,
  "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "Richardson, Bruce"
  
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
  loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0

Hi Jerin

Hi Jia,



example:
./build/app/test -c 0xff -n 4

ring_perf_autotest

Seem in our arm64 server, the ring_perf_autotest will be finished in a few
seconds:

Yes. It just need a few seconds.


Anything wrong about configuration or environment setup?

By default, arm64+dpdk will be using el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
follow the below scheme to get accurate cycle measurement scheme:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter


root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
EAL: Detected 44 lcore(s)
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>per_lcore_autotest
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 2
SP/SC burst enq/dequeue (size: 8): 0

If you follow the above link, The value '0' will be replaced with more meaning 
full data.


MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.02
MC empty dequeue: 0.04

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.31
SP/SC bulk enq/dequeue (size: 32): 0.05
MP/MC bulk enq/dequeue (size: 32): 0.09

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.39
SP/SC bulk enq/dequeue (size: 32): 0.04
MP/MC bulk enq/dequeue (size: 32): 0.12

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 0.37
MP/MC bulk enq/dequeue (size: 8): 0.92
SP/SC bulk enq/dequeue (size: 32): 0.12
MP/MC bulk enq/dequeue (size: 32): 0.26
Test OK
RTE>>

Cheers,
Jia

By default, arm64+dpdk will be using el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
follow the below scheme to get accurate cycle measurement scheme:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter


--
Cheers,
Jia

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-31 Thread Jia He


Hi Jerin

Thanks for your suggestions. I will try to use config macro to let it be 
chosen by user.


I need to point out one possible issue in your load_acq/store_rel patch

at 
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch


@@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int 
is_sc,

     /* Restore n as it may change every loop */
     n = max;

+#if 0
     *old_head = r->cons.head;
     const uint32_t prod_tail = r->prod.tail;
+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);   
                   --[1]
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, 
__ATOMIC_ACQUIRE);   --[2]

+#endif

line [1] __ATOMIC_RELAXED is not enough for this case(tested in our 
ARM64 server).


line [2] __ATOMIC_ACQUIRE guarantee the 2nd load will not be reorded 
before the 1st load, but will not


guarantee the 1st load will not be reordered after the 2nd load. Please 
also refer to your mentioned freebsd implementation. They use 
__ATOMIC_ACQUIRE at line [1].


Should it be like instead?

+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, 
__ATOMIC_ACQUIRE);



Cheers,

Jia



On 10/31/2017 7:14 PM, Jerin Jacob Wrote:

-Original Message-

Date: Tue, 31 Oct 2017 10:55:15 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
  , Olivier MATZ ,
  "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "Richardson, Bruce"
  
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
  loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0

Hi Jerin

Hi Jia,


Do you think  next step whether I need to implement the load_acquire half
barrier as per freebsd

I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics
and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine)
and Platform B: arm64 OOO machine)

smp_rmb() performs better in Platform A:
acquire/release semantics perform better in platform B:

Here is the patch:
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

In terms of next step:
- I am not sure the cost associated with acquire/release semantics on x86 or 
ppc.
IMO, We need to have both options under conditional compilation
flags and let the target platform choose the best one.

Thoughts?

Here is the performance numbers:
- Both platforms are running at different frequency, So absolute numbers does 
not
   matter, Just check the relative numbers.

Platform A: Performance numbers:

no patch(Non arm64 OOO machine)
---

SP/SC single enq/dequeue: 40
MP/MC single enq/dequeue: 282
SP/SC burst enq/dequeue (size: 8): 11
MP/MC burst enq/dequeue (size: 8): 42
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 8.01
MC empty dequeue: 11.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.30
MP/MC bulk enq/dequeue (size: 8): 42.85
SP/SC bulk enq/dequeue (size: 32): 8.25
MP/MC bulk enq/dequeue (size: 32): 16.46

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.62
MP/MC bulk enq/dequeue (size: 8): 56.30
SP/SC bulk enq/dequeue (size: 32): 10.94
MP/MC bulk enq/dequeue (size: 32): 18.66
Test OK

# smp_rmb() patch((Non OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-

SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 291
SP/SC burst enq/dequeue (size: 8): 12
MP/MC burst enq/dequeue (size: 8): 44
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 15.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.60
MP/MC bulk enq/dequeue (size: 8): 44.32
SP/SC bulk enq/dequeue (size: 32): 8.60
MP/MC bulk enq/dequeue (size: 32): 16.50

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.95
MP/MC bulk enq/dequeue (size: 8): 56.90
SP/SC bulk enq/dequeue (size: 32): 10.90
MP/MC bulk enq/dequeue (size: 32): 18.78
Test OK
RTE>>

# c11 memory model patch((Non OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
-
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 197
MP/MC single enq/dequeue: 328
SP/SC burst enq/dequeue (size: 8): 31
MP/MC burst enq/dequeue (size: 8): 50
SP/SC burst enq/dequeue (size:

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-10-31 Thread Jia He

Hi Jerin

On 10/31/2017 7:14 PM, Jerin Jacob Wrote:

-Original Message-

Date: Tue, 31 Oct 2017 10:55:15 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
  , Olivier MATZ ,
  "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "Richardson, Bruce"

Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
  loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0

Hi Jerin

Hi Jia,

Do you think  next step whether I need to implement the load_acquire half
barrier as per freebsd

I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics
and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine)
and Platform B: arm64 OOO machine)
Can you elaborate anything about your Non arm64 OOO machine? As I know, 
all arm64 server is strong

memory order. Am I missed anything?

--
Cheers,
Jia

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-11-01 Thread Jia He


Hi Jerin


On 11/2/2017 3:04 AM, Jerin Jacob Wrote:

Date: Thu, 2 Nov 2017 00:27:46 +0530
From: Jerin Jacob 
To: Jia He 
Cc: "Ananyev, Konstantin" ,
"Zhao, Bing" ,
Olivier MATZ ,
"dev@dpdk.org" ,
"jia...@hxt-semitech.com" ,
"jie2@hxt-semitech.com" ,
"bing.z...@hxt-semitech.com" ,
"Richardson, Bruce" ,
jianbo@arm.com, hemant.agra...@nxp.com
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading
  when doing enqueue/dequeue
Message-ID: <20171101185723.GA18759@jerin>
References: 
<2601191342ceee43887bde71ab9772585faab...@irsmsx103.ger.corp.intel.com>
  <3e580cd7-2854-d855-be9c-7c4ce06e3...@gmail.com>
  <20171020054319.GA4249@jerin>
  
  <20171023100617.GA17957@jerin>
  
  <20171025132642.GA13977@jerin>
  
  <2017103433.GA21742@jerin>
  <69adfb00-4582-b362-0540-d1d9d6bcf...@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <69adfb00-4582-b362-0540-d1d9d6bcf...@gmail.com>
User-Agent: Mutt/1.9.1 (2017-09-22)

-Original Message-

Date: Wed, 1 Nov 2017 10:53:12 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
  , Olivier MATZ ,
  "dev@dpdk.org" , "jia...@hxt-semitech.com"
  , "jie2@hxt-semitech.com"
  , "bing.z...@hxt-semitech.com"
  , "Richardson, Bruce"
  , jianbo@arm.com, hemant.agra...@nxp.com
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
  loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0

Hi Jerin

Thanks for your suggestions. I will try to use config macro to let it be
chosen by user.

It is better, if we avoid #ifdef's in the common code. I think, you can
do the scheme like examples/l3fwd/l3fwd_em_hlm_neon.h. Where,
the common code will have all generic routine, difference will be
moved to a separate files to reduce #ifdef clutter.


I need to point out one possible issue in your load_acq/store_rel patch

at 
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

@@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
is_sc,
      /* Restore n as it may change every loop */
      n = max;

+#if 0
      *old_head = r->cons.head;
      const uint32_t prod_tail = r->prod.tail;
+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
                --[1]
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
__ATOMIC_ACQUIRE);   --[2]
+#endif

line [1] __ATOMIC_RELAXED is not enough for this case(tested in our ARM64
server).

line [2] __ATOMIC_ACQUIRE guarantee the 2nd load will not be reorded before
the 1st load, but will not

guarantee the 1st load will not be reordered after the 2nd load. Please also

For me it looks same. [1] can not cross [2] is same as [2] cannot cross
[1], if [1] are [2] at back to back. No ?

No,
IIUC, load_acquire(a) is equal to the pseudo codes:
load(a)
one-direction-barrier()

instead of
one-direction-barrier()
load(a)

Thus, in below cases, load(a) and load(b) can still be reordered, this 
is not the semantic violation of

the load_acquire:
load(a)
load_acquire(b)

IIUC, the orginal implementation in 
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 
is not optimal.

I tested the changes as follow and it works fine:

+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
         const uint32_t prod_tail = r->prod.tail;

i.e.
load_acquire(a)
load(b)


Cheers,
Jia

[dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

2017-11-02 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
As for the possible race condition, please refer to [1].

Furthermore, there are 2 options as suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
default it is n;

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, please refer to [3].

Already fuctionally tested on the machines as follows:
on X86(passed the compilation)
on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n

[1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
[2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[3] http://dpdk.org/ml/archives/dev/2017-October/080861.html

---
Changelog:
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
Signed-off-by: jia...@hxt-semitech.com
Suggested-by: jerin.ja...@caviumnetworks.com
---
 lib/librte_ring/Makefile   |  4 +++-
 lib/librte_ring/rte_ring.h | 38 --
 lib/librte_ring/rte_ring_arm64.h   | 48 ++
 lib/librte_ring/rte_ring_generic.h | 45 +++
 4 files changed, 127 insertions(+), 8 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_arm64.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..fa57a86 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,8 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_arm64.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..943b1f9 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -103,6 +103,18 @@ extern "C" {
 #include 
 #include 
 
+/* In those strong memory models (e.g. x86), there is no need to add rmb()
+ * between load and load.
+ * In those weak models(powerpc/arm), there are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier
+ * It depends on performance test results. */
+#ifdef RTE_ARCH_ARM64
+#include "rte_ring_arm64.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
 #define RTE_TAILQ_RING_NAME "RTE_RING"
 
 enum rte_ring_queue_behavior {
@@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, 
uint32_t new_val,
while (unlikely(ht->tail != old_val))
rte_pause();
 
-   ht->tail = new_val;
+   arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
 /**
@@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
/* Reset n to the initial burst count */
n = max;
 
-   *old_head = r->prod.head;
+   *old_head = arch_rte_atomic_load(&r->prod.head,
+   __ATOMIC_ACQUIRE);
const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
if (is_sp)
r->prod.head = *new_head, success = 1;
else
-   success = rte_atomic32_cmpset(&r->prod.head,
-   *old_head, *new_head);
+   success = arch_rte_atomic32_cmpset(&r->prod.head,
+   old_head, *new_head,
+   0, __ATOMIC_ACQUIRE,
+   __ATOMIC_RELAXED);
} while (unlikely(success == 0));
return n;
 }
@@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const 
*obj_table,
goto end;
 
ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
+
+#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
rte_smp_wmb();
+#endif
 
update_tail(&r->prod, prod_head, prod_next, is_sp);
 end:
@@ -516,7 +534,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
/* Restore n as it may change every loop */
n = max;
 
-   *old_head = r->cons.head;
+   *old_head = arch_rte_atomic_load(&r->cons

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-11-02 Thread Jia He



Hi， Jerin
please see my performance test below
On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
[...]

Should it be like instead?

+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
__ATOMIC_ACQUIRE);
It would be nice to see how much overhead it gives.ie back to back
__ATOMIC_ACQUIRE.
I can NOT test ring_perf_autotest in our server because of the something 
wrong in PMU counter.
All the return value of rte_rdtsc is 0 with and without your provided ko 
module. I am still

investigating the reason.

 I ever tested the difference with my debug patch, the difference is 
minor, less than +-1%


--
Cheers,
Jia





Cheers,

Jia



On 10/31/2017 7:14 PM, Jerin Jacob Wrote:

-Original Message-

Date: Tue, 31 Oct 2017 10:55:15 +0800
From: Jia He 
To: Jerin Jacob 
Cc: "Ananyev, Konstantin" , "Zhao, Bing"
   , Olivier MATZ ,
   "dev@dpdk.org" , "jia...@hxt-semitech.com"
   , "jie2@hxt-semitech.com"
   , "bing.z...@hxt-semitech.com"
   , "Richardson, Bruce"
   
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
   loading when doing enqueue/dequeue
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
   Thunderbird/52.4.0

Hi Jerin

Hi Jia,


Do you think  next step whether I need to implement the load_acquire half
barrier as per freebsd

I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics
and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine)
and Platform B: arm64 OOO machine)

smp_rmb() performs better in Platform A:
acquire/release semantics perform better in platform B:

Here is the patch:
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

In terms of next step:
- I am not sure the cost associated with acquire/release semantics on x86 or 
ppc.
IMO, We need to have both options under conditional compilation
flags and let the target platform choose the best one.

Thoughts?

Here is the performance numbers:
- Both platforms are running at different frequency, So absolute numbers does 
not
matter, Just check the relative numbers.

Platform A: Performance numbers:

no patch(Non arm64 OOO machine)
---

SP/SC single enq/dequeue: 40
MP/MC single enq/dequeue: 282
SP/SC burst enq/dequeue (size: 8): 11
MP/MC burst enq/dequeue (size: 8): 42
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 8.01
MC empty dequeue: 11.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.30
MP/MC bulk enq/dequeue (size: 8): 42.85
SP/SC bulk enq/dequeue (size: 32): 8.25
MP/MC bulk enq/dequeue (size: 32): 16.46

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.62
MP/MC bulk enq/dequeue (size: 8): 56.30
SP/SC bulk enq/dequeue (size: 32): 10.94
MP/MC bulk enq/dequeue (size: 32): 18.66
Test OK

# smp_rmb() patch((Non OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-

SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 291
SP/SC burst enq/dequeue (size: 8): 12
MP/MC burst enq/dequeue (size: 8): 44
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 15.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.60
MP/MC bulk enq/dequeue (size: 8): 44.32
SP/SC bulk enq/dequeue (size: 32): 8.60
MP/MC bulk enq/dequeue (size: 32): 16.50

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.95
MP/MC bulk enq/dequeue (size: 8): 56.90
SP/SC bulk enq/dequeue (size: 32): 10.90
MP/MC bulk enq/dequeue (size: 32): 18.78
Test OK
RTE>>

# c11 memory model patch((Non OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
-
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 197
MP/MC single enq/dequeue: 328
SP/SC burst enq/dequeue (size: 8): 31
MP/MC burst enq/dequeue (size: 8): 50
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 18

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 18.02

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 30.95
MP/MC bulk enq/dequeue (size: 8): 50.30
SP/SC bulk enq/dequeue (size: 32): 13.27
MP/MC bulk enq/dequeue (size: 32): 18.11

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 43.38
MP/MC bulk enq/dequeue (size: 8): 64.42
SP/SC bulk enq/dequeue (size: 32): 16.71
MP/MC bulk enq/dequeue (size: 32): 22.21


Platform B: Performance numbers:
==

Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

2017-11-02 Thread Jia He


Hi Ananyev


On 11/2/2017 9:26 PM, Ananyev, Konstantin Wrote:

Hi Jia,


-Original Message-
From: Jia He [mailto:hejia...@gmail.com]
Sent: Thursday, November 2, 2017 8:44 AM
To: jerin.ja...@caviumnetworks.com; dev@dpdk.org; olivier.m...@6wind.com
Cc: Ananyev, Konstantin ; Richardson, Bruce 
; jianbo@arm.com;
hemant.agra...@nxp.com; Jia He ; jie2@hxt-semitech.com; 
bing.z...@hxt-semitech.com; jia.he@hxt-
semitech.com
Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
As for the possible race condition, please refer to [1].

Furthermore, there are 2 options as suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
default it is n;

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, please refer to [3].

Already fuctionally tested on the machines as follows:
on X86(passed the compilation)
on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n

[1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
[2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[3] http://dpdk.org/ml/archives/dev/2017-October/080861.html

---
Changelog:
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
Signed-off-by: jia...@hxt-semitech.com
Suggested-by: jerin.ja...@caviumnetworks.com
---
  lib/librte_ring/Makefile   |  4 +++-
  lib/librte_ring/rte_ring.h | 38 --
  lib/librte_ring/rte_ring_arm64.h   | 48 ++
  lib/librte_ring/rte_ring_generic.h | 45 +++
  4 files changed, 127 insertions(+), 8 deletions(-)
  create mode 100644 lib/librte_ring/rte_ring_arm64.h
  create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..fa57a86 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,8 @@ LIBABIVER := 1
  SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c

  # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_arm64.h \
+   rte_ring_generic.h

  include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..943b1f9 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -103,6 +103,18 @@ extern "C" {
  #include 
  #include 

+/* In those strong memory models (e.g. x86), there is no need to add rmb()
+ * between load and load.
+ * In those weak models(powerpc/arm), there are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier
+ * It depends on performance test results. */
+#ifdef RTE_ARCH_ARM64
+#include "rte_ring_arm64.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
  #define RTE_TAILQ_RING_NAME "RTE_RING"

  enum rte_ring_queue_behavior {
@@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, 
uint32_t new_val,
while (unlikely(ht->tail != old_val))
rte_pause();

-   ht->tail = new_val;
+   arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
  }

  /**
@@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
/* Reset n to the initial burst count */
n = max;

-   *old_head = r->prod.head;
+   *old_head = arch_rte_atomic_load(&r->prod.head,
+   __ATOMIC_ACQUIRE);
const uint32_t cons_tail = r->cons.tail;

The code starts to look a bit messy with all these arch specific macros...
So I wonder wouldn't it be more cleaner to:

1. move existing __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
into rte_ring_generic.h
2. Add rte_smp_rmb into generic 
__rte_ring_move_prod_head/__rte_ring_move_cons_head
(as was in v1 of your patch).
3. Introduce ARM specific versions of 
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
in the rte_ring_arm64.h

That way we will keep ogneric code simple and clean, while still allowing arch 
specific optimizations.

Thanks for your review.
But as per your suggestion, there will be at least 2 copies of 
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail.
Thus, if there are any bugs in the future, both 2 copies have to be 
changed, right?



/*

Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

2017-11-02 Thread Jia He


Hi Jerin


On 11/3/2017 1:23 AM, Jerin Jacob Wrote:

-Original Message-

Date: Thu,  2 Nov 2017 08:43:30 +
From: Jia He 
To: jerin.ja...@caviumnetworks.com, dev@dpdk.org, olivier.m...@6wind.com
Cc: konstantin.anan...@intel.com, bruce.richard...@intel.com,
  jianbo@arm.com, hemant.agra...@nxp.com, Jia He ,
  jie2@hxt-semitech.com, bing.z...@hxt-semitech.com,
  jia...@hxt-semitech.com
Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
X-Mailer: git-send-email 2.7.4

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
As for the possible race condition, please refer to [1].

Hi Jia,

In addition to Konstantin comments. Please find below some review
comments.

Furthermore, there are 2 options as suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by

I think, The better name would be CONFIG_RTE_RING_USE_C11_MEM_MODEL
or something like that.

Ok, but how to distinguish following 2 options?

CONFIG_RTE_RING_USE_C11_MEM_MODEL doesn't seem to be enough

1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [2]).


default it is n;

---
Changelog:
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
Signed-off-by: jia...@hxt-semitech.com
Suggested-by: jerin.ja...@caviumnetworks.com
---
  lib/librte_ring/Makefile   |  4 +++-
  lib/librte_ring/rte_ring.h | 38 --
  lib/librte_ring/rte_ring_arm64.h   | 48 ++
  lib/librte_ring/rte_ring_generic.h | 45 +++
  4 files changed, 127 insertions(+), 8 deletions(-)
  create mode 100644 lib/librte_ring/rte_ring_arm64.h
  create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..fa57a86 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,8 @@ LIBABIVER := 1
  SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
  
  # install includes

-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_arm64.h \

It is really not specific to arm64. We could rename it to rte_ring_c11_mem.h or
something like that to reflect the implementation based on c11 memory
model.



+   rte_ring_generic.h
  
  include $(RTE_SDK)/mk/rte.lib.mk

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..943b1f9 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -103,6 +103,18 @@ extern "C" {
  #include 
  #include 
  
+/* In those strong memory models (e.g. x86), there is no need to add rmb()

+ * between load and load.
+ * In those weak models(powerpc/arm), there are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier
+ * It depends on performance test results. */
+#ifdef RTE_ARCH_ARM64

s/RTE_ARCH_ARM64/RTE_RING_USE_C11_MEM_MODEL and update the generic arm64 config.
By that way it can used by another architecture like ppc if they choose to do 
so.



+#include "rte_ring_arm64.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
  #define RTE_TAILQ_RING_NAME "RTE_RING"
  
  enum rte_ring_queue_behavior {

@@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, 
uint32_t new_val,
while (unlikely(ht->tail != old_val))
rte_pause();
  
-	ht->tail = new_val;

+   arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
  }
  
  /**

@@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
/* Reset n to the initial burst count */
n = max;
  
-		*old_head = r->prod.head;

+   *old_head = arch_rte_atomic_load(&r->prod.head,
+   __ATOMIC_ACQUIRE);

Same as Konstantin comments, i.e move to this function to c11 memory model
header file


const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
if (is_sp)
r->prod.head = *new_head, success = 1;
else
-   success = rte_atomic32_cmpset(&r->prod.head,
-   *old_head, *new_head);
+   success = arch_rte_atomic32_cmpset(&r->prod.head,
+   old_head, *new_head,
+

Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

2017-11-02 Thread Jia He


Hi Jerin


On 11/2/2017 4:57 PM, Jia He Wrote:


Hi， Jerin
please see my performance test below
On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
[...]

Should it be like instead?

+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
__ATOMIC_ACQUIRE);
It would be nice to see how much overhead it gives.ie back to back
__ATOMIC_ACQUIRE.
I can NOT test ring_perf_autotest in our server because of the 
something wrong in PMU counter.
All the return value of rte_rdtsc is 0 with and without your provided 
ko module. I am still

investigating the reason.



Hi Jerin

As for the root cause of rte_rdtsc issue, it might be due to the pmu 
counter frequency is too low


in our arm64 server("Amberwing" from qualcom)

[586990.057779] arch_timer_get_cntfrq()=2000

Only 20MHz instead of 100M/200MHz, and CNTFRQ_EL0 is not even writable 
in kernel space.


Maybe the code in ring_perf_autotest needs to be changed?

e.g.

    printf("SC empty dequeue: %.2F\n",
            (double)(sc_end-sc_start) / iterations);
    printf("MC empty dequeue: %.2F\n",
            (double)(mc_end-mc_start) / iterations);

Otherwise it is always 0 if the time difference divides by iterations.


--
Cheers,
Jia

Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

2017-11-05 Thread Jia He


Hi Jerin


On 11/3/2017 8:56 PM, Jerin Jacob Wrote:

-Original Message-



[...]

g like that.
Ok, but how to distinguish following 2 options?

No clearly understood this question. For arm64 case, you can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y in config/defconfig_arm64-armv8a-*

Sorry for my unclear expressions.
I mean there should be one additional config macro besides 
CONFIG_RTE_RING_USE_C11_MEM_MODEL

for users to choose?

i.e.
 - On X86:CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
include rte_ring_generic.h, no changes
- On arm64，CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
include rte_ring_c11_mem.h by default.
In rte_ring_c11_mem.h, implement new version of 
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail


Then, how to distinguish the option of using rte_smp_rmb() or 
__atomic_load/store_n()?


Thanks for the clarification.

--
Cheers,
Jia

Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

2017-11-07 Thread Jia He




On 11/7/2017 12:36 PM, Jerin Jacob Wrote:

-Original Message-

On option could be to change the prototype of update_tail() and make
compiler accommodate it for zero cost for arm64(Which I think, it it the
case. But you can check the generated instructions)
If not, move, __rte_ring_do_dequeue() and __rte_ring_do_enqueue() instead of
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail()


➜ [master][dpdk.org] $ git diff
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7b4..b32648825 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -358,8 +358,12 @@ void rte_ring_dump(FILE *f, const struct rte_ring
*r);
  
  static __rte_always_inline void

  update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t
new_val,
-   uint32_t single)
+   uint32_t single, const uint32_t enqueue)
  {
+   if (enqueue)
+   rte_smp_wmb();
+   else
+   rte_smp_rmb();
 /*
  * If there are other enqueues/dequeues in progress that
  * preceded us,
  * we need to wait for them to complete
@@ -470,9 +474,8 @@ __rte_ring_do_enqueue(struct rte_ring *r, void *
const *obj_table,
 goto end;
  
 ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);

-   rte_smp_wmb();
  
-   update_tail(&r->prod, prod_head, prod_next, is_sp);

+   update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
  end:
 if (free_space != NULL)
 *free_space = free_entries - n;
@@ -575,9 +578,8 @@ __rte_ring_do_dequeue(struct rte_ring *r, void
**obj_table,
 goto end;
  
 DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);

-   rte_smp_rmb();
  
-   update_tail(&r->cons, cons_head, cons_next, is_sc);

+   update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
  
  end:

 if (available != NULL)




Hi Jerin, yes I knew this suggestion in update_tail.
But what I mean is the rte_smp_rmb() in __rte_ring_move_cons_head and 
__rte_ring_move_pros_head:

[option 1]
+        *old_head = r->cons.head;
+        rte_smp_rmb();
+        const uint32_t prod_tail = r->prod.tail;

[option 2]
+        *old_head = __atomic_load_n(&r->cons.head,
+                    __ATOMIC_ACQUIRE);
+        *old_head = r->cons.head;

ie.I wonder what is the suitable new config name to distinguish the 
above 2 options?

Thanks for the patience :-)

see my drafted patch below, the marcro "PREFER":
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h

new file mode 100644
index 000..22fe887
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,305 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t 
new_val,

+        uint32_t single, uint32_t enqueue)
+{
+    /* Don't need wmb/rmb when we prefer to use load_acquire/
+     * store_release barrier */
+#ifndef PREFER
+    if (enqueue)
+        rte_smp_wmb();
+    else
+        rte_smp_rmb();
+#endif
+
+    /*
+     * If there are other enqueues/dequeues in progress that preceded us,
+     * we need to wait for them to complete
+     */
+    if (!single)
+        w

Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing

2017-11-07 Thread Jia He


Hi Jerin

Thank you,

I mistakenly think x86 doen't need rte_smp_rmb().

Since rte_smp_rmb() only impact x86's compiler optimization, I will 
simplify the codes as your suggestions


Cheers,

Jia


On 11/7/2017 5:57 PM, Jerin Jacob Wrote:

-Original Message-

Date: Tue, 7 Nov 2017 16:34:30 +0800
From: Jia He 
To: Jerin Jacob 
Cc: dev@dpdk.org, olivier.m...@6wind.com, konstantin.anan...@intel.com,
  bruce.richard...@intel.com, jianbo@arm.com, hemant.agra...@nxp.com,
  jie2@hxt-semitech.com, bing.z...@hxt-semitech.com,
  jia...@hxt-semitech.com
Subject: Re: [PATCH v2] ring: guarantee ordering of cons/prod loading when
  doing
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.4.0



On 11/7/2017 12:36 PM, Jerin Jacob Wrote:

-Original Message-

On option could be to change the prototype of update_tail() and make
compiler accommodate it for zero cost for arm64(Which I think, it it the
case. But you can check the generated instructions)
If not, move, __rte_ring_do_dequeue() and __rte_ring_do_enqueue() instead of
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail()


➜ [master][dpdk.org] $ git diff
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7b4..b32648825 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -358,8 +358,12 @@ void rte_ring_dump(FILE *f, const struct rte_ring
*r);
   static __rte_always_inline void
   update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t
new_val,
-   uint32_t single)
+   uint32_t single, const uint32_t enqueue)
   {
+   if (enqueue)
+   rte_smp_wmb();
+   else
+   rte_smp_rmb();
  /*
   * If there are other enqueues/dequeues in progress that
   * preceded us,
   * we need to wait for them to complete
@@ -470,9 +474,8 @@ __rte_ring_do_enqueue(struct rte_ring *r, void *
const *obj_table,
  goto end;
  ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
-   rte_smp_wmb();
-   update_tail(&r->prod, prod_head, prod_next, is_sp);
+   update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
   end:
  if (free_space != NULL)
  *free_space = free_entries - n;
@@ -575,9 +578,8 @@ __rte_ring_do_dequeue(struct rte_ring *r, void
**obj_table,
  goto end;
  DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
-   rte_smp_rmb();
-   update_tail(&r->cons, cons_head, cons_next, is_sc);
+   update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
   end:
  if (available != NULL)




Hi Jerin, yes I knew this suggestion in update_tail.
But what I mean is the rte_smp_rmb() in __rte_ring_move_cons_head and
__rte_ring_move_pros_head:
[option 1]
+        *old_head = r->cons.head;
+        rte_smp_rmb();
+        const uint32_t prod_tail = r->prod.tail;

[option 2]
+        *old_head = __atomic_load_n(&r->cons.head,
+                    __ATOMIC_ACQUIRE);
+        *old_head = r->cons.head;

ie.I wonder what is the suitable new config name to distinguish the above 2
options?

Why?
If you fix the generic version with rte_smp_rmb() then we just need only
one config to differentiate between c11 vs generic. See comments below,


Thanks for the patience :-)

see my drafted patch below, the marcro "PREFER":
+ */
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t
new_val,
+        uint32_t single, uint32_t enqueue)
+{
+    /* Don't need wmb/rmb when we prefer to use load_acquire/
+     * store_release barrier */
+#ifndef PREFER
+    if (enqueue)
+        rte_smp_wmb();
+    else
+        rte_smp_rmb();
+#endif

You can remove PREFER and let the "generic" version has this. For x86,
rte_smp_?mb() it will be NOOP. So no issue.


+
+    /*
+     * If there are other enqueues/dequeues in progress that preceded us,
+     * we need to wait for them to complete
+     */
+    if (!single)
+        while (unlikely(ht->tail != old_val))
+            rte_pause();
+
+#ifdef PREFER
+    __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);

for c11 mem model version, it needs only __atomic_store_n version.


+#else
+    ht->tail = new_val;
+#endif
+}
+
+/**
+ * @internal This function updates the producer head for enqueue
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should
the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
+ * @param old_head
+ *   Returns h

[dpdk-dev] [PATCH 2/3] ring: guarantee load ordering of cons/prod when doing enqueue/dequeue

2017-11-07 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
As for the possible race condition, please refer to [1].

Furthermore, to fix this race, there are 2 options as suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by
default it is "y" only on arm64.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, please refer to [3].

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

We haven't tested the ppc64 case. If anyone verifies it, he can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files.

[1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
[2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[3] http://dpdk.org/ml/archives/dev/2017-October/080861.html

---
Changelog:
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
Signed-off-by: jia...@hxt-semitech.com
Suggested-by: jerin.ja...@caviumnetworks.com
Suggested-by: konstantin.anan...@intel.com

---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   4 +-
 lib/librte_ring/rte_ring.h   | 160 +++--
 lib/librte_ring/rte_ring_c11_mem.h   | 185 +
 lib/librte_ring/rte_ring_generic.h   | 192 +++
 5 files changed, 397 insertions(+), 150 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index ea9b688..3e49458 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..a2682e7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,8 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..53764d9 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,85 +356,19 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueu

[dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()

2017-11-07 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..38c3393 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory");
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory");
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

[dpdk-dev] [PATCH 3/3] config: support C11 memory model for arm64

2017-11-07 Thread Jia He

by default, CONFIG_RTE_RING_USE_C11_MEM_MODEL is y on arm64

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
---
 config/common_armv8a_linuxapp | 1 +
 1 file changed, 1 insertion(+)

diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
index 6732d1e..1bf6e4d 100644
--- a/config/common_armv8a_linuxapp
+++ b/config/common_armv8a_linuxapp
@@ -49,3 +49,4 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
 CONFIG_RTE_LIBRTE_AVP_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
-- 
2.7.4

[dpdk-dev] [PATCH v4 3/4] ring: introduce new header file to include common functions

2017-11-08 Thread Jia He

move the common part of rte_ring.h into rte_ring_generic.h
move the memory barrier part into update_tail()
no functional changes here

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev, Konstantin 
---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   3 +-
 lib/librte_ring/rte_ring.h   | 159 +---
 lib/librte_ring/rte_ring_generic.h   | 194 +++
 4 files changed, 202 insertions(+), 160 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index ea9b688..3e49458 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..c959945 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,7 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 3e8085a..519614c 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,90 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
-static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
-{
-   const uint32_t capacity = r->capacity;
-   unsigned int max = n;
-   int success;
-
-   do {
-   /* Reset n to the initial burst count */
-   n = max;
-
-   *old_head = r->prod.head;
-
-   /* add rmb barrier to avoid load/load reorder in weak
-* memory model. It is noop on x86 */
-   rte_smp_rmb();
-
-   const uint32_t cons_tail = r->cons.tail;
-   /*
-*  The subtraction is done between two unsigned 32bits value
-* (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
-* and capacity (which is < size).
-*/
-   *free_entries = (capacity + cons_tail - *old_head);
-
-   /* check that we have enough room in ring */
-   if (unlikely(n > *free_entries))
-   n = (behavior

[dpdk-dev] [PATCH v4 0/4] fix race condition in enqueue/dequeue because of cpu reorder

2017-11-08 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server due
to a possible race condition.

To fix this race, there are 2 options as suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

---
Changelog:
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (4):
  eal/arm64: remove the braces {} for dmb() and dsb()
  ring: guarantee load/load order in enqueue and dequeue
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_armv8a_linuxapp  |   2 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 161 ++---
 lib/librte_ring/rte_ring_c11_mem.h | 185 
 lib/librte_ring/rte_ring_generic.h | 194 +
 7 files changed, 404 insertions(+), 152 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

[dpdk-dev] [PATCH v4 2/4] ring: guarantee load/load order in enqueue and dequeue

2017-11-08 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
In __rte_ring_move_cons_head()
...
do {
/* Restore n as it may change every loop */
n = max;

*old_head = r->cons.head;//1st load
const uint32_t prod_tail = r->prod.tail; //2nd load

In weak memory order architectures(powerpc,arm), the 2nd load might be
reodered before the 1st load, that makes *entries is bigger than we wanted.
This nasty reording messed enque/deque up.

cpu1(producer)  cpu2(consumer)  cpu3(consumer)
load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

load r->cons.head
load r->prod.tail
...
store r->cons.{head,tail}
load r->cons.head

Then, r->cons.head will be bigger than prod_tail, then make *entries very
big and the consumer will go forward incorrectly.

After this patch, the old cons.head will be recaculated after failure of
rte_atomic32_cmpset

There is no such issue on X86, because X86 is strong memory order model.

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
---
 lib/librte_ring/rte_ring.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..3e8085a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
n = max;
 
*old_head = r->prod.head;
+
+   /* add rmb barrier to avoid load/load reorder in weak
+* memory model. It is noop on x86 */
+   rte_smp_rmb();
+
const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +522,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
n = max;
 
*old_head = r->cons.head;
+
+   /* add rmb barrier to avoid load/load reorder in weak
+* memory model. It is noop on x86 */
+   rte_smp_rmb();
+
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
-- 
2.7.4

[dpdk-dev] [PATCH v4 4/4] ring: introduce new header file to support C11 memory model

2017-11-08 Thread Jia He

To fix the cpu reorder race condition in enque/deque, there are 2 options
suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
for the 2nd option, CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by
default it is "y" only on arm64.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, refer to [2].

We haven't tested on ppc64. If anyone verifies it, he can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
---
 config/common_armv8a_linuxapp  |   2 +
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/rte_ring.h |  14 ++-
 lib/librte_ring/rte_ring_c11_mem.h | 185 +
 4 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h

diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
index 6732d1e..5b7b2eb 100644
--- a/config/common_armv8a_linuxapp
+++ b/config/common_armv8a_linuxapp
@@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
 CONFIG_RTE_LIBRTE_AVP_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index c959945..a2682e7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
-   rte_ring_generic.h
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 519614c..3343eba 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-/* Move common functions to generic file */
+/* Between load and load. there might be cpu reorder in weak model
+ * (powerpc/arm).
+ * There are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier,defined by
+ * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_RING_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
 #include "rte_ring_generic.h"
+#endif
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
new file mode 100644
index 000..c167b46
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,185 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+   uint32_t single, uint32_t enqueue)
+{
+   RTE_SET_USED(enqueue);
+
+   /*
+* If there are other enqueues/dequeues in progress tha

[dpdk-dev] [PATCH v4 1/4] eal/arm64: remove the braces {} for dmb() and dsb()

2017-11-08 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition")
Signed-off-by: Jia He 
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..71da29c 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

Re: [dpdk-dev] [PATCH v4 0/4] fix race condition in enqueue/dequeue because of cpu reorder

2017-11-08 Thread Jia He


Hi Bruce


On 11/8/2017 8:15 PM, Bruce Richardson Wrote:

On Wed, Nov 08, 2017 at 09:54:37AM +, Jia He wrote:

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
due to a possible race condition.

To fix this race, there are 2 options as suggested by Jerin: 1. use
rte_smp_rmb 2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is
"y" only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows: - on X86 - on
arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with
CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

--- Changelog: V4: split into small patches V3: arch specific
implementation for enqueue/dequeue barrier V2: let users choose
whether using load_acquire/store_release V1: rte_smp_rmb() between 2
loads

Jia He (4): eal/arm64: remove the braces {} for dmb() and dsb() ring:
guarantee load/load order in enqueue and dequeue ring: introduce new
header file to include common functions ring: introduce new header
file to support C11 memory model


I'm wondering what the merge plans are for this set, given we are now
past RC3 in 17.11? As the rings are broken on ARM machines we need to
merge in some fix, but I'm a little concerned about the scope of the
changes from the 3rd and 4th patches. Would it be acceptable to just
merge in patches 1 & 2 in 17.11 and leave the rework and C11 memory
model additions in patches 3 & 4 to 18.02 release?

As far as I'm concerned, it is ok.

Cheers,
Jia

Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()

2017-11-08 Thread Jia He


Hi Bruce


On 11/8/2017 6:28 PM, Bruce Richardson Wrote:

On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote:

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
---
  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..38c3393 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
  
  #include "generic/rte_atomic.h"
  
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }

-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory");
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory");
  

Need to remove the trailing ";" I too I think.
Alternatively, to keep the braces, the standard practice is to use
do { ... } while(0)

If trailing ";" is not removed
the code:
if (condition)
    rte_smp_rmb();
else
    anything();

will be like below after precompiling:
if (condition)
    asm volatile("dsb " "ld" : : : "memory");;
else
    anything();
Then, the same error - error: 'else' without a previous 'if'

If you choose do/while(0), yes, no errors from compiler.
But the checkpatch will report

WARNING: Single statement macros should not use a do {} while (0) loop
#11: FILE: lib/librte_eal/common/include/arch/arm/rte_atomic_64.h:46:
+#define dsb(opt) do { asm volatile("dsb " #opt : : : "memory"); } while (0)

I searched the kernel codes, the marco dsb() didn't use the do/while(0).
Which one do you think is better for dpdk?

Cheers,
Jia

Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()

2017-11-08 Thread Jia He




On 11/9/2017 9:22 AM, Jia He Wrote:

Hi Bruce


On 11/8/2017 6:28 PM, Bruce Richardson Wrote:

On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote:

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
---
  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h

index 0b70d62..38c3393 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
    #include "generic/rte_atomic.h"
  -#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory");
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory");

Need to remove the trailing ";" I too I think.
Alternatively, to keep the braces, the standard practice is to use
do { ... } while(0)

If trailing ";" is not removed
the code:
if (condition)
    rte_smp_rmb();
else
    anything();

will be like below after precompiling:
if (condition)
    asm volatile("dsb " "ld" : : : "memory");;
else
    anything();
Then, the same error - error: 'else' without a previous 'if'


Ignore my words above，thanks
sorry for the inconvenience.
And I've sent out v4 in this mail thread. The ";" has been removed.
If no more comments, I will send out v5 (2 patch sets for 17 and 18)

--
Cheers,
Jia


--
Cheers,
Jia

Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()

2017-11-08 Thread Jia He


Hi Jianbo


On 11/9/2017 11:21 AM, Jianbo Liu Wrote:

The 11/09/2017 11:14, Jia He wrote:


On 11/9/2017 9:22 AM, Jia He Wrote:

Hi Bruce


On 11/8/2017 6:28 PM, Bruce Richardson Wrote:

On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote:

for the code as follows:
if (condition)
 rte_smp_rmb();
else
 rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
---
   lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git
a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..38c3393 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
     #include "generic/rte_atomic.h"
   -#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory");
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory");

Need to remove the trailing ";" I too I think.
Alternatively, to keep the braces, the standard practice is to use
do { ... } while(0)

If trailing ";" is not removed
the code:
if (condition)
     rte_smp_rmb();
else
     anything();


Sorry, why not use two different functions as your conditions passed in
are fixed in the calling functions.
Do you mean to split update_tail() into update_tail_enqueue() and 
update_tail_dequeue()?

Cheers,
Jia

[dpdk-dev] [PATCH v5 1/1] ring: guarantee load/load order in enqueue and dequeue

2017-11-09 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
In __rte_ring_move_cons_head()
...
do {
/* Restore n as it may change every loop */
n = max;

*old_head = r->cons.head;//1st load
const uint32_t prod_tail = r->prod.tail; //2nd load

cpu1(producer)  cpu2(consumer)  cpu3(consumer)
load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

load r->cons.head
load r->prod.tail
...
store r->cons.{head,tail}
load r->cons.head

In weak memory order architectures(powerpc,arm), the 2nd load might be
reodered before the 1st load, that makes *entries is bigger than we
wanted. This nasty reording messed enque/deque up. Then, r->cons.head
will be bigger than prod_tail, then make *entries very big and the
consumer will go forward incorrectly.

After this patch, even with above context switches, the old cons.head
will be recaculated after failure of rte_atomic32_cmpset. So no race
conditions left.

There is no such issue on X86, because X86 is strong memory order model.
But rte_smp_rmb() doesn't have impact on runtime performance on X86, so
keep the same code without architectures specific concerns.

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
---
 lib/librte_ring/rte_ring.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..3e8085a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
n = max;
 
*old_head = r->prod.head;
+
+   /* add rmb barrier to avoid load/load reorder in weak
+* memory model. It is noop on x86 */
+   rte_smp_rmb();
+
const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +522,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
n = max;
 
*old_head = r->cons.head;
+
+   /* add rmb barrier to avoid load/load reorder in weak
+* memory model. It is noop on x86 */
+   rte_smp_rmb();
+
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
-- 
2.7.4

[dpdk-dev] [PATCH v5 0/1] fix race condition in enqueue/dequeue because of cpu reorder

2017-11-09 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server due
to a possible race condition. To fix this race condition, rmb() is needed to add
between the 2 loads.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 

---
Changelog
V5: split it into 2 patchset due to the milestone concerns, this is the 1st one
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (4):
  eal/arm64: remove the braces {} for dmb() and dsb()
  ring: guarantee load/load order in enqueue and dequeue
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_armv8a_linuxapp  |   2 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 161 ++---
 lib/librte_ring/rte_ring_c11_mem.h | 185 
 lib/librte_ring/rte_ring_generic.h | 194 +
 7 files changed, 404 insertions(+), 152 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()

2017-11-09 Thread Jia He




On 11/9/2017 5:38 PM, Ananyev, Konstantin Wrote:



-Original Message-
From: Jianbo Liu [mailto:jianbo@arm.com]
Sent: Thursday, November 9, 2017 4:56 AM
To: Jia He 
Cc: Richardson, Bruce ; 
jerin.ja...@caviumnetworks.com; dev@dpdk.org; olivier.m...@6wind.com;
Ananyev, Konstantin ; hemant.agra...@nxp.com; 
jia...@hxt-semitech.com
Subject: Re: [PATCH 1/3] eal/arm64: remove the braces {} for dmb(),dsb()

The 11/09/2017 12:43, Jia He wrote:

Hi Jianbo


On 11/9/2017 11:21 AM, Jianbo Liu Wrote:

The 11/09/2017 11:14, Jia He wrote:

On 11/9/2017 9:22 AM, Jia He Wrote:

Hi Bruce


On 11/8/2017 6:28 PM, Bruce Richardson Wrote:

On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote:

for the code as follows:
if (condition)
 rte_smp_rmb();
else
 rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Signed-off-by: Jia He 
Signed-off-by: jia...@hxt-semitech.com
---
   lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git
a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..38c3393 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
 #include "generic/rte_atomic.h"
   -#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory");
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory");

Need to remove the trailing ";" I too I think.
Alternatively, to keep the braces, the standard practice is to use
do { ... } while(0)

If trailing ";" is not removed
the code:
if (condition)
 rte_smp_rmb();
else
 anything();


Sorry, why not use two different functions as your conditions passed in
are fixed in the calling functions.

Do you mean to split update_tail() into update_tail_enqueue() and
update_tail_dequeue()?

Yes. So it's not need to change dsb/dmb.

That's a good idea - but you still might hit the same problem in
Some different place in future...
Why not to convert these macros into 'always_inline' functions then?
Konstantin


It makes things more complex
opt needs to be redefined with types
such as : __attribute__((always_inline)) void dsb( char* opt)
and the input paramenter shoud be
#define sy "sy"
#define ld "ld"

And the "#" in asm codes needs to be considerred more.

IMO, the kernel way is simple and clean, isn't it?
#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
Another choice is adding the do/while.

@Ananyev @Jianbo
Any thoughts?

--
Cheers,
Jia

[dpdk-dev] [PATCH v6] guarantee load/load order in enqueue and dequeue

2017-11-09 Thread Jia He

From: Jia He 

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
(Amberwing) due to a possible race condition. To fix this race condition,
rmb() is needed to add between the 2 loads.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 

---
Changelog
V6: improve the text description and fix the checkpatch warnings
V5: split it into 2 patchset due to the milestone concerns, this is the 1st one
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (1):
  ring: guarantee load/load order in enqueue and dequeue

 lib/librte_ring/rte_ring.h | 12 
 1 file changed, 12 insertions(+)

-- 
2.7.4

[dpdk-dev] [PATCH v6] ring: guarantee load/load order in enqueue and dequeue

2017-11-09 Thread Jia He

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
(Amberwing).

Root cause:
In __rte_ring_move_cons_head()
...
do {
/* Restore n as it may change every loop */
n = max;

*old_head = r->cons.head;//1st load
const uint32_t prod_tail = r->prod.tail; //2nd load

In weak memory order architectures(powerpc,arm), the 2nd load might be
reodered before the 1st load, that makes *entries is bigger than we wanted.
This nasty reording messed enque/deque up.

cpu1(producer)  cpu2(consumer)  cpu3(consumer)
load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

load r->cons.head
load r->prod.tail
...
store r->cons.{head,tail}
load r->cons.head

Then, r->cons.head will be bigger than prod_tail, then make *entries very
big and the consumer will go forward incorrectly.

After this patch, the old cons.head will be recaculated after failure of
rte_atomic32_cmpset

There is no such issue on X86, because X86 is strong memory order model.
But rte_smp_rmb() doesn't have impact on runtime performance on X86, so
keep the same code without architectures specific concerns.

Signed-off-by: Jia He 
Signed-off-by: jie2@hxt-semitech.com
Signed-off-by: bing.z...@hxt-semitech.com
Acked-by: Jerin Jacob 
Acked-by: Jianbo Liu 
Cc: sta...@dpdk.org

---
 lib/librte_ring/rte_ring.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..e924438 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,12 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
n = max;
 
*old_head = r->prod.head;
+
+   /* add rmb barrier to avoid load/load reorder in weak
+* memory model. It is noop on x86
+*/
+   rte_smp_rmb();
+
const uint32_t cons_tail = r->cons.tail;
/*
 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +523,12 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
n = max;
 
*old_head = r->cons.head;
+
+   /* add rmb barrier to avoid load/load reorder in weak
+* memory model. It is noop on x86
+*/
+   rte_smp_rmb();
+
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
-- 
2.7.4

[dpdk-dev] [PATCH v5 0/3] support c11 memory model barrier in librte_ring

2017-11-09 Thread Jia He

From: Jia He 

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

---
Changelog:
V5: split it into 2 patchset due to the milestone concerns, this is the 2st 
one. Also fix checkpatch.pl warnings
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

*** BLURB HERE ***

Jia He (3):
  eal/arm64: remove the braces {} for dmb() and dsb()
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_armv8a_linuxapp  |   2 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 173 ++
 lib/librte_ring/rte_ring_c11_mem.h | 186 
 lib/librte_ring/rte_ring_generic.h | 195 +
 7 files changed, 406 insertions(+), 164 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

[dpdk-dev] [PATCH v5 2/3] ring: introduce new header file to include common functions

2017-11-09 Thread Jia He

From: Jia He 

move the common part of rte_ring.h into rte_ring_generic.h
move the memory barrier part into update_tail()
no functional changes here

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev, Konstantin 
---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   3 +-
 lib/librte_ring/rte_ring.h   | 161 +
 lib/librte_ring/rte_ring_generic.h   | 195 +++
 4 files changed, 203 insertions(+), 162 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index ea9b688..3e49458 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..c959945 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,7 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index e924438..519614c 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
-static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
-{
-   const uint32_t capacity = r->capacity;
-   unsigned int max = n;
-   int success;
-
-   do {
-   /* Reset n to the initial burst count */
-   n = max;
-
-   *old_head = r->prod.head;
-
-   /* add rmb barrier to avoid load/load reorder in weak
-* memory model. It is noop on x86
-*/
-   rte_smp_rmb();
-
-   const uint32_t cons_tail = r->cons.tail;
-   /*
-*  The subtraction is done between two unsigned 32bits value
-* (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
-* and capacity (which is < size).
-*/
-   *free_entries = (capacity + cons_tail - *old_head);
-
-   /* check that we have enough room in ring */
-   if (unlikely(

[dpdk-dev] [PATCH v5 1/3] eal/arm64: remove the braces {} for dmb() and dsb()

2017-11-09 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition")
Signed-off-by: Jia He 
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..71da29c 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

[dpdk-dev] [PATCH v6 3/3] ring: introduce new header file to support C11 memory model

2017-11-09 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, refer to [2].

We haven't tested on ppc64. If anyone verifies it, he can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
---
 config/common_armv8a_linuxapp  |   2 +
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/rte_ring.h |  14 ++-
 lib/librte_ring/rte_ring_c11_mem.h | 186 +
 4 files changed, 203 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h

diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
index 6732d1e..5b7b2eb 100644
--- a/config/common_armv8a_linuxapp
+++ b/config/common_armv8a_linuxapp
@@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
 CONFIG_RTE_LIBRTE_AVP_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index c959945..a2682e7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
-   rte_ring_generic.h
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 519614c..3343eba 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-/* Move common functions to generic file */
+/* Between load and load. there might be cpu reorder in weak model
+ * (powerpc/arm).
+ * There are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier,defined by
+ * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_RING_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
 #include "rte_ring_generic.h"
+#endif
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
new file mode 100644
index 000..ca21e95
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,186 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+   uint32_t single, uint32_t enqueue)
+{
+   RTE_SET_USED(enqueue);
+
+   /*
+* If there are other enqueues/dequeues in progress that preceded us,
+* we need to wai

Re: [dpdk-dev] [PATCH v4 0/4] fix race condition in enqueue/dequeue because of cpu reorder

2017-11-23 Thread Jia He


Hi Bruce

I knew little about DPDK's milestone

I read some hints from http://dpdk.org/dev/roadmap

18.02

 * Proposal deadline: November 24, 2017
 * Integration deadline: January 5, 2018

I wonder whether I need to resend the patchset(the 2nd part) again after 
November 24?

Thanks for any suggestions

Cheers,
Jia
On 11/8/2017 11:11 PM, Jia He Wrote:

Hi Bruce


On 11/8/2017 8:15 PM, Bruce Richardson Wrote:

On Wed, Nov 08, 2017 at 09:54:37AM +0000, Jia He wrote:

We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
due to a possible race condition.

To fix this race, there are 2 options as suggested by Jerin: 1. use
rte_smp_rmb 2. use load_acquire/store_release(refer to [2]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is
"y" only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows: - on X86 - on
arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with
CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

--- Changelog: V4: split into small patches V3: arch specific
implementation for enqueue/dequeue barrier V2: let users choose
whether using load_acquire/store_release V1: rte_smp_rmb() between 2
loads

Jia He (4): eal/arm64: remove the braces {} for dmb() and dsb() ring:
guarantee load/load order in enqueue and dequeue ring: introduce new
header file to include common functions ring: introduce new header
file to support C11 memory model


I'm wondering what the merge plans are for this set, given we are now
past RC3 in 17.11? As the rings are broken on ARM machines we need to
merge in some fix, but I'm a little concerned about the scope of the
changes from the 3rd and 4th patches. Would it be acceptable to just
merge in patches 1 & 2 in 17.11 and leave the rework and C11 memory
model additions in patches 3 & 4 to 18.02 release?

As far as I'm concerned, it is ok.

Cheers,
Jia

[dpdk-dev] [PATCH V6 0/3] support c11 memory model barrier in librte_ring

2017-11-26 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

---
Changelog:
V6: minor change in subject and log
V5: split it into 2 patchset due to the milestone concerns, this is the 2st 
one. Also fix checkpatch.pl warnings
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (3):
  eal/arm64: remove the braces {} for dmb(),dsb()
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_armv8a_linuxapp  |   2 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 173 ++
 lib/librte_ring/rte_ring_generic.h | 195 +
 6 files changed, 220 insertions(+), 164 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

[dpdk-dev] [PATCH V6 1/3] eal/arm64: remove the braces {} for dmb() and dsb()

2017-11-26 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Fixes: 84733fd0d75e ("eal/arm: fix memory barrier definition")
Cc: sta...@dpdk.org
Signed-off-by: Jia He 
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..71da29c 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

[dpdk-dev] [PATCH V6 2/3] ring: introduce new header file to include common functions

2017-11-26 Thread Jia He

move the common part of rte_ring.h into rte_ring_generic.h.
move the memory barrier part into update_tail().

no functional changes here.

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev, Konstantin 
---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   3 +-
 lib/librte_ring/rte_ring.h   | 161 +
 lib/librte_ring/rte_ring_generic.h   | 195 +++
 4 files changed, 203 insertions(+), 162 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index ea9b688..3e49458 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..c959945 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,7 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index e924438..519614c 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
-static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
-{
-   const uint32_t capacity = r->capacity;
-   unsigned int max = n;
-   int success;
-
-   do {
-   /* Reset n to the initial burst count */
-   n = max;
-
-   *old_head = r->prod.head;
-
-   /* add rmb barrier to avoid load/load reorder in weak
-* memory model. It is noop on x86
-*/
-   rte_smp_rmb();
-
-   const uint32_t cons_tail = r->cons.tail;
-   /*
-*  The subtraction is done between two unsigned 32bits value
-* (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
-* and capacity (which is < size).
-*/
-   *free_entries = (capacity + cons_tail - *old_head);
-
-   /* check that we have enough room in ring */
-   if (unlikely(n > *free_entries))
-

[dpdk-dev] [PATCH V6 3/3] ring: introduce new header file to support C11 memory model

2017-11-26 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, refer to [2].

We haven't tested on ppc64. If anyone verifies it, he can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
---
 config/common_armv8a_linuxapp  |   2 +
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/rte_ring.h |  14 ++-
 lib/librte_ring/rte_ring_c11_mem.h | 186 +
 4 files changed, 203 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h

diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
index 6732d1e..5b7b2eb 100644
--- a/config/common_armv8a_linuxapp
+++ b/config/common_armv8a_linuxapp
@@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
 CONFIG_RTE_LIBRTE_AVP_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index c959945..a2682e7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
-   rte_ring_generic.h
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 519614c..3343eba 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-/* Move common functions to generic file */
+/* Between load and load. there might be cpu reorder in weak model
+ * (powerpc/arm).
+ * There are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier,defined by
+ * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_RING_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
 #include "rte_ring_generic.h"
+#endif
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
new file mode 100644
index 000..ca21e95
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,186 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+   uint32_t single, uint32_t enqueue)
+{
+   RTE_SET_USED(enqueue);
+
+   /*
+* If there are other enqueues/dequeues in progress that preceded us,
+* we need to wai

[dpdk-dev] [PATCH V7 1/3] eal/arm64: remove the braces {} for dmb() and dsb()

2017-12-03 Thread Jia He

for the code as follows:
if (condition)
rte_smp_rmb();
else
rte_smp_wmb();
Without this patch, compiler will report this error:
error: 'else' without a previous 'if'

Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition")
Cc: sta...@dpdk.org
Signed-off-by: Jia He 
Acked-by: Jerin Jacob 
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h 
b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d62..71da29c 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -43,8 +43,8 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
-#define dsb(opt)  { asm volatile("dsb " #opt : : : "memory"); }
-#define dmb(opt)  { asm volatile("dmb " #opt : : : "memory"); }
+#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
 
 #define rte_mb() dsb(sy)
 
-- 
2.7.4

[dpdk-dev] [PATCH V7 0/3] support c11 memory model barrier in librte_ring

2017-12-03 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines.

Already fuctionally tested on the machines as follows:
- on X86
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
- on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n

---
Changelog:
V7: fix check-git-log warnings which is suggested by Jerin
V6: minor change in subject and log
V5: split it into 2 patchset due to the milestone concerns, this is the 2st 
one. Also fix checkpatch.pl warnings
V4: split into small patches
V3: arch specific implementation for enqueue/dequeue barrier
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads

Jia He (3):
  eal/arm64: remove the braces {} for dmb() and dsb()
  ring: introduce new header file to include common functions
  ring: introduce new header file to support C11 memory model

 config/common_armv8a_linuxapp  |   2 +
 .../common/include/arch/arm/rte_atomic_64.h|   4 +-
 lib/librte_eventdev/rte_event_ring.h   |   6 +-
 lib/librte_ring/Makefile   |   4 +-
 lib/librte_ring/rte_ring.h | 173 ++
 lib/librte_ring/rte_ring_c11_mem.h | 186 
 lib/librte_ring/rte_ring_generic.h | 195 +
 7 files changed, 406 insertions(+), 164 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h
 create mode 100644 lib/librte_ring/rte_ring_generic.h

-- 
2.7.4

[dpdk-dev] [PATCH V7 3/3] ring: introduce new header file to support C11 memory model

2017-12-03 Thread Jia He

To support C11 memory model barrier, 2 options are suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y"
only on arm64 so far.

The reason why providing 2 options is due to the performance benchmark
difference in different arm machines, refer to [2].

We haven't tested on ppc64. If anyone verifies it, he can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Acked-by: Jerin Jacob 
---
 config/common_armv8a_linuxapp  |   2 +
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/rte_ring.h |  14 ++-
 lib/librte_ring/rte_ring_c11_mem.h | 186 +
 4 files changed, 203 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_c11_mem.h

diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
index 6732d1e..5b7b2eb 100644
--- a/config/common_armv8a_linuxapp
+++ b/config/common_armv8a_linuxapp
@@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
 CONFIG_RTE_LIBRTE_AVP_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index c959945..a2682e7 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
-   rte_ring_generic.h
+   rte_ring_generic.h \
+   rte_ring_c11_mem.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 519614c..3343eba 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-/* Move common functions to generic file */
+/* Between load and load. there might be cpu reorder in weak model
+ * (powerpc/arm).
+ * There are 2 choices for the users
+ * 1.use rmb() memory barrier
+ * 2.use one-direcion load_acquire/store_release barrier,defined by
+ * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_RING_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
 #include "rte_ring_generic.h"
+#endif
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
new file mode 100644
index 000..ca21e95
--- /dev/null
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -0,0 +1,186 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_C11_MEM_H_
+#define _RTE_RING_C11_MEM_H_
+
+static __rte_always_inline void
+update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+   uint32_t single, uint32_t enqueue)
+{
+   RTE_SET_USED(enqueue);
+
+   /*
+* If there are other enqueues/dequeues in progress that preceded us,
+

[dpdk-dev] [PATCH V7 2/3] ring: introduce new header file to include common functions

2017-12-03 Thread Jia He

move the common part of rte_ring.h into rte_ring_generic.h.
move the memory barrier part into update_tail().

no functional changes here.

Signed-off-by: Jia He 
Suggested-by: Jerin Jacob 
Suggested-by: Ananyev Konstantin 
Acked-by: Jerin Jacob 
---
 lib/librte_eventdev/rte_event_ring.h |   6 +-
 lib/librte_ring/Makefile |   3 +-
 lib/librte_ring/rte_ring.h   | 161 +
 lib/librte_ring/rte_ring_generic.h   | 195 +++
 4 files changed, 203 insertions(+), 162 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic.h

diff --git a/lib/librte_eventdev/rte_event_ring.h 
b/lib/librte_eventdev/rte_event_ring.h
index ea9b688..3e49458 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
goto end;
 
ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-   rte_smp_wmb();
 
-   update_tail(&r->r.prod, prod_head, prod_next, 1);
+   update_tail(&r->r.prod, prod_head, prod_next, 1, 1);
 end:
if (free_space != NULL)
*free_space = free_entries - n;
@@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
goto end;
 
DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-   rte_smp_rmb();
 
-   update_tail(&r->r.cons, cons_head, cons_next, 1);
+   update_tail(&r->r.cons, cons_head, cons_next, 1, 0);
 
 end:
if (available != NULL)
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index e34d9d9..c959945 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -45,6 +45,7 @@ LIBABIVER := 1
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+   rte_ring_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index e924438..519614c 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
} \
 } while (0)
 
-static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-   uint32_t single)
-{
-   /*
-* If there are other enqueues/dequeues in progress that preceded us,
-* we need to wait for them to complete
-*/
-   if (!single)
-   while (unlikely(ht->tail != old_val))
-   rte_pause();
-
-   ht->tail = new_val;
-}
-
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
-static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
-{
-   const uint32_t capacity = r->capacity;
-   unsigned int max = n;
-   int success;
-
-   do {
-   /* Reset n to the initial burst count */
-   n = max;
-
-   *old_head = r->prod.head;
-
-   /* add rmb barrier to avoid load/load reorder in weak
-* memory model. It is noop on x86
-*/
-   rte_smp_rmb();
-
-   const uint32_t cons_tail = r->cons.tail;
-   /*
-*  The subtraction is done between two unsigned 32bits value
-* (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
-* and capacity (which is < size).
-*/
-   *free_entries = (capacity + cons_tail - *old_head);
-
-   /* check that we have enough room in ring */
-   if (unlikely(

[dpdk-dev] About pmu cycle counter usage in armv8

2017-12-10 Thread Jia He


Hi Jerin

In [1], I met a pmu cycle counter problem (all return value is 0) 
occasionally.


And then I submited a patch to kernel maillist, but was rejected by 
maintainer at last [2].


He said:

"We only intend for the in-kernel perf infrastructure to access

pmccntr_el0; nothing else should touch it."

So maybe it is not proper for dpdk to use pmu cycle counter?

[1] http://dpdk.org/ml/archives/dev/2017-November/080998.html

[2]https://lkml.org/lkml/2017/11/16/22

--
Cheers,
Jia

Re: [dpdk-dev] About pmu cycle counter usage in armv8

2017-12-10 Thread Jia He


Hi Jerin

Ok

And I wonder why you haven't met such problem (all rd_tsc() is 0)in your 
test platform?


Did you use an old kernel (older than v4.5-rc1)?

root@aw-host:~/linux# git describe da4e4f18afe0
v4.5-rc1-8-gda4e4f1

Maybe you need to give a warning that, the usage of High-resolution 
cycle counter is not

correct if the kernel version is newer than v4.5-rc1-8-gda4e4f1?

Cheers,
Jia
On 12/11/2017 1:59 PM, Jerin Jacob Wrote:

-Original Message-

Date: Mon, 11 Dec 2017 13:38:25 +0800
From: Jia He 
To: Jerin Jacob , "dev@dpdk.org"
  
Subject: About pmu cycle counter usage in armv8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
  Thunderbird/52.5.0

Hi Jerin

In [1], I met a pmu cycle counter problem (all return value is 0)
occasionally.

And then I submited a patch to kernel maillist, but was rejected by
maintainer at last [2].

He said:

"We only intend for the in-kernel perf infrastructure to access

pmccntr_el0; nothing else should touch it."

Yes. That's the reason why
1) A warning added in documentation.
http://dpdk.org/doc/guides/prog_guide/profile_app.html
See at last
"
The PMU based scheme is useful for high accuracy performance profiling
with rte_rdtsc(). However, this method can not be used in conjunction
with Linux userspace profiling tools like perf as this scheme alters the
PMU registers state.
"
2) By default it is disabled and not need for production systems.
Needed only for performance debugging.


So maybe it is not proper for dpdk to use pmu cycle counter?

But, There is no alternative in arm64 to get high resolution counter in
user space(in performance effective way)


[1] http://dpdk.org/ml/archives/dev/2017-November/080998.html

[2]https://lkml.org/lkml/2017/11/16/22

--
Cheers,
Jia

68 matches

Mail list logo