Re: [dpdk-dev] [PATCH V7 2/3] ring: introduce new header file to include common functions
On 1/13/2018 1:09 AM, Thomas Monjalon Wrote: 04/12/2017 02:50, Jia He: move the common part of rte_ring.h into rte_ring_generic.h. move the memory barrier part into update_tail(). no functional changes here. [...] --- /dev/null +++ b/lib/librte_ring/rte_ring_generic.h @@ -0,0 +1,195 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 hxt-semitech. All rights reserved. As you are moving existing code, I think you should keep the original copyright. Ok, thanks for the comments -- Cheers, Jia
[dpdk-dev] [PATCH v8 0/3] support c11 memory model barrier in librte_ring
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n" on any architectures so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V8: Change the lincense to SPDX tag. Change USE_C11_MEM_MODEL to "n" on any architectures by default V7: fix check-git-log warnings which is suggested by Jerin V6: minor change in subject and log V5: split it into 2 patchset due to the milestone concerns, this is the 2st one. Also fix checkpatch.pl warnings V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (3): eal/arm64: remove the braces {} for dmb() and dsb() ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_linuxapp | 2 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 173 ++ lib/librte_ring/rte_ring_c11_mem.h | 193 lib/librte_ring/rte_ring_generic.h | 202 + 7 files changed, 420 insertions(+), 164 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
[dpdk-dev] [PATCH v8 1/3] eal/arm64: remove the braces {} for dmb() and dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition") Signed-off-by: Jia He Acked-by: Jerin Jacob --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index b6bbd0b..10ccf14 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -15,8 +15,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory") +#define dmb(opt) asm volatile("dmb " #opt : : : "memory") #define rte_mb() dsb(sy) -- 2.7.4
[dpdk-dev] [PATCH v8 2/3] ring: introduce new header file to include common functions
Move the common part of rte_ring.h into rte_ring_generic.h. Move the memory barrier part into update_tail(). No functional changes here. Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev Konstantin Acked-by: Jerin Jacob Acked-by: Olivier Matz --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 161 +--- lib/librte_ring/rte_ring_generic.h | 202 +++ 4 files changed, 210 insertions(+), 162 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index 480d1d0..29d4228 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -98,9 +98,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -140,9 +139,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index f5f0d35..281ffb7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -17,6 +17,7 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index e924438..519614c 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueued. - * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only. - */ -static __rte_always_inline unsigned int -__rte_ring_move_prod_head(struct rte_ring *r, int is_sp, - unsigned int n, enum rte_ring_queue_behavior behavior, - uint32_t *old_head, uint32_t *new_head, - uint32_t *free_entries) -{ - const uint32_t capacity = r->capacity; - unsigned int max = n; - int success; - - do { - /* Reset n to the initial burst count */ - n = max; - - *old_head = r->prod.head; - - /* add rmb barrier to avoid load/load reorder in weak -* memory model. It is noop on x86 -*/ - rte_smp_rmb(); - - const uint32_t cons_tail = r->cons.tail; - /* -* The subtraction is done between two unsigned 32bits value -* (the result is always modulo 32 bits even if we have -* *old_head > cons_tail). So 'free_entries' is always between 0 -* and capacity (which is < size). -*/ - *free_entries = (capacity + cons_tail - *old_head); - - /* check that we have enough room in ring */ - if (unlikely(
[dpdk-dev] [PATCH v8 3/3] ring: introduce new header file to support C11 memory model
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n" on any architectures so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines, refer to [2]. We haven't tested on ppc64. If anyone verifies it, he can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files. [1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [2] http://dpdk.org/ml/archives/dev/2017-October/080861.html Signed-off-by: Jia He Suggested-by: Jerin Jacob Acked-by: Jerin Jacob Acked-by: Olivier Matz Acked-by: Jianbo Liu --- config/common_linuxapp | 2 + lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 14 ++- lib/librte_ring/rte_ring_c11_mem.h | 193 + 4 files changed, 210 insertions(+), 2 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h diff --git a/config/common_linuxapp b/config/common_linuxapp index 74c7d64..999fbaa 100644 --- a/config/common_linuxapp +++ b/config/common_linuxapp @@ -50,3 +50,5 @@ CONFIG_RTE_LIBRTE_AVP_PMD=y CONFIG_RTE_LIBRTE_NFP_PMD=y CONFIG_RTE_LIBRTE_POWER=y CONFIG_RTE_VIRTIO_USER=y + +CONFIG_RTE_RING_USE_C11_MEM_MODEL=n diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index 281ffb7..bde8907 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ - rte_ring_generic.h + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 519614c..3343eba 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -/* Move common functions to generic file */ +/* Between load and load. there might be cpu reorder in weak model + * (powerpc/arm). + * There are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier,defined by + * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y + * It depends on performance test results. + * By default, move common functions to rte_ring_generic.h + */ +#ifdef RTE_RING_USE_C11_MEM_MODEL +#include "rte_ring_c11_mem.h" +#else #include "rte_ring_generic.h" +#endif /** * @internal Enqueue several objects on the ring diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..20b7ff6 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,193 @@ +/*- + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * All rights reserved. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +/* + * Derived from FreeBSD's bufring.h + * + ** + * + * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. The name of Kip Macy nor the names of other + *contributors may be used to endorse or promote products derived from + *this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + ***/ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + RTE_SET_U
Re: [dpdk-dev] [PATCH v8 3/3] ring: introduce new header file to support C11 memory model
Hi Thomas On 1/17/2018 4:24 PM, Thomas Monjalon Wrote: 17/01/2018 05:03, Jia He: To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n" on any architectures so far. In previous patches, it was enabled for ARM. You decided to not enable it at all? Sorry, maybe I misunderstand your previous mail. >This config option should be added in the common file (as disabled). Do you mean CONFIG_RTE_RING_USE_C11_MEM_MODEL=n in comm_base and "y" in armv8 config? Cheers, Jia config/common_linuxapp | 2 + It should be defined in common_base, not common_linuxapp. --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,193 @@ +/*- + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * All rights reserved. + * + * SPDX-License-Identifier: BSD-3-Clause + */ It is not complying with the template. Please check the license/ directory. Why is it Intel Copyright? "All rights reserved" is probably not needed. +/* + * Derived from FreeBSD's bufring.h + * + ** + * + * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. The name of Kip Macy nor the names of other + *contributors may be used to endorse or promote products derived from + *this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + ***/ This double license may be an issue. Hemant, comment? -- Cheers, Jia
Re: [dpdk-dev] [PATCH v8 3/3] ring: introduce new header file to support C11 memory model
On 1/19/2018 7:52 AM, Thomas Monjalon Wrote: 17/01/2018 10:09, Thomas Monjalon: 17/01/2018 09:47, Jia He: Hi Thomas On 1/17/2018 4:24 PM, Thomas Monjalon Wrote: 17/01/2018 05:03, Jia He: To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n" on any architectures so far. In previous patches, it was enabled for ARM. You decided to not enable it at all? Sorry, maybe I misunderstand your previous mail. >This config option should be added in the common file (as disabled). Do you mean CONFIG_RTE_RING_USE_C11_MEM_MODEL=n in comm_base and "y" in armv8 config? Yes, exactly Please, could you send a v9? Thanks This double license may be an issue. Hemant, comment? Ok, should I remove the double license info before Hemant's confirmation? -- Cheers, Jia
Re: [dpdk-dev] [PATCH v8 2/3] ring: introduce new header file to include common functions
Hi Hermant On 1/20/2018 12:47 AM, Hemant Agrawal Wrote: Hi Olivier, On Fri, Jan 19, 2018 at 07:45:30PM +0530, Hemant Agrawal wrote: Hi Jia, On 1/17/2018 9:33 AM, Jia He wrote: Move the common part of rte_ring.h into rte_ring_generic.h. Move the memory barrier part into update_tail(). No functional changes here. Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev Konstantin Acked-by: Jerin Jacob Acked-by: Olivier Matz --- diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h new file mode 100644 index 000..01f2cae --- /dev/null +++ b/lib/librte_ring/rte_ring_generic.h @@ -0,0 +1,202 @@ +/*- + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * All rights reserved. + * + * SPDX-License-Identifier: BSD-3-Clause The SPDX should be first line. See other files for Intel or NXP. [Hemant] Don't add SPDX to this file. This file is not BSD-3 licensed. Please keep the full text as in the original file. + */ + +/* + * Derived from FreeBSD's bufring.h + * + +* ** +*** + * + * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or +without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + *this list of conditions and the following disclaimer. + * + * 2. The name of Kip Macy nor the names of other + *contributors may be used to endorse or promote products derived from + *this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, +OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR +BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, +WHETHER IN + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR +OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF +ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + +* ** +/ + This is BSD-2-freebsd, which is not a approved license for DPDK. Can you ask Kip Macy, if he/she is ok to re-license it with BSD-3? Please check with legal, if you can just keep the copyright of Kip Macy and re license it with BSD-3. I see the BSD-3 license to be permissive enough to be re-licensed as BSD-3. But I am not a lawyer. I agree this is something we should do, as a maintainer of librte_ring, I can do it. But here, Jia is just moving code in a new file. I don't think this should block his patchset from beeing included. [Hemant] I thought of blocking this kind of moves, so that we get the license complaint of DPDK faster 😊 Jia, shall keep the original copyrights and headers in this file (i.e. No SPDX). You need to fix it along with rte_ring.h in near future. Regards, Hemant Ok, I will Besides ,I got the allowance from Kip Macy just now. He/She allowed dpdk to license librte_ring.h as BSD-3. My question: >Would you mind allowing dpdk librte_ring.h to be licensed as BSD 3 instead of BSD 2? His/her reply: "I think that's fine. If you're using it be careful I think there's a fix to memory barrier usage needed more relaxed memory models such as ARM. I'll check reviews to see if it made it in or not." -- Cheers, Jia
[dpdk-dev] [PATCH v9 1/3] eal/arm64: remove the braces {} for dmb() and dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition") Signed-off-by: Jia He Acked-by: Jerin Jacob --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index b6bbd0b..10ccf14 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -15,8 +15,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory") +#define dmb(opt) asm volatile("dmb " #opt : : : "memory") #define rte_mb() dsb(sy) -- 2.7.4
[dpdk-dev] [PATCH v9 0/3] support c11 memory model barrier in librte_ring
There are 2 model barrier options in librte_ring suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided for supporting C11 memory model barrier in librte_ring. By default it is "y" on arm64, "n" on any other architectures so far. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V9: remove the SPDX tag and refine commit logs V8: Change the license to SPDX tag. Change USE_C11_MEM_MODEL to "n" on any architectures by default V7: fix check-git-log warnings which is suggested by Jerin V6: minor change in subject and log V5: split it into 2 patchset due to the milestone concerns, this is the 2st one. Also fix checkpatch.pl warnings V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (3): eal/arm64: remove the braces {} for dmb() and dsb() ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_armv8a_linuxapp | 2 + config/common_base | 5 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 173 ++-- lib/librte_ring/rte_ring_c11_mem.h | 218 lib/librte_ring/rte_ring_generic.h | 228 + 8 files changed, 476 insertions(+), 164 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
[dpdk-dev] [PATCH v9 3/3] ring: introduce new header file to support C11 memory model
This patch is to support C11 memory model barrier in librte_ring. There are 2 barrier implementation options in librte_ring (suggested by Jerin). 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). The reason why providing 2 options is the performance benchmark difference in different arm machines, refer to [2]. CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n" on any architectures and only "y" on arm64 so far. [1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [2] http://dpdk.org/ml/archives/dev/2017-October/080861.html Signed-off-by: Jia He Suggested-by: Jerin Jacob Acked-by: Jerin Jacob Acked-by: Olivier Matz Acked-by: Jianbo Liu --- config/common_armv8a_linuxapp | 2 + config/common_base | 5 + lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 14 ++- lib/librte_ring/rte_ring_c11_mem.h | 218 + 5 files changed, 240 insertions(+), 2 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp index 790e716..0a1f370 100644 --- a/config/common_armv8a_linuxapp +++ b/config/common_armv8a_linuxapp @@ -34,3 +34,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n CONFIG_RTE_LIBRTE_AVP_PMD=n CONFIG_RTE_SCHED_VECTOR=n + +CONFIG_RTE_RING_USE_C11_MEM_MODEL=y diff --git a/config/common_base b/config/common_base index 170a389..0aadda8 100644 --- a/config/common_base +++ b/config/common_base @@ -843,3 +843,8 @@ CONFIG_RTE_APP_CRYPTO_PERF=y # Compile the eventdev application # CONFIG_RTE_APP_EVENTDEV=y + +# +# Use C11 memory model barrier in Librte_ring +# +CONFIG_RTE_RING_USE_C11_MEM_MODEL=n diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index 281ffb7..bde8907 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ - rte_ring_generic.h + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index bc3756a..cbf2f19 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -357,8 +357,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -/* Move common functions to generic file */ +/* Between load and load. there might be cpu reorder in weak model + * (powerpc/arm). + * There are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier,defined by + * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y + * It depends on performance test results. + * By default, move common functions to rte_ring_generic.h + */ +#ifdef RTE_RING_USE_C11_MEM_MODEL +#include "rte_ring_c11_mem.h" +#else #include "rte_ring_generic.h" +#endif /** * @internal Enqueue several objects on the ring diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..8ccb937 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,218 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 HXT-semitech Corporation. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of HXT-semitech Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SO
[dpdk-dev] [PATCH v9 2/3] ring: introduce new header file to include common functions
Move the common part of rte_ring.h into rte_ring_generic.h. Move the memory barrier part into update_tail(). No functional changes here. Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev Konstantin Acked-by: Jerin Jacob Acked-by: Olivier Matz --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 161 + lib/librte_ring/rte_ring_generic.h | 228 +++ 4 files changed, 236 insertions(+), 162 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index 480d1d0..29d4228 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -98,9 +98,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -140,9 +139,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index f5f0d35..281ffb7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -17,6 +17,7 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 7069d52..bc3756a 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -357,91 +357,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueued. - * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only. - */ -static __rte_always_inline unsigned int -__rte_ring_move_prod_head(struct rte_ring *r, int is_sp, - unsigned int n, enum rte_ring_queue_behavior behavior, - uint32_t *old_head, uint32_t *new_head, - uint32_t *free_entries) -{ - const uint32_t capacity = r->capacity; - unsigned int max = n; - int success; - - do { - /* Reset n to the initial burst count */ - n = max; - - *old_head = r->prod.head; - - /* add rmb barrier to avoid load/load reorder in weak -* memory model. It is noop on x86 -*/ - rte_smp_rmb(); - - const uint32_t cons_tail = r->cons.tail; - /* -* The subtraction is done between two unsigned 32bits value -* (the result is always modulo 32 bits even if we have -* *old_head > cons_tail). So 'free_entries' is always between 0 -* and capacity (which is < size). -*/ - *free_entries = (capacity + cons_tail - *old_head); - - /* check that we have enough room in ring */ - if (unlikely(
Re: [dpdk-dev] [PATCH v8 2/3] ring: introduce new header file to include common functions
Hi hermant On 1/22/2018 1:15 PM, Hemant Agrawal Wrote: Hi Jia, On 1/22/2018 7:23 AM, Jia He wrote: This is BSD-2-freebsd, which is not a approved license for DPDK. Can you ask Kip Macy, if he/she is ok to re-license it with BSD-3? Please check with legal, if you can just keep the copyright of Kip Macy and re license it with BSD-3. I see the BSD-3 license to be permissive enough to be re-licensed as BSD-3. But I am not a lawyer. I agree this is something we should do, as a maintainer of librte_ring, I can do it. But here, Jia is just moving code in a new file. I don't think this should block his patchset from beeing included. [Hemant] I thought of blocking this kind of moves, so that we get the license complaint of DPDK faster 😊 Jia, shall keep the original copyrights and headers in this file (i.e. No SPDX). You need to fix it along with rte_ring.h in near future. Regards, Hemant Ok, I will Besides ,I got the allowance from Kip Macy just now. He/She allowed dpdk to license librte_ring.h as BSD-3. My question: Would you mind allowing dpdk librte_ring.h to be licensed as BSD 3 instead of BSD 2? His/her reply: "I think that's fine. If you're using it be careful I think there's a fix to memory barrier usage needed more relaxed memory models such as ARM. I'll check reviews to see if it made it in or not." That is good. Your Patch v9 looks ok. Will you please also add another patch for following: (all files in librte_ring - having BSD-2 license) /* SPDX-License-Identifier: BSD-3-Clause * * Copyright . (Intel or your company) * Copyright (c) 2007-2009 Kip Macy km...@freebsd.org * All rights reserved. * Derived from FreeBSD's bufring.h * Used as BSD-3 Licensed with permission from Kip Macy. */ With pleasure, thanks -- Cheers, Jia
[dpdk-dev] [PATCH] ring: convert license headers to SPDX tags
Signed-off-by: Jia He --- lib/librte_ring/rte_ring.c | 66 +++--- lib/librte_ring/rte_ring.h | 66 +++--- lib/librte_ring/rte_ring_c11_mem.h | 65 +++-- lib/librte_ring/rte_ring_generic.h | 66 +++--- 4 files changed, 20 insertions(+), 243 deletions(-) diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c index 036467f..d215ace 100644 --- a/lib/librte_ring/rte_ring.c +++ b/lib/librte_ring/rte_ring.c @@ -1,67 +1,11 @@ -/*- - * BSD LICENSE - * - * Copyright(c) 2010-2015 Intel Corporation. All rights reserved. - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in - * the documentation and/or other materials provided with the - * distribution. - * * Neither the name of Intel Corporation nor the names of its - * contributors may be used to endorse or promote products derived - * from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS - * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT - * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR - * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT - * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, - * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT - * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, - * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY - * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE - * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - */ - -/* - * Derived from FreeBSD's bufring.c - * - ** +/* SPDX-License-Identifier: BSD-3-Clause * + * Copyright (c) 2010-2015 Intel Corporation * Copyright (c) 2007,2008 Kip Macy km...@freebsd.org * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, - *this list of conditions and the following disclaimer. - * - * 2. The name of Kip Macy nor the names of other - *contributors may be used to endorse or promote products derived from - *this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" - * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE - * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE - * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR - * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF - * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS - * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN - * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) - * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE - * POSSIBILITY OF SUCH DAMAGE. - * - ***/ + * Derived from FreeBSD's bufring.h + * Used as BSD-3 Licensed with permission from Kip Macy. + */ #include #include diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index cbf2f19..253cdc9 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -1,67 +1,11 @@ -/*- - * BSD LICENSE - * - * Copyright(c) 2010-2017 Intel Corporation. All rights reserved. - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in - * the documentation and/or other materials provided with the - * distributi
[dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Before this patch: In __rte_ring_move_cons_head() ... do { /* Restore n as it may change every loop */ n = max; *old_head = r->cons.head;//1st load const uint32_t prod_tail = r->prod.tail; //2nd load In weak memory order architectures(powerpc,arm), the 2nd load might be reodered before the 1st load, that makes *entries is bigger than we wanted. This nasty reording messed enque/deque up. cpu1(producer) cpu2(consumer) cpu3(consumer) load r->prod.tail in enqueue: load r->cons.tail load r->prod.head store r->prod.tail load r->cons.head load r->prod.tail ... store r->cons.{head,tail} load r->cons.head THEN,r->cons.head will be bigger than prod_tail, then make *entries very big After this patch, the old cons.head will be recaculated after failure of rte_atomic32_cmpset There is no such issue in X86 cpu, because X86 is strong memory order model Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com --- lib/librte_ring/rte_ring.h | 8 1 file changed, 8 insertions(+) diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..15c72e2 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, n = max; *old_head = r->prod.head; + + /* load of prod.tail can't be reordered before cons.head */ + rte_smp_rmb(); + const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, n = max; *old_head = r->cons.head; + + /* load of prod.tail can't be reordered before cons.head */ + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have -- 2.7.4
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 10/13/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 12 Oct 2017 17:05:50 + From: "Ananyev, Konstantin" To: Olivier MATZ , Jia He CC: "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "jerin.ja...@caviumnetworks.com" Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue Hi guys, -Original Message- From: Olivier MATZ [mailto:olivier.m...@6wind.com] Sent: Thursday, October 12, 2017 4:54 PM To: Jia He Cc: dev@dpdk.org; jia...@hxt-semitech.com; jie2@hxt-semitech.com; bing.z...@hxt-semitech.com; Ananyev, Konstantin ; jerin.ja...@caviumnetworks.com Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue Hi, On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote: Before this patch: In __rte_ring_move_cons_head() ... do { /* Restore n as it may change every loop */ n = max; *old_head = r->cons.head;//1st load const uint32_t prod_tail = r->prod.tail; //2nd load In weak memory order architectures(powerpc,arm), the 2nd load might be reodered before the 1st load, that makes *entries is bigger than we wanted. This nasty reording messed enque/deque up. cpu1(producer) cpu2(consumer) cpu3(consumer) load r->prod.tail in enqueue: load r->cons.tail load r->prod.head store r->prod.tail load r->cons.head load r->prod.tail ... store r->cons.{head,tail} load r->cons.head THEN,r->cons.head will be bigger than prod_tail, then make *entries very big After this patch, the old cons.head will be recaculated after failure of rte_atomic32_cmpset There is no such issue in X86 cpu, because X86 is strong memory order model Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com --- lib/librte_ring/rte_ring.h | 8 1 file changed, 8 insertions(+) diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..15c72e2 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, n = max; *old_head = r->prod.head; + + /* load of prod.tail can't be reordered before cons.head */ + rte_smp_rmb(); + const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, n = max; *old_head = r->cons.head; + + /* load of prod.tail can't be reordered before cons.head */ + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have -- 2.7.4 The explanation convinces me. However, since it's in a critical path, it would be good to have other opinions. This patch reminds me this discussion, that was also related to memory barrier, but at another place: http://dpdk.org/ml/archives/dev/2016-July/043765.html Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3 Konstatin, Jerin, do you have any comment? For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference, but I can't see how read reordering would screw things up here... Probably just me and arm or ppc guys could explain what will be the problem if let say cons.tail will be read before prod.head in __rte_ring_move_prod_head(). I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)? Probably new test-case for rte_ring autotest is needed, or is it possible to reproduce it with existing one? On the same lines, Jia He, jie2.liu, bing.zhao, Is this patch based on code review or do you saw this issue on any of the arm/ppc target? arm64 will have performance impact with this change. Based on mbuf_autotest, the rte_panic will be invoked in seconds. PANIC in test_refcnt_iter(): (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free 1: [./test(rte_dump_stack+0x38) [0x58d868]] Aborted (core dumped) Cheers, Jia Konstantin
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
On 10/13/2017 9:02 AM, Jia He Wrote: Hi Jerin On 10/13/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 12 Oct 2017 17:05:50 + [...] On the same lines, Jia He, jie2.liu, bing.zhao, Is this patch based on code review or do you saw this issue on any of the arm/ppc target? arm64 will have performance impact with this change. sorry, miss one important information Our platform is an aarch64 server with 46 cpus. If we reduced the involved cpu numbers, the bug occurred less frequently. Yes, mb barrier impact the performance, but correctness is more important, isn't it ;-) Maybe we can find any other lightweight barrier here? Cheers, Jia Based on mbuf_autotest, the rte_panic will be invoked in seconds. PANIC in test_refcnt_iter(): (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free 1: [./test(rte_dump_stack+0x38) [0x58d868]] Aborted (core dumped) Cheers, Jia Konstantin
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi On 10/13/2017 9:02 AM, Jia He Wrote: Hi Jerin On 10/13/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 12 Oct 2017 17:05:50 + [...] On the same lines, Jia He, jie2.liu, bing.zhao, Is this patch based on code review or do you saw this issue on any of the arm/ppc target? arm64 will have performance impact with this change. sorry, miss one important information Our platform is an aarch64 server with 46 cpus. If we reduced the involved cpu numbers, the bug occurred less frequently. Yes, mb barrier impact the performance, but correctness is more important, isn't it ;-) Maybe we can find any other lightweight barrier here? Cheers, Jia Based on mbuf_autotest, the rte_panic will be invoked in seconds. PANIC in test_refcnt_iter(): (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free 1: [./test(rte_dump_stack+0x38) [0x58d868]] Aborted (core dumped) Cheers, Jia Konstantin
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 10/13/2017 9:49 AM, Jerin Jacob Wrote: -Original Message- Date: Fri, 13 Oct 2017 09:16:31 +0800 From: Jia He To: Jerin Jacob , "Ananyev, Konstantin" Cc: Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 Hi On 10/13/2017 9:02 AM, Jia He Wrote: Hi Jerin On 10/13/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 12 Oct 2017 17:05:50 + [...] On the same lines, Jia He, jie2.liu, bing.zhao, Is this patch based on code review or do you saw this issue on any of the arm/ppc target? arm64 will have performance impact with this change. sorry, miss one important information Our platform is an aarch64 server with 46 cpus. Is this an OOO(Out of order execution) aarch64 CPU implementation? I think so, it is a server cpu (ARMv8-A), but do you know how to confirm it? cat /proc/cpuinfo processor : 0 BogoMIPS : 40.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm CPU implementer : 0x51 CPU architecture: 8 CPU variant : 0x0 CPU part : 0x800 CPU revision : 0 If we reduced the involved cpu numbers, the bug occurred less frequently. Yes, mb barrier impact the performance, but correctness is more important, isn't it ;-) Yes. Maybe we can find any other lightweight barrier here? Yes, Regarding the lightweight barrier, arm64 has native support for acquire and release semantics, which is exposed through gcc as architecture agnostic functions. https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html http://preshing.com/20130922/acquire-and-release-fences/ Good to know, 1) How much overhead this patch in your platform? Just relative numbers are enough I create a *standalone* test case for test_mbuf Attached the debug patch It is hard to believe but the truth is that the performance after adding rmb barrier is better than no adding. With this patch (4 times running) time ./test_good --no-huge -l 1-20 real 0m23.311s user 7m21.870s sys 0m0.021s time ./test_bad --no-huge -l 1-20 Without this patch real 0m38.972s user 12m35.271s sys 0m0.030s Cheers, Jia 2) As a prototype, Is Changing to acquire and release schematics reduces the overhead in your platform? Reference FreeBSD ring/DPDK style ring implementation through acquire and release schematics https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c I will also spend on cycles on this. Cheers, Jia Based on mbuf_autotest, the rte_panic will be invoked in seconds. PANIC in test_refcnt_iter(): (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free 1: [./test(rte_dump_stack+0x38) [0x58d868]] Aborted (core dumped) Cheers, Jia Konstantin diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..23168e7 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -517,6 +517,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, n = max; *old_head = r->cons.head; + rte_smp_rmb(); const uint32_t prod_tail = r->prod.tail; /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have diff --git a/test/test/test.c b/test/test/test.c index 9accbd1..6910b06 100644 --- a/test/test/test.c +++ b/test/test/test.c @@ -61,6 +61,7 @@ extern cmdline_parse_ctx_t main_ctx[]; #include "test.h" +extern int test_mbuf(void); #define RTE_LOGTYPE_APP RTE_LOGTYPE_USER1 const char *prgname; /* to be set to argv[0] */ @@ -135,6 +136,9 @@ main(int argc, char **argv) RTE_LOG(INFO, APP, "HPET is not enabled, using TSC as default timer\n"); + test_mbuf(); + printf("test_mbuf done\n"); + return 0; #ifdef RTE_LIBRTE_CMDLINE cl = cmdline_stdin_new(main_ctx, "RTE>>"); diff --git a/test/test/test_mbuf.c b/test/test/test_mbuf.c index 3396b4a..7277260 100644 --- a/test/test/test_mbuf.c +++ b/test/test/test_mbuf.c @@ -59,7 +59,6 @@ #include #include "test.h" - #define MBUF_DATA_SIZE 2048 #define NB_MBUF 128 #define MBUF_TEST_DATA_LEN 1464 @@ -85,7 +84,7 @@ static volatile uint32_t refcnt_stop_slaves; static unsigned refcnt_lcore[RTE_MAX_LCORE]; - +int test_mbuf(void); #endif /* @@ -718,7 +717,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, for (i = 0, n = rte_mempool_avail_count(refcnt_pool); i != n && (m = rte_pktmbuf_alloc(refcnt_pool)) != NULL; i++) { -
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi On 10/13/2017 3:33 PM, Jianbo Liu Wrote: The 10/13/2017 07:19, Jerin Jacob wrote: -Original Message- Date: Fri, 13 Oct 2017 09:16:31 +0800 From: Jia He To: Jerin Jacob , "Ananyev, Konstantin" Cc: Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 Hi On 10/13/2017 9:02 AM, Jia He Wrote: Hi Jerin On 10/13/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 12 Oct 2017 17:05:50 + [...] On the same lines, Jia He, jie2.liu, bing.zhao, Is this patch based on code review or do you saw this issue on any of the arm/ppc target? arm64 will have performance impact with this change. sorry, miss one important information Our platform is an aarch64 server with 46 cpus. Is this an OOO(Out of order execution) aarch64 CPU implementation? If we reduced the involved cpu numbers, the bug occurred less frequently. Yes, mb barrier impact the performance, but correctness is more important, isn't it ;-) Yes. Maybe we can find any other lightweight barrier here? Yes, Regarding the lightweight barrier, arm64 has native support for acquire and release semantics, which is exposed through gcc as architecture agnostic functions. https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html http://preshing.com/20130922/acquire-and-release-fences/ Good to know, 1) How much overhead this patch in your platform? Just relative numbers are enough 2) As a prototype, Is Changing to acquire and release schematics reduces the overhead in your platform? +1, can you try what ODP does in the link mentioned below? Sure, pls see the result: root@server:~/odp/test/linux-generic/ring# ./ring_main HW time counter freq: 2000 hz _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory _ishm.c:880:_odp_ishm_reserve():No huge pages, fall back to normal pages. check: /proc/sys/vm/nr_hugepages. _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory PKTIO: initialized loop interface. PKTIO: initialized ipc interface. PKTIO: initialized socket mmap, use export ODP_PKTIO_DISABLE_SOCKET_MMAP=1 to disable. PKTIO: initialized socket mmsg,use export ODP_PKTIO_DISABLE_SOCKET_MMSG=1 to disable. _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory ODP API version: 1.15.0 ODP implementation name: "odp-linux" ODP implementation version: "odp-linux" 1.15.0-0 (v1.15.0) 1.15.0.0 CUnit - A unit testing framework for C - Version 2.1-3 http://cunit.sourceforge.net/ Suite: ring basic Test: ring_test_basic_create ...pktio/ring.c:177:_ring_create():Requested size is invalid, must be power of 2,and do not exceed the size limit 268435455 _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory _ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory passed Test: ring_test_basic_burst ...passed Test: ring_test_basic_bulk ...passed Test: ring_test_basic_watermark ...passed_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory Suite: ring stress Test: ring_test_stress_1_1_producer_consumer ...passed Test: ring_test_stress_1_N_producer_consumer ...passed Test: ring_test_stress_N_1_producer_consumer ...passed Test: ring_test_stress_N_M_producer_consumer ... Cheers, Jia Reference FreeBSD ring/DPDK style ring implementation through acquire and release schematics https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c I will also spend on cycles on this. Cheers, Jia Based on mbuf_autotest, the rte_panic will be invoked in seconds. PANIC in test_refcnt_iter(): (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free 1: [./test(rte_dump_stack+0x38) [0x58d868]] Aborted (core dumped) Cheers, Jia Konstantin -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
On 10/20/2017 4:02 AM, Ananyev, Konstantin Wrote: -Original Message- From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Ananyev, Konstantin Sent: Thursday, October 19, 2017 3:15 PM To: Zhao, Bing ; Jia He ; Jerin Jacob Cc: Olivier MATZ ; dev@dpdk.org; jia...@hxt-semitech.com; jie2@hxt-semitech.com; bing.zhao@hxt- semitech.com Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue Hi, On 2017/10/19 18:02, Ananyev, Konstantin wrote: Hi Jia, Hi On 10/13/2017 9:02 AM, Jia He Wrote: Hi Jerin On 10/13/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 12 Oct 2017 17:05:50 + [...] On the same lines, Jia He, jie2.liu, bing.zhao, Is this patch based on code review or do you saw this issue on any of the arm/ppc target? arm64 will have performance impact with this change. sorry, miss one important information Our platform is an aarch64 server with 46 cpus. If we reduced the involved cpu numbers, the bug occurred less frequently. Yes, mb barrier impact the performance, but correctness is more important, isn't it ;-) Maybe we can find any other lightweight barrier here? Cheers, Jia Based on mbuf_autotest, the rte_panic will be invoked in seconds. PANIC in test_refcnt_iter(): (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free 1: [./test(rte_dump_stack+0x38) [0x58d868]] Aborted (core dumped) So is it only reproducible with mbuf refcnt test? Could it be reproduced with some 'pure' ring test (no mempools/mbufs refcnt, etc.)? The reason I am asking - in that test we also have mbuf refcnt updates (that's what for that test was created) and we are doing some optimizations here too to avoid excessive atomic updates. BTW, if the problem is not reproducible without mbuf refcnt, can I suggest to extend the test with: - add a check that enqueue() operation was successful - walk through the pool and check/printf refcnt of each mbuf. Hopefully that would give us some extra information what is going wrong here. Konstantin Currently, the issue is only found in this case here on the ARM platform, not sure how it is going with the X86_64 platform I understand that it is only reproducible on arm so far. What I am asking - with dpdk is there any other way to reproduce it (on arm) except then running mbuf_autotest? Something really simple that not using mbuf/mempool etc? Just do dequeue/enqueue from multiple threads and check data integrity at the end? If not - what makes you think that the problem is precisely in rte_ring code? Why not in rte_mbuf let say? Actually I reread your original mail and finally get your point. If I understand you correctly the problem with read reordering here is that after we read prot.tail but before we read cons.head both cons.head and prod.tail might be updated, but for us prod.tail change might be unnoticed. As an example: time 0 (cons.head == 0, prod.tail == 0): prod_tail = r->prod.tail; /* due read reordering */ /* prod_tail == 0 */ time 1 (cons.head ==5, prod.tail == 5): *old_head = r->cons.head; /* cons.head == 5 */ *entries = (prod_tail - *old_head); /* *entries == (0 - 5) == 0xFFFB */ And we will move our cons.head forward, even though there are no filled entries in the ring. Is that right? Yes As I side notice, I wonder why we have here: *entries = (prod_tail - *old_head); instead of: *entries = r->size + prod_tail - *old_head; ? Yes, I agree with you at this code line. But reordering will still mess up things even after this change(I have tested, still the same as before) I thought the *entries is a door to prevent consumer from moving forward too fast than the producer. But in some cases, it is correct that prod_tail is smaller than *old_head due to the cirular queue. In other cases, it is incorrect because of read/read reordering. AFAICT, the root cause here is the producer tail and cosumer head are dependant on each other. Thus a memory barrier is neccessary. Cheers, Jia Konstantin . In another mail of this thread, we've made a simple test based on this and captured some information and I pasted there.(I pasted the patch there :-)) Are you talking about that one: http://dpdk.org/dev/patchwork/patch/30405/ ? It still uses test/test/test_mbuf.c..., but anyway I don't really understand how mbuf_autotest supposed to work with these changes: @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, rte_ring_enqueue(refcnt_mbuf_ring, m); } } - rte_pktmbuf_free(m); + // rte_pktmbuf_free(m); } @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, while (!rte_ring_empty(refcnt_mbuf_ring)) ; + if (NULL != m) { + if (1 != rte_mbuf_refcnt_read(m)) + printf("m ref is %u\n", rte_mbuf_re
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 10/20/2017 1:43 PM, Jerin Jacob Wrote: -Original Message- [...] dependant on each other. Thus a memory barrier is neccessary. Yes. The barrier is necessary. In fact, upstream freebsd fixed this issue for arm64. DPDK ring implementation is derived from freebsd's buf_ring.h. https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 I think, the only outstanding issue is, how to reduce the performance impact for arm64. I believe using accurate/release semantics instead of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below, freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c Jia, 1) Can you verify the use of accurate/release semantics fixes the problem in your platform? like use of atomic_load_acq* in the reference code. 2) If so, What is the overhead between accurate/release and plane smp_smb() barriers. Based on that we need decide what path to take. I've tested 3 cases. The new 3rd case is to use the load_acquire barrier (half barrier) you mentioned at above link. The patch seems like: @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, /* Reset n to the initial burst count */ n = max; - *old_head = r->prod.head; - const uint32_t cons_tail = r->cons.tail; + *old_head = atomic_load_acq_32(&r->prod.head); + const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail); @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s /* Restore n as it may change every loop */ n = max; - *old_head = r->cons.head; - const uint32_t prod_tail = r->prod.tail; + *old_head = atomic_load_acq_32(&r->cons.head); + const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail) /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have * cons_head > prod_tail). So 'entries' is always between 0 * and size(ring)-1. */ The half barrier patch passed the fuctional test. As for the performance comparision on *arm64*(the debug patch is at http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the test results below: [case 1] old codes, no barrier Performance counter stats for './test --no-huge -l 1-10': 689275.001200 task-clock (msec) # 9.771 CPUs utilized 6223 context-switches # 0.009 K/sec 10 cpu-migrations # 0.000 K/sec 653 page-faults # 0.001 K/sec 1721190914583 cycles # 2.497 GHz 3363238266430 instructions # 1.95 insn per cycle branches 27804740 branch-misses # 0.00% of all branches 70.540618825 seconds time elapsed [case 2] full barrier with rte_smp_rmb() Performance counter stats for './test --no-huge -l 1-10': 582557.895850 task-clock (msec) # 9.752 CPUs utilized 5242 context-switches # 0.009 K/sec 10 cpu-migrations # 0.000 K/sec 665 page-faults # 0.001 K/sec 1454360730055 cycles # 2.497 GHz 587197839907 instructions # 0.40 insn per cycle branches 27799687 branch-misses # 0.00% of all branches 59.735582356 seconds time elapse [case 1] half barrier with load_acquire Performance counter stats for './test --no-huge -l 1-10': 660758.877050 task-clock (msec) # 9.764 CPUs utilized 5982 context-switches # 0.009 K/sec 11 cpu-migrations # 0.000 K/sec 657 page-faults # 0.001 K/sec 1649875318044 cycles # 2.497 GHz 591583257765 instructions # 0.36 insn per cycle branches 27994903 branch-misses # 0.00% of all branches 67.672855107 seconds time elapsed Please see the context-switches in the perf results test result sorted by time is: full barrier < half barrier < no barrier AFAICT, in this case ,the cpu reordering will add the possibility for context switching and increase the running time. Any ideas? Cheers, Jia Note: This issue wont come in all the arm64 implementation. it comes on arm64 implementation with OOO(out of order) implementations. Cheers, Jia
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 10/23/2017 6:06 PM, Jerin Jacob Wrote: -Original Message- Date: Mon, 23 Oct 2017 16:49:01 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin On 10/20/2017 1:43 PM, Jerin Jacob Wrote: -Original Message- [...] dependant on each other. Thus a memory barrier is neccessary. Yes. The barrier is necessary. In fact, upstream freebsd fixed this issue for arm64. DPDK ring implementation is derived from freebsd's buf_ring.h. https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 I think, the only outstanding issue is, how to reduce the performance impact for arm64. I believe using accurate/release semantics instead of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below, freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c Jia, 1) Can you verify the use of accurate/release semantics fixes the problem in your platform? like use of atomic_load_acq* in the reference code. 2) If so, What is the overhead between accurate/release and plane smp_smb() barriers. Based on that we need decide what path to take. I've tested 3 cases. The new 3rd case is to use the load_acquire barrier (half barrier) you mentioned at above link. The patch seems like: @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, /* Reset n to the initial burst count */ n = max; - *old_head = r->prod.head; - const uint32_t cons_tail = r->cons.tail; + *old_head = atomic_load_acq_32(&r->prod.head); + const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail); @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s /* Restore n as it may change every loop */ n = max; - *old_head = r->cons.head; - const uint32_t prod_tail = r->prod.tail; + *old_head = atomic_load_acq_32(&r->cons.head); + const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail) /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have * cons_head > prod_tail). So 'entries' is always between 0 * and size(ring)-1. */ The half barrier patch passed the fuctional test. As for the performance comparision on *arm64*(the debug patch is at http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the test results below: [case 1] old codes, no barrier Performance counter stats for './test --no-huge -l 1-10': 689275.001200 task-clock (msec) # 9.771 CPUs utilized 6223 context-switches # 0.009 K/sec 10 cpu-migrations # 0.000 K/sec 653 page-faults # 0.001 K/sec 1721190914583 cycles # 2.497 GHz 3363238266430 instructions # 1.95 insn per cycle branches 27804740 branch-misses # 0.00% of all branches 70.540618825 seconds time elapsed [case 2] full barrier with rte_smp_rmb() Performance counter stats for './test --no-huge -l 1-10': 582557.895850 task-clock (msec) # 9.752 CPUs utilized 5242 context-switches # 0.009 K/sec 10 cpu-migrations # 0.000 K/sec 665 page-faults # 0.001 K/sec 1454360730055 cycles # 2.497 GHz 587197839907 instructions # 0.40 insn per cycle branches 27799687 branch-misses # 0.00% of all branches 59.735582356 seconds time elapse [case 1] half barrier with load_acquire Performance counter stats for './test --no-huge -l 1-10': 660758.877050 task-clock (msec) # 9.764 CPUs utilized 5982 context-switches # 0.009 K/sec 11 cpu-migrations # 0.000 K/sec 657 page-faults
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 10/25/2017 9:26 PM, Jerin Jacob Wrote: -Original Message- Date: Tue, 24 Oct 2017 10:04:26 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin Hi Jia, example: ./build/app/test -c 0xff -n 4 ring_perf_autotest Seem in our arm64 server, the ring_perf_autotest will be finished in a few seconds: Yes. It just need a few seconds. Anything wrong about configuration or environment setup? By default, arm64+dpdk will be using el0 counter to measure the cycles. I think, in your SoC, it will be running at 50MHz or 100MHz.So, You can follow the below scheme to get accurate cycle measurement scheme: See: http://dpdk.org/doc/guides/prog_guide/profile_app.html check: 44.2.2. High-resolution cycle counter Thank you for the suggestions. But I tried your provided ko module to enable the accurate cycle measurement in user space, the test result is as below: root@nfv-demo01:~/dpdk/build/build/test/test# lsmod |grep pmu pmu_el0_cycle_counter 262144 0 [old codes, without any patches] RTE>>ring_perf_autotest ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 0 MP/MC single enq/dequeue: 0 SP/SC burst enq/dequeue (size: 8): 0 MP/MC burst enq/dequeue (size: 8): 0 SP/SC burst enq/dequeue (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0 ### Testing empty dequeue ### SC empty dequeue: 0.00 MC empty dequeue: 0.00 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 0.00 MP/MC bulk enq/dequeue (size: 8): 0.00 SP/SC bulk enq/dequeue (size: 32): 0.00 MP/MC bulk enq/dequeue (size: 32): 0.00 ### Testing using two hyperthreads ### SP/SC bulk enq/dequeue (size: 8): 0.00 MP/MC bulk enq/dequeue (size: 8): 0.00 SP/SC bulk enq/dequeue (size: 32): 0.00 MP/MC bulk enq/dequeue (size: 32): 0.00 Test OK [with full rte_smp_rmb barrier patch] == RTE>>ring_perf_autotest ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 0 MP/MC single enq/dequeue: 0 SP/SC burst enq/dequeue (size: 8): 0 MP/MC burst enq/dequeue (size: 8): 0 SP/SC burst enq/dequeue (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0 ### Testing empty dequeue ### SC empty dequeue: 0.00 MC empty dequeue: 0.00 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 0.00 MP/MC bulk enq/dequeue (size: 8): 0.00 SP/SC bulk enq/dequeue (size: 32): 0.00 MP/MC bulk enq/dequeue (size: 32): 0.00 ### Testing using two hyperthreads ### SP/SC bulk enq/dequeue (size: 8): 0.00 MP/MC bulk enq/dequeue (size: 8): 0.00 SP/SC bulk enq/dequeue (size: 32): 0.00 MP/MC bulk enq/dequeue (size: 32): 0.00 Test OK RTE>> No difference,all time is 0 ? If I rmmod pmu_el0_cycle_counter and revise the ./build/.config to comment the config line #CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU=y Then the time is bigger than 0 root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4 EAL: Detected 44 lcore(s) EAL: Probing VFIO support... APP: HPET is not enabled, using TSC as default timer RTE>>per_lcore_autotest RTE>>ring_perf_autotest ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 0 MP/MC single enq/dequeue: 2 SP/SC burst enq/dequeue (size: 8): 0 If you follow the above link, The value '0' will be replaced with more meaning full data. MP/MC burst enq/dequeue (size: 8): 0 SP/SC burst enq/dequeue (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0 ### Testing empty dequeue ### SC empty dequeue: 0.02 MC empty dequeue: 0.04 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 0.12 MP/MC bulk enq/dequeue (size: 8): 0.31 SP/SC bulk enq/dequeue (size: 32): 0.05 MP/MC bulk enq/dequeue (size: 32): 0.09 ### Testing using two hyperthreads ### SP/SC bulk enq/dequeue (size: 8): 0.12 MP/MC bulk enq/dequeue (size: 8): 0.39 SP/SC bulk enq/dequeue (size: 32): 0.04 MP/MC bulk enq/dequeue (size: 32): 0.12 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 0.37 MP/MC bulk enq/dequeue (size: 8): 0.92 SP/SC bulk enq/dequeue (size: 32): 0.12 MP/MC bulk enq/dequeue (size: 32): 0.26 Test OK RTE>> Cheers, Jia By default, arm64+dpdk will be using el0 counter to measure the cycles. I think, in your SoC, it will be running at 50MHz or 100MHz.So, You can follow the below scheme to get accurate cycle measurement scheme: See: http://dpdk.org/doc/guides/prog_guide/profile_app.html check: 44.2.2. High-resolution cycle counter -- Cheers, Jia
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin Do you think next step whether I need to implement the load_acquire half barrier as per freebsd or find any other performance test case to compare the performance impact? Thanks for any suggestions. Cheers, Jia On 10/25/2017 9:26 PM, Jerin Jacob Wrote: -Original Message- Date: Tue, 24 Oct 2017 10:04:26 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin Hi Jia, example: ./build/app/test -c 0xff -n 4 ring_perf_autotest Seem in our arm64 server, the ring_perf_autotest will be finished in a few seconds: Yes. It just need a few seconds. Anything wrong about configuration or environment setup? By default, arm64+dpdk will be using el0 counter to measure the cycles. I think, in your SoC, it will be running at 50MHz or 100MHz.So, You can follow the below scheme to get accurate cycle measurement scheme: See: http://dpdk.org/doc/guides/prog_guide/profile_app.html check: 44.2.2. High-resolution cycle counter root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4 EAL: Detected 44 lcore(s) EAL: Probing VFIO support... APP: HPET is not enabled, using TSC as default timer RTE>>per_lcore_autotest RTE>>ring_perf_autotest ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 0 MP/MC single enq/dequeue: 2 SP/SC burst enq/dequeue (size: 8): 0 If you follow the above link, The value '0' will be replaced with more meaning full data. MP/MC burst enq/dequeue (size: 8): 0 SP/SC burst enq/dequeue (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0 ### Testing empty dequeue ### SC empty dequeue: 0.02 MC empty dequeue: 0.04 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 0.12 MP/MC bulk enq/dequeue (size: 8): 0.31 SP/SC bulk enq/dequeue (size: 32): 0.05 MP/MC bulk enq/dequeue (size: 32): 0.09 ### Testing using two hyperthreads ### SP/SC bulk enq/dequeue (size: 8): 0.12 MP/MC bulk enq/dequeue (size: 8): 0.39 SP/SC bulk enq/dequeue (size: 32): 0.04 MP/MC bulk enq/dequeue (size: 32): 0.12 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 0.37 MP/MC bulk enq/dequeue (size: 8): 0.92 SP/SC bulk enq/dequeue (size: 32): 0.12 MP/MC bulk enq/dequeue (size: 32): 0.26 Test OK RTE>> Cheers, Jia By default, arm64+dpdk will be using el0 counter to measure the cycles. I think, in your SoC, it will be running at 50MHz or 100MHz.So, You can follow the below scheme to get accurate cycle measurement scheme: See: http://dpdk.org/doc/guides/prog_guide/profile_app.html check: 44.2.2. High-resolution cycle counter -- Cheers, Jia
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin Thanks for your suggestions. I will try to use config macro to let it be chosen by user. I need to point out one possible issue in your load_acq/store_rel patch at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch @@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, /* Restore n as it may change every loop */ n = max; +#if 0 *old_head = r->cons.head; const uint32_t prod_tail = r->prod.tail; +#else + *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED); --[1] + const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE); --[2] +#endif line [1] __ATOMIC_RELAXED is not enough for this case(tested in our ARM64 server). line [2] __ATOMIC_ACQUIRE guarantee the 2nd load will not be reorded before the 1st load, but will not guarantee the 1st load will not be reordered after the 2nd load. Please also refer to your mentioned freebsd implementation. They use __ATOMIC_ACQUIRE at line [1]. Should it be like instead? +#else + *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE); + const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE); Cheers, Jia On 10/31/2017 7:14 PM, Jerin Jacob Wrote: -Original Message- Date: Tue, 31 Oct 2017 10:55:15 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin Hi Jia, Do you think next step whether I need to implement the load_acquire half barrier as per freebsd I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine) and Platform B: arm64 OOO machine) smp_rmb() performs better in Platform A: acquire/release semantics perform better in platform B: Here is the patch: https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch In terms of next step: - I am not sure the cost associated with acquire/release semantics on x86 or ppc. IMO, We need to have both options under conditional compilation flags and let the target platform choose the best one. Thoughts? Here is the performance numbers: - Both platforms are running at different frequency, So absolute numbers does not matter, Just check the relative numbers. Platform A: Performance numbers: no patch(Non arm64 OOO machine) --- SP/SC single enq/dequeue: 40 MP/MC single enq/dequeue: 282 SP/SC burst enq/dequeue (size: 8): 11 MP/MC burst enq/dequeue (size: 8): 42 SP/SC burst enq/dequeue (size: 32): 8 MP/MC burst enq/dequeue (size: 32): 16 ### Testing empty dequeue ### SC empty dequeue: 8.01 MC empty dequeue: 11.01 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 11.30 MP/MC bulk enq/dequeue (size: 8): 42.85 SP/SC bulk enq/dequeue (size: 32): 8.25 MP/MC bulk enq/dequeue (size: 32): 16.46 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 20.62 MP/MC bulk enq/dequeue (size: 8): 56.30 SP/SC bulk enq/dequeue (size: 32): 10.94 MP/MC bulk enq/dequeue (size: 32): 18.66 Test OK # smp_rmb() patch((Non OOO arm64 machine) http://dpdk.org/dev/patchwork/patch/30029/ - SP/SC single enq/dequeue: 42 MP/MC single enq/dequeue: 291 SP/SC burst enq/dequeue (size: 8): 12 MP/MC burst enq/dequeue (size: 8): 44 SP/SC burst enq/dequeue (size: 32): 8 MP/MC burst enq/dequeue (size: 32): 16 ### Testing empty dequeue ### SC empty dequeue: 13.01 MC empty dequeue: 15.01 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 11.60 MP/MC bulk enq/dequeue (size: 8): 44.32 SP/SC bulk enq/dequeue (size: 32): 8.60 MP/MC bulk enq/dequeue (size: 32): 16.50 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 20.95 MP/MC bulk enq/dequeue (size: 8): 56.90 SP/SC bulk enq/dequeue (size: 32): 10.90 MP/MC bulk enq/dequeue (size: 32): 18.78 Test OK RTE>> # c11 memory model patch((Non OOO arm64 machine) https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch - ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 197 MP/MC single enq/dequeue: 328 SP/SC burst enq/dequeue (size: 8): 31 MP/MC burst enq/dequeue (size: 8): 50 SP/SC burst enq/dequeue (size:
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 10/31/2017 7:14 PM, Jerin Jacob Wrote: -Original Message- Date: Tue, 31 Oct 2017 10:55:15 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin Hi Jia, Do you think next step whether I need to implement the load_acquire half barrier as per freebsd I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine) and Platform B: arm64 OOO machine) Can you elaborate anything about your Non arm64 OOO machine? As I know, all arm64 server is strong memory order. Am I missed anything? -- Cheers, Jia
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 11/2/2017 3:04 AM, Jerin Jacob Wrote: Date: Thu, 2 Nov 2017 00:27:46 +0530 From: Jerin Jacob To: Jia He Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" , jianbo@arm.com, hemant.agra...@nxp.com Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue Message-ID: <20171101185723.GA18759@jerin> References: <2601191342ceee43887bde71ab9772585faab...@irsmsx103.ger.corp.intel.com> <3e580cd7-2854-d855-be9c-7c4ce06e3...@gmail.com> <20171020054319.GA4249@jerin> <20171023100617.GA17957@jerin> <20171025132642.GA13977@jerin> <2017103433.GA21742@jerin> <69adfb00-4582-b362-0540-d1d9d6bcf...@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <69adfb00-4582-b362-0540-d1d9d6bcf...@gmail.com> User-Agent: Mutt/1.9.1 (2017-09-22) -Original Message- Date: Wed, 1 Nov 2017 10:53:12 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" , jianbo@arm.com, hemant.agra...@nxp.com Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin Thanks for your suggestions. I will try to use config macro to let it be chosen by user. It is better, if we avoid #ifdef's in the common code. I think, you can do the scheme like examples/l3fwd/l3fwd_em_hlm_neon.h. Where, the common code will have all generic routine, difference will be moved to a separate files to reduce #ifdef clutter. I need to point out one possible issue in your load_acq/store_rel patch at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch @@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, /* Restore n as it may change every loop */ n = max; +#if 0 *old_head = r->cons.head; const uint32_t prod_tail = r->prod.tail; +#else + *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED); --[1] + const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE); --[2] +#endif line [1] __ATOMIC_RELAXED is not enough for this case(tested in our ARM64 server). line [2] __ATOMIC_ACQUIRE guarantee the 2nd load will not be reorded before the 1st load, but will not guarantee the 1st load will not be reordered after the 2nd load. Please also For me it looks same. [1] can not cross [2] is same as [2] cannot cross [1], if [1] are [2] at back to back. No ? No, IIUC, load_acquire(a) is equal to the pseudo codes: load(a) one-direction-barrier() instead of one-direction-barrier() load(a) Thus, in below cases, load(a) and load(b) can still be reordered, this is not the semantic violation of the load_acquire: load(a) load_acquire(b) IIUC, the orginal implementation in https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 is not optimal. I tested the changes as follow and it works fine: + *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE); const uint32_t prod_tail = r->prod.tail; i.e. load_acquire(a) load(b) Cheers, Jia
[dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server. As for the possible race condition, please refer to [1]. Furthermore, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by default it is n; The reason why providing 2 options is due to the performance benchmark difference in different arm machines, please refer to [3]. Already fuctionally tested on the machines as follows: on X86(passed the compilation) on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html --- Changelog: V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com Signed-off-by: jia...@hxt-semitech.com Suggested-by: jerin.ja...@caviumnetworks.com --- lib/librte_ring/Makefile | 4 +++- lib/librte_ring/rte_ring.h | 38 -- lib/librte_ring/rte_ring_arm64.h | 48 ++ lib/librte_ring/rte_ring_generic.h | 45 +++ 4 files changed, 127 insertions(+), 8 deletions(-) create mode 100644 lib/librte_ring/rte_ring_arm64.h create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..fa57a86 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,8 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_arm64.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..943b1f9 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -103,6 +103,18 @@ extern "C" { #include #include +/* In those strong memory models (e.g. x86), there is no need to add rmb() + * between load and load. + * In those weak models(powerpc/arm), there are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier + * It depends on performance test results. */ +#ifdef RTE_ARCH_ARM64 +#include "rte_ring_arm64.h" +#else +#include "rte_ring_generic.h" +#endif + #define RTE_TAILQ_RING_NAME "RTE_RING" enum rte_ring_queue_behavior { @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, while (unlikely(ht->tail != old_val)) rte_pause(); - ht->tail = new_val; + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE); } /** @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, /* Reset n to the initial burst count */ n = max; - *old_head = r->prod.head; + *old_head = arch_rte_atomic_load(&r->prod.head, + __ATOMIC_ACQUIRE); const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, if (is_sp) r->prod.head = *new_head, success = 1; else - success = rte_atomic32_cmpset(&r->prod.head, - *old_head, *new_head); + success = arch_rte_atomic32_cmpset(&r->prod.head, + old_head, *new_head, + 0, __ATOMIC_ACQUIRE, + __ATOMIC_RELAXED); } while (unlikely(success == 0)); return n; } @@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table, goto end; ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *); + +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER rte_smp_wmb(); +#endif update_tail(&r->prod, prod_head, prod_next, is_sp); end: @@ -516,7 +534,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, /* Restore n as it may change every loop */ n = max; - *old_head = r->cons.head; + *old_head = arch_rte_atomic_load(&r->cons
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi, Jerin please see my performance test below On 11/2/2017 3:04 AM, Jerin Jacob Wrote: [...] Should it be like instead? +#else + *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE); + const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE); It would be nice to see how much overhead it gives.ie back to back __ATOMIC_ACQUIRE. I can NOT test ring_perf_autotest in our server because of the something wrong in PMU counter. All the return value of rte_rdtsc is 0 with and without your provided ko module. I am still investigating the reason. I ever tested the difference with my debug patch, the difference is minor, less than +-1% -- Cheers, Jia Cheers, Jia On 10/31/2017 7:14 PM, Jerin Jacob Wrote: -Original Message- Date: Tue, 31 Oct 2017 10:55:15 +0800 From: Jia He To: Jerin Jacob Cc: "Ananyev, Konstantin" , "Zhao, Bing" , Olivier MATZ , "dev@dpdk.org" , "jia...@hxt-semitech.com" , "jie2@hxt-semitech.com" , "bing.z...@hxt-semitech.com" , "Richardson, Bruce" Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 Hi Jerin Hi Jia, Do you think next step whether I need to implement the load_acquire half barrier as per freebsd I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine) and Platform B: arm64 OOO machine) smp_rmb() performs better in Platform A: acquire/release semantics perform better in platform B: Here is the patch: https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch In terms of next step: - I am not sure the cost associated with acquire/release semantics on x86 or ppc. IMO, We need to have both options under conditional compilation flags and let the target platform choose the best one. Thoughts? Here is the performance numbers: - Both platforms are running at different frequency, So absolute numbers does not matter, Just check the relative numbers. Platform A: Performance numbers: no patch(Non arm64 OOO machine) --- SP/SC single enq/dequeue: 40 MP/MC single enq/dequeue: 282 SP/SC burst enq/dequeue (size: 8): 11 MP/MC burst enq/dequeue (size: 8): 42 SP/SC burst enq/dequeue (size: 32): 8 MP/MC burst enq/dequeue (size: 32): 16 ### Testing empty dequeue ### SC empty dequeue: 8.01 MC empty dequeue: 11.01 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 11.30 MP/MC bulk enq/dequeue (size: 8): 42.85 SP/SC bulk enq/dequeue (size: 32): 8.25 MP/MC bulk enq/dequeue (size: 32): 16.46 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 20.62 MP/MC bulk enq/dequeue (size: 8): 56.30 SP/SC bulk enq/dequeue (size: 32): 10.94 MP/MC bulk enq/dequeue (size: 32): 18.66 Test OK # smp_rmb() patch((Non OOO arm64 machine) http://dpdk.org/dev/patchwork/patch/30029/ - SP/SC single enq/dequeue: 42 MP/MC single enq/dequeue: 291 SP/SC burst enq/dequeue (size: 8): 12 MP/MC burst enq/dequeue (size: 8): 44 SP/SC burst enq/dequeue (size: 32): 8 MP/MC burst enq/dequeue (size: 32): 16 ### Testing empty dequeue ### SC empty dequeue: 13.01 MC empty dequeue: 15.01 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 11.60 MP/MC bulk enq/dequeue (size: 8): 44.32 SP/SC bulk enq/dequeue (size: 32): 8.60 MP/MC bulk enq/dequeue (size: 32): 16.50 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 20.95 MP/MC bulk enq/dequeue (size: 8): 56.90 SP/SC bulk enq/dequeue (size: 32): 10.90 MP/MC bulk enq/dequeue (size: 32): 18.78 Test OK RTE>> # c11 memory model patch((Non OOO arm64 machine) https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch - ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 197 MP/MC single enq/dequeue: 328 SP/SC burst enq/dequeue (size: 8): 31 MP/MC burst enq/dequeue (size: 8): 50 SP/SC burst enq/dequeue (size: 32): 13 MP/MC burst enq/dequeue (size: 32): 18 ### Testing empty dequeue ### SC empty dequeue: 13.01 MC empty dequeue: 18.02 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 30.95 MP/MC bulk enq/dequeue (size: 8): 50.30 SP/SC bulk enq/dequeue (size: 32): 13.27 MP/MC bulk enq/dequeue (size: 32): 18.11 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 43.38 MP/MC bulk enq/dequeue (size: 8): 64.42 SP/SC bulk enq/dequeue (size: 32): 16.71 MP/MC bulk enq/dequeue (size: 32): 22.21 Platform B: Performance numbers: ==
Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
Hi Ananyev On 11/2/2017 9:26 PM, Ananyev, Konstantin Wrote: Hi Jia, -Original Message- From: Jia He [mailto:hejia...@gmail.com] Sent: Thursday, November 2, 2017 8:44 AM To: jerin.ja...@caviumnetworks.com; dev@dpdk.org; olivier.m...@6wind.com Cc: Ananyev, Konstantin ; Richardson, Bruce ; jianbo@arm.com; hemant.agra...@nxp.com; Jia He ; jie2@hxt-semitech.com; bing.z...@hxt-semitech.com; jia.he@hxt- semitech.com Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing We watched a rte panic of mbuf_autotest in our qualcomm arm64 server. As for the possible race condition, please refer to [1]. Furthermore, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by default it is n; The reason why providing 2 options is due to the performance benchmark difference in different arm machines, please refer to [3]. Already fuctionally tested on the machines as follows: on X86(passed the compilation) on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html --- Changelog: V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com Signed-off-by: jia...@hxt-semitech.com Suggested-by: jerin.ja...@caviumnetworks.com --- lib/librte_ring/Makefile | 4 +++- lib/librte_ring/rte_ring.h | 38 -- lib/librte_ring/rte_ring_arm64.h | 48 ++ lib/librte_ring/rte_ring_generic.h | 45 +++ 4 files changed, 127 insertions(+), 8 deletions(-) create mode 100644 lib/librte_ring/rte_ring_arm64.h create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..fa57a86 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,8 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_arm64.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..943b1f9 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -103,6 +103,18 @@ extern "C" { #include #include +/* In those strong memory models (e.g. x86), there is no need to add rmb() + * between load and load. + * In those weak models(powerpc/arm), there are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier + * It depends on performance test results. */ +#ifdef RTE_ARCH_ARM64 +#include "rte_ring_arm64.h" +#else +#include "rte_ring_generic.h" +#endif + #define RTE_TAILQ_RING_NAME "RTE_RING" enum rte_ring_queue_behavior { @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, while (unlikely(ht->tail != old_val)) rte_pause(); - ht->tail = new_val; + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE); } /** @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, /* Reset n to the initial burst count */ n = max; - *old_head = r->prod.head; + *old_head = arch_rte_atomic_load(&r->prod.head, + __ATOMIC_ACQUIRE); const uint32_t cons_tail = r->cons.tail; The code starts to look a bit messy with all these arch specific macros... So I wonder wouldn't it be more cleaner to: 1. move existing __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail into rte_ring_generic.h 2. Add rte_smp_rmb into generic __rte_ring_move_prod_head/__rte_ring_move_cons_head (as was in v1 of your patch). 3. Introduce ARM specific versions of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail in the rte_ring_arm64.h That way we will keep ogneric code simple and clean, while still allowing arch specific optimizations. Thanks for your review. But as per your suggestion, there will be at least 2 copies of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail. Thus, if there are any bugs in the future, both 2 copies have to be changed, right? /*
Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
Hi Jerin On 11/3/2017 1:23 AM, Jerin Jacob Wrote: -Original Message- Date: Thu, 2 Nov 2017 08:43:30 + From: Jia He To: jerin.ja...@caviumnetworks.com, dev@dpdk.org, olivier.m...@6wind.com Cc: konstantin.anan...@intel.com, bruce.richard...@intel.com, jianbo@arm.com, hemant.agra...@nxp.com, Jia He , jie2@hxt-semitech.com, bing.z...@hxt-semitech.com, jia...@hxt-semitech.com Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing X-Mailer: git-send-email 2.7.4 We watched a rte panic of mbuf_autotest in our qualcomm arm64 server. As for the possible race condition, please refer to [1]. Hi Jia, In addition to Konstantin comments. Please find below some review comments. Furthermore, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by I think, The better name would be CONFIG_RTE_RING_USE_C11_MEM_MODEL or something like that. Ok, but how to distinguish following 2 options? CONFIG_RTE_RING_USE_C11_MEM_MODEL doesn't seem to be enough 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). default it is n; --- Changelog: V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com Signed-off-by: jia...@hxt-semitech.com Suggested-by: jerin.ja...@caviumnetworks.com --- lib/librte_ring/Makefile | 4 +++- lib/librte_ring/rte_ring.h | 38 -- lib/librte_ring/rte_ring_arm64.h | 48 ++ lib/librte_ring/rte_ring_generic.h | 45 +++ 4 files changed, 127 insertions(+), 8 deletions(-) create mode 100644 lib/librte_ring/rte_ring_arm64.h create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..fa57a86 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,8 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_arm64.h \ It is really not specific to arm64. We could rename it to rte_ring_c11_mem.h or something like that to reflect the implementation based on c11 memory model. + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..943b1f9 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -103,6 +103,18 @@ extern "C" { #include #include +/* In those strong memory models (e.g. x86), there is no need to add rmb() + * between load and load. + * In those weak models(powerpc/arm), there are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier + * It depends on performance test results. */ +#ifdef RTE_ARCH_ARM64 s/RTE_ARCH_ARM64/RTE_RING_USE_C11_MEM_MODEL and update the generic arm64 config. By that way it can used by another architecture like ppc if they choose to do so. +#include "rte_ring_arm64.h" +#else +#include "rte_ring_generic.h" +#endif + #define RTE_TAILQ_RING_NAME "RTE_RING" enum rte_ring_queue_behavior { @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, while (unlikely(ht->tail != old_val)) rte_pause(); - ht->tail = new_val; + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE); } /** @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, /* Reset n to the initial burst count */ n = max; - *old_head = r->prod.head; + *old_head = arch_rte_atomic_load(&r->prod.head, + __ATOMIC_ACQUIRE); Same as Konstantin comments, i.e move to this function to c11 memory model header file const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, if (is_sp) r->prod.head = *new_head, success = 1; else - success = rte_atomic32_cmpset(&r->prod.head, - *old_head, *new_head); + success = arch_rte_atomic32_cmpset(&r->prod.head, + old_head, *new_head, +
Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Hi Jerin On 11/2/2017 4:57 PM, Jia He Wrote: Hi, Jerin please see my performance test below On 11/2/2017 3:04 AM, Jerin Jacob Wrote: [...] Should it be like instead? +#else + *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE); + const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE); It would be nice to see how much overhead it gives.ie back to back __ATOMIC_ACQUIRE. I can NOT test ring_perf_autotest in our server because of the something wrong in PMU counter. All the return value of rte_rdtsc is 0 with and without your provided ko module. I am still investigating the reason. Hi Jerin As for the root cause of rte_rdtsc issue, it might be due to the pmu counter frequency is too low in our arm64 server("Amberwing" from qualcom) [586990.057779] arch_timer_get_cntfrq()=2000 Only 20MHz instead of 100M/200MHz, and CNTFRQ_EL0 is not even writable in kernel space. Maybe the code in ring_perf_autotest needs to be changed? e.g. printf("SC empty dequeue: %.2F\n", (double)(sc_end-sc_start) / iterations); printf("MC empty dequeue: %.2F\n", (double)(mc_end-mc_start) / iterations); Otherwise it is always 0 if the time difference divides by iterations. -- Cheers, Jia
Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
Hi Jerin On 11/3/2017 8:56 PM, Jerin Jacob Wrote: -Original Message- [...] g like that. Ok, but how to distinguish following 2 options? No clearly understood this question. For arm64 case, you can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y in config/defconfig_arm64-armv8a-* Sorry for my unclear expressions. I mean there should be one additional config macro besides CONFIG_RTE_RING_USE_C11_MEM_MODEL for users to choose? i.e. - On X86:CONFIG_RTE_RING_USE_C11_MEM_MODEL=n include rte_ring_generic.h, no changes - On arm64,CONFIG_RTE_RING_USE_C11_MEM_MODEL=y include rte_ring_c11_mem.h by default. In rte_ring_c11_mem.h, implement new version of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail Then, how to distinguish the option of using rte_smp_rmb() or __atomic_load/store_n()? Thanks for the clarification. -- Cheers, Jia
Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
On 11/7/2017 12:36 PM, Jerin Jacob Wrote: -Original Message- On option could be to change the prototype of update_tail() and make compiler accommodate it for zero cost for arm64(Which I think, it it the case. But you can check the generated instructions) If not, move, __rte_ring_do_dequeue() and __rte_ring_do_enqueue() instead of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail() ➜ [master][dpdk.org] $ git diff diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7b4..b32648825 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -358,8 +358,12 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); static __rte_always_inline void update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) + uint32_t single, const uint32_t enqueue) { + if (enqueue) + rte_smp_wmb(); + else + rte_smp_rmb(); /* * If there are other enqueues/dequeues in progress that * preceded us, * we need to wait for them to complete @@ -470,9 +474,8 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table, goto end; ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *); - rte_smp_wmb(); - update_tail(&r->prod, prod_head, prod_next, is_sp); + update_tail(&r->prod, prod_head, prod_next, is_sp, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -575,9 +578,8 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table, goto end; DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *); - rte_smp_rmb(); - update_tail(&r->cons, cons_head, cons_next, is_sc); + update_tail(&r->cons, cons_head, cons_next, is_sc, 0); end: if (available != NULL) Hi Jerin, yes I knew this suggestion in update_tail. But what I mean is the rte_smp_rmb() in __rte_ring_move_cons_head and __rte_ring_move_pros_head: [option 1] + *old_head = r->cons.head; + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; [option 2] + *old_head = __atomic_load_n(&r->cons.head, + __ATOMIC_ACQUIRE); + *old_head = r->cons.head; ie.I wonder what is the suitable new config name to distinguish the above 2 options? Thanks for the patience :-) see my drafted patch below, the marcro "PREFER": diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..22fe887 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,305 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 hxt-semitech. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of hxt-semitech nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + /* Don't need wmb/rmb when we prefer to use load_acquire/ + * store_release barrier */ +#ifndef PREFER + if (enqueue) + rte_smp_wmb(); + else + rte_smp_rmb(); +#endif + + /* + * If there are other enqueues/dequeues in progress that preceded us, + * we need to wait for them to complete + */ + if (!single) + w
Re: [dpdk-dev] [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
Hi Jerin Thank you, I mistakenly think x86 doen't need rte_smp_rmb(). Since rte_smp_rmb() only impact x86's compiler optimization, I will simplify the codes as your suggestions Cheers, Jia On 11/7/2017 5:57 PM, Jerin Jacob Wrote: -Original Message- Date: Tue, 7 Nov 2017 16:34:30 +0800 From: Jia He To: Jerin Jacob Cc: dev@dpdk.org, olivier.m...@6wind.com, konstantin.anan...@intel.com, bruce.richard...@intel.com, jianbo@arm.com, hemant.agra...@nxp.com, jie2@hxt-semitech.com, bing.z...@hxt-semitech.com, jia...@hxt-semitech.com Subject: Re: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 On 11/7/2017 12:36 PM, Jerin Jacob Wrote: -Original Message- On option could be to change the prototype of update_tail() and make compiler accommodate it for zero cost for arm64(Which I think, it it the case. But you can check the generated instructions) If not, move, __rte_ring_do_dequeue() and __rte_ring_do_enqueue() instead of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail() ➜ [master][dpdk.org] $ git diff diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7b4..b32648825 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -358,8 +358,12 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); static __rte_always_inline void update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) + uint32_t single, const uint32_t enqueue) { + if (enqueue) + rte_smp_wmb(); + else + rte_smp_rmb(); /* * If there are other enqueues/dequeues in progress that * preceded us, * we need to wait for them to complete @@ -470,9 +474,8 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table, goto end; ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *); - rte_smp_wmb(); - update_tail(&r->prod, prod_head, prod_next, is_sp); + update_tail(&r->prod, prod_head, prod_next, is_sp, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -575,9 +578,8 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table, goto end; DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *); - rte_smp_rmb(); - update_tail(&r->cons, cons_head, cons_next, is_sc); + update_tail(&r->cons, cons_head, cons_next, is_sc, 0); end: if (available != NULL) Hi Jerin, yes I knew this suggestion in update_tail. But what I mean is the rte_smp_rmb() in __rte_ring_move_cons_head and __rte_ring_move_pros_head: [option 1] + *old_head = r->cons.head; + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; [option 2] + *old_head = __atomic_load_n(&r->cons.head, + __ATOMIC_ACQUIRE); + *old_head = r->cons.head; ie.I wonder what is the suitable new config name to distinguish the above 2 options? Why? If you fix the generic version with rte_smp_rmb() then we just need only one config to differentiate between c11 vs generic. See comments below, Thanks for the patience :-) see my drafted patch below, the marcro "PREFER": + */ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + /* Don't need wmb/rmb when we prefer to use load_acquire/ + * store_release barrier */ +#ifndef PREFER + if (enqueue) + rte_smp_wmb(); + else + rte_smp_rmb(); +#endif You can remove PREFER and let the "generic" version has this. For x86, rte_smp_?mb() it will be NOOP. So no issue. + + /* + * If there are other enqueues/dequeues in progress that preceded us, + * we need to wait for them to complete + */ + if (!single) + while (unlikely(ht->tail != old_val)) + rte_pause(); + +#ifdef PREFER + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); for c11 mem model version, it needs only __atomic_store_n version. +#else + ht->tail = new_val; +#endif +} + +/** + * @internal This function updates the producer head for enqueue + * + * @param r + * A pointer to the ring structure + * @param is_sp + * Indicates whether multi-producer path is needed or not + * @param n + * The number of elements we will want to enqueue, i.e. how far should the + * head be moved + * @param behavior + * RTE_RING_QUEUE_FIXED: Enqueue a fixed number of items from a ring + * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring + * @param old_head + * Returns h
[dpdk-dev] [PATCH 2/3] ring: guarantee load ordering of cons/prod when doing enqueue/dequeue
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server. As for the possible race condition, please refer to [1]. Furthermore, to fix this race, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64. The reason why providing 2 options is due to the performance benchmark difference in different arm machines, please refer to [3]. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n We haven't tested the ppc64 case. If anyone verifies it, he can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files. [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html --- Changelog: V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com Signed-off-by: jia...@hxt-semitech.com Suggested-by: jerin.ja...@caviumnetworks.com Suggested-by: konstantin.anan...@intel.com --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 160 +++-- lib/librte_ring/rte_ring_c11_mem.h | 185 + lib/librte_ring/rte_ring_generic.h | 192 +++ 5 files changed, 397 insertions(+), 150 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index ea9b688..3e49458 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..a2682e7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,8 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..53764d9 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,85 +356,19 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueu
[dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..38c3393 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory"); +#define dmb(opt) asm volatile("dmb " #opt : : : "memory"); #define rte_mb() dsb(sy) -- 2.7.4
[dpdk-dev] [PATCH 3/3] config: support C11 memory model for arm64
by default, CONFIG_RTE_RING_USE_C11_MEM_MODEL is y on arm64 Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com --- config/common_armv8a_linuxapp | 1 + 1 file changed, 1 insertion(+) diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp index 6732d1e..1bf6e4d 100644 --- a/config/common_armv8a_linuxapp +++ b/config/common_armv8a_linuxapp @@ -49,3 +49,4 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n CONFIG_RTE_LIBRTE_AVP_PMD=n CONFIG_RTE_SCHED_VECTOR=n +CONFIG_RTE_RING_USE_C11_MEM_MODEL=y -- 2.7.4
[dpdk-dev] [PATCH v4 3/4] ring: introduce new header file to include common functions
move the common part of rte_ring.h into rte_ring_generic.h move the memory barrier part into update_tail() no functional changes here Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev, Konstantin --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 159 +--- lib/librte_ring/rte_ring_generic.h | 194 +++ 4 files changed, 202 insertions(+), 160 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index ea9b688..3e49458 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..c959945 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,7 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 3e8085a..519614c 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,90 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueued. - * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only. - */ -static __rte_always_inline unsigned int -__rte_ring_move_prod_head(struct rte_ring *r, int is_sp, - unsigned int n, enum rte_ring_queue_behavior behavior, - uint32_t *old_head, uint32_t *new_head, - uint32_t *free_entries) -{ - const uint32_t capacity = r->capacity; - unsigned int max = n; - int success; - - do { - /* Reset n to the initial burst count */ - n = max; - - *old_head = r->prod.head; - - /* add rmb barrier to avoid load/load reorder in weak -* memory model. It is noop on x86 */ - rte_smp_rmb(); - - const uint32_t cons_tail = r->cons.tail; - /* -* The subtraction is done between two unsigned 32bits value -* (the result is always modulo 32 bits even if we have -* *old_head > cons_tail). So 'free_entries' is always between 0 -* and capacity (which is < size). -*/ - *free_entries = (capacity + cons_tail - *old_head); - - /* check that we have enough room in ring */ - if (unlikely(n > *free_entries)) - n = (behavior
[dpdk-dev] [PATCH v4 0/4] fix race condition in enqueue/dequeue because of cpu reorder
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server due to a possible race condition. To fix this race, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (4): eal/arm64: remove the braces {} for dmb() and dsb() ring: guarantee load/load order in enqueue and dequeue ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_armv8a_linuxapp | 2 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 161 ++--- lib/librte_ring/rte_ring_c11_mem.h | 185 lib/librte_ring/rte_ring_generic.h | 194 + 7 files changed, 404 insertions(+), 152 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
[dpdk-dev] [PATCH v4 2/4] ring: guarantee load/load order in enqueue and dequeue
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server. In __rte_ring_move_cons_head() ... do { /* Restore n as it may change every loop */ n = max; *old_head = r->cons.head;//1st load const uint32_t prod_tail = r->prod.tail; //2nd load In weak memory order architectures(powerpc,arm), the 2nd load might be reodered before the 1st load, that makes *entries is bigger than we wanted. This nasty reording messed enque/deque up. cpu1(producer) cpu2(consumer) cpu3(consumer) load r->prod.tail in enqueue: load r->cons.tail load r->prod.head store r->prod.tail load r->cons.head load r->prod.tail ... store r->cons.{head,tail} load r->cons.head Then, r->cons.head will be bigger than prod_tail, then make *entries very big and the consumer will go forward incorrectly. After this patch, the old cons.head will be recaculated after failure of rte_atomic32_cmpset There is no such issue on X86, because X86 is strong memory order model. Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com --- lib/librte_ring/rte_ring.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..3e8085a 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -409,6 +409,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, n = max; *old_head = r->prod.head; + + /* add rmb barrier to avoid load/load reorder in weak +* memory model. It is noop on x86 */ + rte_smp_rmb(); + const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -517,6 +522,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, n = max; *old_head = r->cons.head; + + /* add rmb barrier to avoid load/load reorder in weak +* memory model. It is noop on x86 */ + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have -- 2.7.4
[dpdk-dev] [PATCH v4 4/4] ring: introduce new header file to support C11 memory model
To fix the cpu reorder race condition in enque/deque, there are 2 options suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). for the 2nd option, CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64. The reason why providing 2 options is due to the performance benchmark difference in different arm machines, refer to [2]. We haven't tested on ppc64. If anyone verifies it, he can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files. [1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [2] http://dpdk.org/ml/archives/dev/2017-October/080861.html Signed-off-by: Jia He Suggested-by: Jerin Jacob --- config/common_armv8a_linuxapp | 2 + lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 14 ++- lib/librte_ring/rte_ring_c11_mem.h | 185 + 4 files changed, 202 insertions(+), 2 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp index 6732d1e..5b7b2eb 100644 --- a/config/common_armv8a_linuxapp +++ b/config/common_armv8a_linuxapp @@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n CONFIG_RTE_LIBRTE_AVP_PMD=n CONFIG_RTE_SCHED_VECTOR=n + +CONFIG_RTE_RING_USE_C11_MEM_MODEL=y diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index c959945..a2682e7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ - rte_ring_generic.h + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 519614c..3343eba 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -/* Move common functions to generic file */ +/* Between load and load. there might be cpu reorder in weak model + * (powerpc/arm). + * There are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier,defined by + * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y + * It depends on performance test results. + * By default, move common functions to rte_ring_generic.h + */ +#ifdef RTE_RING_USE_C11_MEM_MODEL +#include "rte_ring_c11_mem.h" +#else #include "rte_ring_generic.h" +#endif /** * @internal Enqueue several objects on the ring diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..c167b46 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,185 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 hxt-semitech. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of hxt-semitech nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + RTE_SET_USED(enqueue); + + /* +* If there are other enqueues/dequeues in progress tha
[dpdk-dev] [PATCH v4 1/4] eal/arm64: remove the braces {} for dmb() and dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition") Signed-off-by: Jia He --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..71da29c 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory") +#define dmb(opt) asm volatile("dmb " #opt : : : "memory") #define rte_mb() dsb(sy) -- 2.7.4
Re: [dpdk-dev] [PATCH v4 0/4] fix race condition in enqueue/dequeue because of cpu reorder
Hi Bruce On 11/8/2017 8:15 PM, Bruce Richardson Wrote: On Wed, Nov 08, 2017 at 09:54:37AM +, Jia He wrote: We watched a rte panic of mbuf_autotest in our qualcomm arm64 server due to a possible race condition. To fix this race, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (4): eal/arm64: remove the braces {} for dmb() and dsb() ring: guarantee load/load order in enqueue and dequeue ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model I'm wondering what the merge plans are for this set, given we are now past RC3 in 17.11? As the rings are broken on ARM machines we need to merge in some fix, but I'm a little concerned about the scope of the changes from the 3rd and 4th patches. Would it be acceptable to just merge in patches 1 & 2 in 17.11 and leave the rework and C11 memory model additions in patches 3 & 4 to 18.02 release? As far as I'm concerned, it is ok. Cheers, Jia
Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()
Hi Bruce On 11/8/2017 6:28 PM, Bruce Richardson Wrote: On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote: for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..38c3393 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory"); +#define dmb(opt) asm volatile("dmb " #opt : : : "memory"); Need to remove the trailing ";" I too I think. Alternatively, to keep the braces, the standard practice is to use do { ... } while(0) If trailing ";" is not removed the code: if (condition) rte_smp_rmb(); else anything(); will be like below after precompiling: if (condition) asm volatile("dsb " "ld" : : : "memory");; else anything(); Then, the same error - error: 'else' without a previous 'if' If you choose do/while(0), yes, no errors from compiler. But the checkpatch will report WARNING: Single statement macros should not use a do {} while (0) loop #11: FILE: lib/librte_eal/common/include/arch/arm/rte_atomic_64.h:46: +#define dsb(opt) do { asm volatile("dsb " #opt : : : "memory"); } while (0) I searched the kernel codes, the marco dsb() didn't use the do/while(0). Which one do you think is better for dpdk? Cheers, Jia
Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()
On 11/9/2017 9:22 AM, Jia He Wrote: Hi Bruce On 11/8/2017 6:28 PM, Bruce Richardson Wrote: On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote: for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..38c3393 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory"); +#define dmb(opt) asm volatile("dmb " #opt : : : "memory"); Need to remove the trailing ";" I too I think. Alternatively, to keep the braces, the standard practice is to use do { ... } while(0) If trailing ";" is not removed the code: if (condition) rte_smp_rmb(); else anything(); will be like below after precompiling: if (condition) asm volatile("dsb " "ld" : : : "memory");; else anything(); Then, the same error - error: 'else' without a previous 'if' Ignore my words above,thanks sorry for the inconvenience. And I've sent out v4 in this mail thread. The ";" has been removed. If no more comments, I will send out v5 (2 patch sets for 17 and 18) -- Cheers, Jia -- Cheers, Jia
Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()
Hi Jianbo On 11/9/2017 11:21 AM, Jianbo Liu Wrote: The 11/09/2017 11:14, Jia He wrote: On 11/9/2017 9:22 AM, Jia He Wrote: Hi Bruce On 11/8/2017 6:28 PM, Bruce Richardson Wrote: On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote: for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..38c3393 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory"); +#define dmb(opt) asm volatile("dmb " #opt : : : "memory"); Need to remove the trailing ";" I too I think. Alternatively, to keep the braces, the standard practice is to use do { ... } while(0) If trailing ";" is not removed the code: if (condition) rte_smp_rmb(); else anything(); Sorry, why not use two different functions as your conditions passed in are fixed in the calling functions. Do you mean to split update_tail() into update_tail_enqueue() and update_tail_dequeue()? Cheers, Jia
[dpdk-dev] [PATCH v5 1/1] ring: guarantee load/load order in enqueue and dequeue
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server. In __rte_ring_move_cons_head() ... do { /* Restore n as it may change every loop */ n = max; *old_head = r->cons.head;//1st load const uint32_t prod_tail = r->prod.tail; //2nd load cpu1(producer) cpu2(consumer) cpu3(consumer) load r->prod.tail in enqueue: load r->cons.tail load r->prod.head store r->prod.tail load r->cons.head load r->prod.tail ... store r->cons.{head,tail} load r->cons.head In weak memory order architectures(powerpc,arm), the 2nd load might be reodered before the 1st load, that makes *entries is bigger than we wanted. This nasty reording messed enque/deque up. Then, r->cons.head will be bigger than prod_tail, then make *entries very big and the consumer will go forward incorrectly. After this patch, even with above context switches, the old cons.head will be recaculated after failure of rte_atomic32_cmpset. So no race conditions left. There is no such issue on X86, because X86 is strong memory order model. But rte_smp_rmb() doesn't have impact on runtime performance on X86, so keep the same code without architectures specific concerns. Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com --- lib/librte_ring/rte_ring.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..3e8085a 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -409,6 +409,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, n = max; *old_head = r->prod.head; + + /* add rmb barrier to avoid load/load reorder in weak +* memory model. It is noop on x86 */ + rte_smp_rmb(); + const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -517,6 +522,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, n = max; *old_head = r->cons.head; + + /* add rmb barrier to avoid load/load reorder in weak +* memory model. It is noop on x86 */ + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have -- 2.7.4
[dpdk-dev] [PATCH v5 0/1] fix race condition in enqueue/dequeue because of cpu reorder
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server due to a possible race condition. To fix this race condition, rmb() is needed to add between the 2 loads. Already fuctionally tested on the machines as follows: - on X86 - on arm64 --- Changelog V5: split it into 2 patchset due to the milestone concerns, this is the 1st one V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (4): eal/arm64: remove the braces {} for dmb() and dsb() ring: guarantee load/load order in enqueue and dequeue ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_armv8a_linuxapp | 2 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 161 ++--- lib/librte_ring/rte_ring_c11_mem.h | 185 lib/librte_ring/rte_ring_generic.h | 194 + 7 files changed, 404 insertions(+), 152 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
Re: [dpdk-dev] [PATCH 1/3] eal/arm64: remove the braces {} for dmb(), dsb()
On 11/9/2017 5:38 PM, Ananyev, Konstantin Wrote: -Original Message- From: Jianbo Liu [mailto:jianbo@arm.com] Sent: Thursday, November 9, 2017 4:56 AM To: Jia He Cc: Richardson, Bruce ; jerin.ja...@caviumnetworks.com; dev@dpdk.org; olivier.m...@6wind.com; Ananyev, Konstantin ; hemant.agra...@nxp.com; jia...@hxt-semitech.com Subject: Re: [PATCH 1/3] eal/arm64: remove the braces {} for dmb(),dsb() The 11/09/2017 12:43, Jia He wrote: Hi Jianbo On 11/9/2017 11:21 AM, Jianbo Liu Wrote: The 11/09/2017 11:14, Jia He wrote: On 11/9/2017 9:22 AM, Jia He Wrote: Hi Bruce On 11/8/2017 6:28 PM, Bruce Richardson Wrote: On Wed, Nov 08, 2017 at 06:17:10AM +, Jia He wrote: for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Signed-off-by: Jia He Signed-off-by: jia...@hxt-semitech.com --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..38c3393 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory"); +#define dmb(opt) asm volatile("dmb " #opt : : : "memory"); Need to remove the trailing ";" I too I think. Alternatively, to keep the braces, the standard practice is to use do { ... } while(0) If trailing ";" is not removed the code: if (condition) rte_smp_rmb(); else anything(); Sorry, why not use two different functions as your conditions passed in are fixed in the calling functions. Do you mean to split update_tail() into update_tail_enqueue() and update_tail_dequeue()? Yes. So it's not need to change dsb/dmb. That's a good idea - but you still might hit the same problem in Some different place in future... Why not to convert these macros into 'always_inline' functions then? Konstantin It makes things more complex opt needs to be redefined with types such as : __attribute__((always_inline)) void dsb( char* opt) and the input paramenter shoud be #define sy "sy" #define ld "ld" And the "#" in asm codes needs to be considerred more. IMO, the kernel way is simple and clean, isn't it? #define dmb(opt) asm volatile("dmb " #opt : : : "memory") Another choice is adding the do/while. @Ananyev @Jianbo Any thoughts? -- Cheers, Jia
[dpdk-dev] [PATCH v6] guarantee load/load order in enqueue and dequeue
From: Jia He We watched a rte panic of mbuf_autotest in our qualcomm arm64 server (Amberwing) due to a possible race condition. To fix this race condition, rmb() is needed to add between the 2 loads. Already fuctionally tested on the machines as follows: - on X86 - on arm64 --- Changelog V6: improve the text description and fix the checkpatch warnings V5: split it into 2 patchset due to the milestone concerns, this is the 1st one V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (1): ring: guarantee load/load order in enqueue and dequeue lib/librte_ring/rte_ring.h | 12 1 file changed, 12 insertions(+) -- 2.7.4
[dpdk-dev] [PATCH v6] ring: guarantee load/load order in enqueue and dequeue
We watched a rte panic of mbuf_autotest in our qualcomm arm64 server (Amberwing). Root cause: In __rte_ring_move_cons_head() ... do { /* Restore n as it may change every loop */ n = max; *old_head = r->cons.head;//1st load const uint32_t prod_tail = r->prod.tail; //2nd load In weak memory order architectures(powerpc,arm), the 2nd load might be reodered before the 1st load, that makes *entries is bigger than we wanted. This nasty reording messed enque/deque up. cpu1(producer) cpu2(consumer) cpu3(consumer) load r->prod.tail in enqueue: load r->cons.tail load r->prod.head store r->prod.tail load r->cons.head load r->prod.tail ... store r->cons.{head,tail} load r->cons.head Then, r->cons.head will be bigger than prod_tail, then make *entries very big and the consumer will go forward incorrectly. After this patch, the old cons.head will be recaculated after failure of rte_atomic32_cmpset There is no such issue on X86, because X86 is strong memory order model. But rte_smp_rmb() doesn't have impact on runtime performance on X86, so keep the same code without architectures specific concerns. Signed-off-by: Jia He Signed-off-by: jie2@hxt-semitech.com Signed-off-by: bing.z...@hxt-semitech.com Acked-by: Jerin Jacob Acked-by: Jianbo Liu Cc: sta...@dpdk.org --- lib/librte_ring/rte_ring.h | 12 1 file changed, 12 insertions(+) diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 5e9b3b7..e924438 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -409,6 +409,12 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, n = max; *old_head = r->prod.head; + + /* add rmb barrier to avoid load/load reorder in weak +* memory model. It is noop on x86 +*/ + rte_smp_rmb(); + const uint32_t cons_tail = r->cons.tail; /* * The subtraction is done between two unsigned 32bits value @@ -517,6 +523,12 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, n = max; *old_head = r->cons.head; + + /* add rmb barrier to avoid load/load reorder in weak +* memory model. It is noop on x86 +*/ + rte_smp_rmb(); + const uint32_t prod_tail = r->prod.tail; /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have -- 2.7.4
[dpdk-dev] [PATCH v5 0/3] support c11 memory model barrier in librte_ring
From: Jia He To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V5: split it into 2 patchset due to the milestone concerns, this is the 2st one. Also fix checkpatch.pl warnings V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads *** BLURB HERE *** Jia He (3): eal/arm64: remove the braces {} for dmb() and dsb() ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_armv8a_linuxapp | 2 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 173 ++ lib/librte_ring/rte_ring_c11_mem.h | 186 lib/librte_ring/rte_ring_generic.h | 195 + 7 files changed, 406 insertions(+), 164 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
[dpdk-dev] [PATCH v5 2/3] ring: introduce new header file to include common functions
From: Jia He move the common part of rte_ring.h into rte_ring_generic.h move the memory barrier part into update_tail() no functional changes here Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev, Konstantin --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 161 + lib/librte_ring/rte_ring_generic.h | 195 +++ 4 files changed, 203 insertions(+), 162 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index ea9b688..3e49458 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..c959945 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,7 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index e924438..519614c 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueued. - * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only. - */ -static __rte_always_inline unsigned int -__rte_ring_move_prod_head(struct rte_ring *r, int is_sp, - unsigned int n, enum rte_ring_queue_behavior behavior, - uint32_t *old_head, uint32_t *new_head, - uint32_t *free_entries) -{ - const uint32_t capacity = r->capacity; - unsigned int max = n; - int success; - - do { - /* Reset n to the initial burst count */ - n = max; - - *old_head = r->prod.head; - - /* add rmb barrier to avoid load/load reorder in weak -* memory model. It is noop on x86 -*/ - rte_smp_rmb(); - - const uint32_t cons_tail = r->cons.tail; - /* -* The subtraction is done between two unsigned 32bits value -* (the result is always modulo 32 bits even if we have -* *old_head > cons_tail). So 'free_entries' is always between 0 -* and capacity (which is < size). -*/ - *free_entries = (capacity + cons_tail - *old_head); - - /* check that we have enough room in ring */ - if (unlikely(
[dpdk-dev] [PATCH v5 1/3] eal/arm64: remove the braces {} for dmb() and dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition") Signed-off-by: Jia He --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..71da29c 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory") +#define dmb(opt) asm volatile("dmb " #opt : : : "memory") #define rte_mb() dsb(sy) -- 2.7.4
[dpdk-dev] [PATCH v6 3/3] ring: introduce new header file to support C11 memory model
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines, refer to [2]. We haven't tested on ppc64. If anyone verifies it, he can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files. [1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [2] http://dpdk.org/ml/archives/dev/2017-October/080861.html Signed-off-by: Jia He Suggested-by: Jerin Jacob --- config/common_armv8a_linuxapp | 2 + lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 14 ++- lib/librte_ring/rte_ring_c11_mem.h | 186 + 4 files changed, 203 insertions(+), 2 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp index 6732d1e..5b7b2eb 100644 --- a/config/common_armv8a_linuxapp +++ b/config/common_armv8a_linuxapp @@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n CONFIG_RTE_LIBRTE_AVP_PMD=n CONFIG_RTE_SCHED_VECTOR=n + +CONFIG_RTE_RING_USE_C11_MEM_MODEL=y diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index c959945..a2682e7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ - rte_ring_generic.h + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 519614c..3343eba 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -/* Move common functions to generic file */ +/* Between load and load. there might be cpu reorder in weak model + * (powerpc/arm). + * There are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier,defined by + * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y + * It depends on performance test results. + * By default, move common functions to rte_ring_generic.h + */ +#ifdef RTE_RING_USE_C11_MEM_MODEL +#include "rte_ring_c11_mem.h" +#else #include "rte_ring_generic.h" +#endif /** * @internal Enqueue several objects on the ring diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..ca21e95 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,186 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 hxt-semitech. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of hxt-semitech nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + RTE_SET_USED(enqueue); + + /* +* If there are other enqueues/dequeues in progress that preceded us, +* we need to wai
Re: [dpdk-dev] [PATCH v4 0/4] fix race condition in enqueue/dequeue because of cpu reorder
Hi Bruce I knew little about DPDK's milestone I read some hints from http://dpdk.org/dev/roadmap 18.02 * Proposal deadline: November 24, 2017 * Integration deadline: January 5, 2018 I wonder whether I need to resend the patchset(the 2nd part) again after November 24? Thanks for any suggestions Cheers, Jia On 11/8/2017 11:11 PM, Jia He Wrote: Hi Bruce On 11/8/2017 8:15 PM, Bruce Richardson Wrote: On Wed, Nov 08, 2017 at 09:54:37AM +0000, Jia He wrote: We watched a rte panic of mbuf_autotest in our qualcomm arm64 server due to a possible race condition. To fix this race, there are 2 options as suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [2]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (4): eal/arm64: remove the braces {} for dmb() and dsb() ring: guarantee load/load order in enqueue and dequeue ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model I'm wondering what the merge plans are for this set, given we are now past RC3 in 17.11? As the rings are broken on ARM machines we need to merge in some fix, but I'm a little concerned about the scope of the changes from the 3rd and 4th patches. Would it be acceptable to just merge in patches 1 & 2 in 17.11 and leave the rework and C11 memory model additions in patches 3 & 4 to 18.02 release? As far as I'm concerned, it is ok. Cheers, Jia
[dpdk-dev] [PATCH V6 0/3] support c11 memory model barrier in librte_ring
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V6: minor change in subject and log V5: split it into 2 patchset due to the milestone concerns, this is the 2st one. Also fix checkpatch.pl warnings V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (3): eal/arm64: remove the braces {} for dmb(),dsb() ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_armv8a_linuxapp | 2 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 173 ++ lib/librte_ring/rte_ring_generic.h | 195 + 6 files changed, 220 insertions(+), 164 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
[dpdk-dev] [PATCH V6 1/3] eal/arm64: remove the braces {} for dmb() and dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Fixes: 84733fd0d75e ("eal/arm: fix memory barrier definition") Cc: sta...@dpdk.org Signed-off-by: Jia He --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..71da29c 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory") +#define dmb(opt) asm volatile("dmb " #opt : : : "memory") #define rte_mb() dsb(sy) -- 2.7.4
[dpdk-dev] [PATCH V6 2/3] ring: introduce new header file to include common functions
move the common part of rte_ring.h into rte_ring_generic.h. move the memory barrier part into update_tail(). no functional changes here. Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev, Konstantin --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 161 + lib/librte_ring/rte_ring_generic.h | 195 +++ 4 files changed, 203 insertions(+), 162 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index ea9b688..3e49458 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..c959945 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,7 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index e924438..519614c 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueued. - * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only. - */ -static __rte_always_inline unsigned int -__rte_ring_move_prod_head(struct rte_ring *r, int is_sp, - unsigned int n, enum rte_ring_queue_behavior behavior, - uint32_t *old_head, uint32_t *new_head, - uint32_t *free_entries) -{ - const uint32_t capacity = r->capacity; - unsigned int max = n; - int success; - - do { - /* Reset n to the initial burst count */ - n = max; - - *old_head = r->prod.head; - - /* add rmb barrier to avoid load/load reorder in weak -* memory model. It is noop on x86 -*/ - rte_smp_rmb(); - - const uint32_t cons_tail = r->cons.tail; - /* -* The subtraction is done between two unsigned 32bits value -* (the result is always modulo 32 bits even if we have -* *old_head > cons_tail). So 'free_entries' is always between 0 -* and capacity (which is < size). -*/ - *free_entries = (capacity + cons_tail - *old_head); - - /* check that we have enough room in ring */ - if (unlikely(n > *free_entries)) -
[dpdk-dev] [PATCH V6 3/3] ring: introduce new header file to support C11 memory model
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines, refer to [2]. We haven't tested on ppc64. If anyone verifies it, he can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files. [1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [2] http://dpdk.org/ml/archives/dev/2017-October/080861.html Signed-off-by: Jia He Suggested-by: Jerin Jacob --- config/common_armv8a_linuxapp | 2 + lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 14 ++- lib/librte_ring/rte_ring_c11_mem.h | 186 + 4 files changed, 203 insertions(+), 2 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp index 6732d1e..5b7b2eb 100644 --- a/config/common_armv8a_linuxapp +++ b/config/common_armv8a_linuxapp @@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n CONFIG_RTE_LIBRTE_AVP_PMD=n CONFIG_RTE_SCHED_VECTOR=n + +CONFIG_RTE_RING_USE_C11_MEM_MODEL=y diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index c959945..a2682e7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ - rte_ring_generic.h + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 519614c..3343eba 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -/* Move common functions to generic file */ +/* Between load and load. there might be cpu reorder in weak model + * (powerpc/arm). + * There are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier,defined by + * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y + * It depends on performance test results. + * By default, move common functions to rte_ring_generic.h + */ +#ifdef RTE_RING_USE_C11_MEM_MODEL +#include "rte_ring_c11_mem.h" +#else #include "rte_ring_generic.h" +#endif /** * @internal Enqueue several objects on the ring diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..ca21e95 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,186 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 hxt-semitech. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of hxt-semitech nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + RTE_SET_USED(enqueue); + + /* +* If there are other enqueues/dequeues in progress that preceded us, +* we need to wai
[dpdk-dev] [PATCH V7 1/3] eal/arm64: remove the braces {} for dmb() and dsb()
for the code as follows: if (condition) rte_smp_rmb(); else rte_smp_wmb(); Without this patch, compiler will report this error: error: 'else' without a previous 'if' Fixes: 84733fd0d75e ("eal/arm64: fix memory barrier definition") Cc: sta...@dpdk.org Signed-off-by: Jia He Acked-by: Jerin Jacob --- lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h index 0b70d62..71da29c 100644 --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h @@ -43,8 +43,8 @@ extern "C" { #include "generic/rte_atomic.h" -#define dsb(opt) { asm volatile("dsb " #opt : : : "memory"); } -#define dmb(opt) { asm volatile("dmb " #opt : : : "memory"); } +#define dsb(opt) asm volatile("dsb " #opt : : : "memory") +#define dmb(opt) asm volatile("dmb " #opt : : : "memory") #define rte_mb() dsb(sy) -- 2.7.4
[dpdk-dev] [PATCH V7 0/3] support c11 memory model barrier in librte_ring
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines. Already fuctionally tested on the machines as follows: - on X86 - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=y - on arm64 with CONFIG_RTE_RING_USE_C11_MEM_MODEL=n --- Changelog: V7: fix check-git-log warnings which is suggested by Jerin V6: minor change in subject and log V5: split it into 2 patchset due to the milestone concerns, this is the 2st one. Also fix checkpatch.pl warnings V4: split into small patches V3: arch specific implementation for enqueue/dequeue barrier V2: let users choose whether using load_acquire/store_release V1: rte_smp_rmb() between 2 loads Jia He (3): eal/arm64: remove the braces {} for dmb() and dsb() ring: introduce new header file to include common functions ring: introduce new header file to support C11 memory model config/common_armv8a_linuxapp | 2 + .../common/include/arch/arm/rte_atomic_64.h| 4 +- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 4 +- lib/librte_ring/rte_ring.h | 173 ++ lib/librte_ring/rte_ring_c11_mem.h | 186 lib/librte_ring/rte_ring_generic.h | 195 + 7 files changed, 406 insertions(+), 164 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h create mode 100644 lib/librte_ring/rte_ring_generic.h -- 2.7.4
[dpdk-dev] [PATCH V7 3/3] ring: introduce new header file to support C11 memory model
To support C11 memory model barrier, 2 options are suggested by Jerin: 1. use rte_smp_rmb 2. use load_acquire/store_release(refer to [1]). CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "y" only on arm64 so far. The reason why providing 2 options is due to the performance benchmark difference in different arm machines, refer to [2]. We haven't tested on ppc64. If anyone verifies it, he can add CONFIG_RTE_RING_USE_C11_MEM_MODEL=y to ppc64 config files. [1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170 [2] http://dpdk.org/ml/archives/dev/2017-October/080861.html Signed-off-by: Jia He Suggested-by: Jerin Jacob Acked-by: Jerin Jacob --- config/common_armv8a_linuxapp | 2 + lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 14 ++- lib/librte_ring/rte_ring_c11_mem.h | 186 + 4 files changed, 203 insertions(+), 2 deletions(-) create mode 100644 lib/librte_ring/rte_ring_c11_mem.h diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp index 6732d1e..5b7b2eb 100644 --- a/config/common_armv8a_linuxapp +++ b/config/common_armv8a_linuxapp @@ -49,3 +49,5 @@ CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n CONFIG_RTE_LIBRTE_AVP_PMD=n CONFIG_RTE_SCHED_VECTOR=n + +CONFIG_RTE_RING_USE_C11_MEM_MODEL=y diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index c959945..a2682e7 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -46,6 +46,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ - rte_ring_generic.h + rte_ring_generic.h \ + rte_ring_c11_mem.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 519614c..3343eba 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,8 +356,20 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -/* Move common functions to generic file */ +/* Between load and load. there might be cpu reorder in weak model + * (powerpc/arm). + * There are 2 choices for the users + * 1.use rmb() memory barrier + * 2.use one-direcion load_acquire/store_release barrier,defined by + * CONFIG_RTE_RING_USE_C11_MEM_MODEL=y + * It depends on performance test results. + * By default, move common functions to rte_ring_generic.h + */ +#ifdef RTE_RING_USE_C11_MEM_MODEL +#include "rte_ring_c11_mem.h" +#else #include "rte_ring_generic.h" +#endif /** * @internal Enqueue several objects on the ring diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h new file mode 100644 index 000..ca21e95 --- /dev/null +++ b/lib/librte_ring/rte_ring_c11_mem.h @@ -0,0 +1,186 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 hxt-semitech. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of hxt-semitech nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _RTE_RING_C11_MEM_H_ +#define _RTE_RING_C11_MEM_H_ + +static __rte_always_inline void +update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, + uint32_t single, uint32_t enqueue) +{ + RTE_SET_USED(enqueue); + + /* +* If there are other enqueues/dequeues in progress that preceded us, +
[dpdk-dev] [PATCH V7 2/3] ring: introduce new header file to include common functions
move the common part of rte_ring.h into rte_ring_generic.h. move the memory barrier part into update_tail(). no functional changes here. Signed-off-by: Jia He Suggested-by: Jerin Jacob Suggested-by: Ananyev Konstantin Acked-by: Jerin Jacob --- lib/librte_eventdev/rte_event_ring.h | 6 +- lib/librte_ring/Makefile | 3 +- lib/librte_ring/rte_ring.h | 161 + lib/librte_ring/rte_ring_generic.h | 195 +++ 4 files changed, 203 insertions(+), 162 deletions(-) create mode 100644 lib/librte_ring/rte_ring_generic.h diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h index ea9b688..3e49458 100644 --- a/lib/librte_eventdev/rte_event_ring.h +++ b/lib/librte_eventdev/rte_event_ring.h @@ -126,9 +126,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r, goto end; ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event); - rte_smp_wmb(); - update_tail(&r->r.prod, prod_head, prod_next, 1); + update_tail(&r->r.prod, prod_head, prod_next, 1, 1); end: if (free_space != NULL) *free_space = free_entries - n; @@ -168,9 +167,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r, goto end; DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event); - rte_smp_rmb(); - update_tail(&r->r.cons, cons_head, cons_next, 1); + update_tail(&r->r.cons, cons_head, cons_next, 1, 0); end: if (available != NULL) diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile index e34d9d9..c959945 100644 --- a/lib/librte_ring/Makefile +++ b/lib/librte_ring/Makefile @@ -45,6 +45,7 @@ LIBABIVER := 1 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c # install includes -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \ + rte_ring_generic.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index e924438..519614c 100644 --- a/lib/librte_ring/rte_ring.h +++ b/lib/librte_ring/rte_ring.h @@ -356,91 +356,8 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r); } \ } while (0) -static __rte_always_inline void -update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, - uint32_t single) -{ - /* -* If there are other enqueues/dequeues in progress that preceded us, -* we need to wait for them to complete -*/ - if (!single) - while (unlikely(ht->tail != old_val)) - rte_pause(); - - ht->tail = new_val; -} - -/** - * @internal This function updates the producer head for enqueue - * - * @param r - * A pointer to the ring structure - * @param is_sp - * Indicates whether multi-producer path is needed or not - * @param n - * The number of elements we will want to enqueue, i.e. how far should the - * head be moved - * @param behavior - * RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring - * RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring - * @param old_head - * Returns head value as it was before the move, i.e. where enqueue starts - * @param new_head - * Returns the current/new head value i.e. where enqueue finishes - * @param free_entries - * Returns the amount of free space in the ring BEFORE head was moved - * @return - * Actual number of objects enqueued. - * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only. - */ -static __rte_always_inline unsigned int -__rte_ring_move_prod_head(struct rte_ring *r, int is_sp, - unsigned int n, enum rte_ring_queue_behavior behavior, - uint32_t *old_head, uint32_t *new_head, - uint32_t *free_entries) -{ - const uint32_t capacity = r->capacity; - unsigned int max = n; - int success; - - do { - /* Reset n to the initial burst count */ - n = max; - - *old_head = r->prod.head; - - /* add rmb barrier to avoid load/load reorder in weak -* memory model. It is noop on x86 -*/ - rte_smp_rmb(); - - const uint32_t cons_tail = r->cons.tail; - /* -* The subtraction is done between two unsigned 32bits value -* (the result is always modulo 32 bits even if we have -* *old_head > cons_tail). So 'free_entries' is always between 0 -* and capacity (which is < size). -*/ - *free_entries = (capacity + cons_tail - *old_head); - - /* check that we have enough room in ring */ - if (unlikely(
[dpdk-dev] About pmu cycle counter usage in armv8
Hi Jerin In [1], I met a pmu cycle counter problem (all return value is 0) occasionally. And then I submited a patch to kernel maillist, but was rejected by maintainer at last [2]. He said: "We only intend for the in-kernel perf infrastructure to access pmccntr_el0; nothing else should touch it." So maybe it is not proper for dpdk to use pmu cycle counter? [1] http://dpdk.org/ml/archives/dev/2017-November/080998.html [2]https://lkml.org/lkml/2017/11/16/22 -- Cheers, Jia
Re: [dpdk-dev] About pmu cycle counter usage in armv8
Hi Jerin Ok And I wonder why you haven't met such problem (all rd_tsc() is 0)in your test platform? Did you use an old kernel (older than v4.5-rc1)? root@aw-host:~/linux# git describe da4e4f18afe0 v4.5-rc1-8-gda4e4f1 Maybe you need to give a warning that, the usage of High-resolution cycle counter is not correct if the kernel version is newer than v4.5-rc1-8-gda4e4f1? Cheers, Jia On 12/11/2017 1:59 PM, Jerin Jacob Wrote: -Original Message- Date: Mon, 11 Dec 2017 13:38:25 +0800 From: Jia He To: Jerin Jacob , "dev@dpdk.org" Subject: About pmu cycle counter usage in armv8 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 Hi Jerin In [1], I met a pmu cycle counter problem (all return value is 0) occasionally. And then I submited a patch to kernel maillist, but was rejected by maintainer at last [2]. He said: "We only intend for the in-kernel perf infrastructure to access pmccntr_el0; nothing else should touch it." Yes. That's the reason why 1) A warning added in documentation. http://dpdk.org/doc/guides/prog_guide/profile_app.html See at last " The PMU based scheme is useful for high accuracy performance profiling with rte_rdtsc(). However, this method can not be used in conjunction with Linux userspace profiling tools like perf as this scheme alters the PMU registers state. " 2) By default it is disabled and not need for production systems. Needed only for performance debugging. So maybe it is not proper for dpdk to use pmu cycle counter? But, There is no alternative in arm64 to get high resolution counter in user space(in performance effective way) [1] http://dpdk.org/ml/archives/dev/2017-November/080998.html [2]https://lkml.org/lkml/2017/11/16/22 -- Cheers, Jia