-----Original Message-----
> Date: Sat, 6 Oct 2018 19:44:35 +0000
> From: Ola Liljedahl <ola.liljed...@arm.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>, "dev@dpdk.org" <dev@dpdk.org>
> CC: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.10.0.180812
>
> On 06/10/2018, 09:42, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:
>
> -----Original Message-----
> > Date: Fri, 5 Oct 2018 20:34:15 +0000
> > From: Ola Liljedahl <ola.liljed...@arm.com>
> > To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>, Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > CC: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" <gavin...@arm.com>, "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > user-agent: Microsoft-MacOutlook/10.10.0.180812
> >
> > External Email
> >
> > On 05/10/2018, 22:29, "Honnappa Nagarahalli" <honnappa.nagaraha...@arm.com> wrote:
> >
> > > I doubt it is possible to benchmark with such precision so as to see the potential difference of one ADD instruction.
> > > Just changes in function alignment can affect performance by percents. And the natural variation when not using a 100% deterministic system is going to be a lot larger than one cycle per ring buffer operation.
> > >
> > > Some of the other patches are also for correctness (e.g. load-acquire of tail).
> >
> > The discussion is about this patch alone. Other patches are already Acked.
>
> So the benchmarking then makes zero sense.
>
> Why?
>
> Because the noise in benchmarking (due to e.g. non-deterministic systems, potential changes in function and loop alignment) will be much larger than one cycle of additional overhead per ring buffer operation. What will benchmarking tell us about the performance impact of this change that adds one ALU operation?
Yes, there will be noise in benchmarking. That is the reason why I checked only the generated assembly code of this patch, and found the LDR vs. LDR + ADD case. How much overhead it adds depends entirely on the microarchitecture: how many issue ports it has, constraints on executing LD/ST on specific issue ports, etc. In any case, plain LDR will be better than LDR + ADD on any microarchitecture.

> > > so while performance measurements may be interesting, we can't skip a bug fix just because it proves to decrease performance.
> >
> > IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.
>
> It's a fix for correctness. Per C++11 (and probably C11 as well, due to the shared memory model), we have undefined behaviour here. If the compiler detects UB, it is allowed to do anything. Current compilers might not exploit this, but future compilers could.
>
> > All I am saying is this: the code is not the same, and the compiler (the very latest gcc 8.2) is not smart enough to understand that it is dead code.
>
> What code is dead? The ADD instruction has a purpose: it is adding an offset (from the ring buffer start to the tail field) to a base pointer. It is merely (most likely) not the optimal code sequence for any ARM processor.
>
> > I think the moment any gcc __builtin comes in, the compiler adds a predefined template which has the additional "add" instruction.
>
> I suspect the add instruction is there because this atomic_load operates on a struct member at a non-zero offset and GCC's "template(s)" for atomic operations don't support register + immediate offset (because basically in the AArch64/A64 ISA, all atomic operations except atomic_load(RELAXED) and atomic_store(RELAXED) only support the base-register-without-offset addressing mode). I would be surprised if this minor instance of non-optimal code generation couldn't be corrected in the compiler.
>
> > I think in this specific case we ALL know that:
> > a) ht->tail will be 32 bit for the whole life of DPDK, and a 32-bit load will be atomic on all DPDK-supported processors;
> > b) the rte_pause() below it never permits compiler reordering, etc.
>
> For 32-bit ARM and 64-bit POWER (ppc64), the rte_pause() implementation looks like this (DPDK 18.08-rc0):
>
> static inline void rte_pause(void)
> {
> }
>
> How does calling this function prevent compiler optimisations, e.g. of the loop or of surrounding memory accesses? Is rte_pause() supposed to behave like some kind of compiler barrier? I can't see anything in the DPDK documentation for rte_pause() that claims this.

How about fixing rte_pause() then? Meaning: issue power-saving instructions on the archs where they are missing.

> > so why lose one cycle in the worst case? It is easy to lose one cycle and very difficult to get one back in the fast path.
>
> I suggest you read up on why undefined behaviour is bad in C. Have a chat with Andrew Pinski.
>
> Not depending on the compiler memory barrier in rte_pause() would allow the compiler to make optimisations (e.g. schedule loads earlier) that actually increase performance. Since the atomic load of ht->tail here has relaxed MO, the compiler is allowed to hoist later loads (and stores) ahead of it (and also to push down and/or merge with stores after the load of ht->tail). But the compiler (memory) barrier in (what is supposedly part of) rte_pause() prevents such optimisations (well, some accesses could be pushed down between the atomic load and the compiler barrier, but not further than that).
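To make the "fix rte_pause()" idea concrete, here is a minimal sketch of what per-arch implementations could look like. The suffixed function names are invented for illustration and this is not the actual DPDK source; the point is only that a spin-wait hint with a "memory" clobber would give both the power saving and the compiler-barrier behaviour discussed above.

    /* Sketch of per-arch rte_pause() variants; the _generic/_arch names
     * are illustrative assumptions, not actual DPDK identifiers. */

    /* Generic fallback, as quoted above from DPDK 18.08-rc0: an empty
     * function - neither a power-saving hint nor a compiler barrier. */
    static inline void rte_pause_generic(void)
    {
    }

    #if defined(__aarch64__)
    /* AArch64: YIELD is a spin-wait hint; the "memory" clobber also makes
     * this a compiler barrier, so a wait loop around it cannot be
     * optimised away even with a plain (non-atomic) load of ht->tail. */
    static inline void rte_pause_arch(void)
    {
            asm volatile("yield" ::: "memory");
    }
    #elif defined(__x86_64__) || defined(__i386__)
    #include <emmintrin.h>
    /* x86: the PAUSE instruction reduces power consumption and avoids
     * memory-order mis-speculation penalties in spin-wait loops. */
    static inline void rte_pause_arch(void)
    {
            _mm_pause();
    }
    #endif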
> A C compiler that supports C11 and beyond implements the C11 memory model. The compiler understands the memory model and can optimise memory accesses according to the semantics of the model and the ordering directives in the code. Atomic operations using ATOMIC_SEQ_CST, ATOMIC_ACQUIRE, ATOMIC_RELEASE (and ATOMIC_ACQ_REL; ignoring ATOMIC_CONSUME here) each allow and disallow certain kinds of movements and other optimisations of memory accesses (loads and stores; I assume prefetches are also included). Atomic operations with ATOMIC_RELAXED don't impose any ordering constraints, so they give maximum flexibility to the compiler. Using a compiler memory barrier (e.g. asm volatile ("":::"memory")) is a much more brutal way of constraining the compiler.

In the arm64 case, it will have __ATOMIC_RELAXED followed by the asm volatile ("":::"memory") of rte_pause(). I wouldn't have any issue if the generated code were the same as or better than the existing case, but that is not the case, right?

> > > -- Ola
> > >
> > > On 05/10/2018, 22:06, "Honnappa Nagarahalli" <honnappa.nagaraha...@arm.com> wrote:
> > >
> > > Hi Jerin,
> > > Thank you for generating the disassembly, that is really helpful. I agree with you that we have the option of moving parts 2 and 3 forward. I will let Gavin take the decision.
> > >
> > > I suggest that we run benchmarks on this patch alone and in combination with the other patches in the series. We have a few Arm machines and we will run on all of them along with x86. We can take a decision based on that.
> > >
> > > Would that be a way to move forward? I think this should address both your and Ola's concerns.
> > >
> > > I am open to other suggestions as well.
> > >
> > > Thank you,
> > > Honnappa
> > >
> > > > So you don't want to write the proper C11 code because the compiler generates one extra instruction that way?
> > > > You don't even know if that one extra instruction has any measurable impact on performance. E.g. it could be issued the cycle before, together with other instructions.
> > > >
> > > > We can complain to the compiler writers that the code generation for __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on ARM/A64). I think the problem is that the __atomic builtins only accept a base address without any offset, and this is possibly because e.g. load/store exclusive (LDX/STX), load-acquire (LDAR) and store-release (STLR) only accept a base register with no offset. So any offset has to be added before the actual "atomic" instruction, LDR in this case.
> > > > -- Ola
> > > >
> > > > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:
> > > >
> > > > -----Original Message-----
> > > > > Date: Fri, 5 Oct 2018 15:11:44 +0000
> > > > > From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > To: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, Ola Liljedahl <ola.liljed...@arm.com>, "Gavin Hu (Arm Technology China)" <gavin...@arm.com>, Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > > > > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> > > > > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> > > > >
> > > > > > > > Hi Jerin,
> > > > > > > >
> > > > > > > > Thanks for your review; inline comments from our internal discussions.
> > > > > > > >
> > > > > > > > BR. Gavin
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > > > > > > > > Sent: Saturday, September 29, 2018 6:49 PM
> > > > > > > > > To: Gavin Hu (Arm Technology China) <gavin...@arm.com>
> > > > > > > > > Cc: dev@dpdk.org; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Steve Capper <steve.cap...@arm.com>; Ola Liljedahl <ola.liljed...@arm.com>; nd <n...@arm.com>; sta...@dpdk.org
> > > > > > > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > > > > > > > > > From: Gavin Hu <gavin...@arm.com>
> > > > > > > > > > To: dev@dpdk.org
> > > > > > > > > > CC: gavin...@arm.com, honnappa.nagaraha...@arm.com, steve.cap...@arm.com, ola.liljed...@arm.com, jerin.ja...@caviumnetworks.com, n...@arm.com, sta...@dpdk.org
> > > > > > > > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > > > > > X-Mailer: git-send-email 2.7.4
> > > > > > > > > >
> > > > > > > > > > External Email
> > > > > > > > > >
> > > > > > > > > > In update_tail, read ht->tail using __atomic_load. Although the compiler currently seems to be doing the right thing even without __atomic_load, we don't want to give the compiler the freedom to optimise what should be an atomic load; it should not be arbitrarily moved around.
> > > > > > > > > >
> > > > > > > > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > > > > > > > > > Cc: sta...@dpdk.org
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Gavin Hu <gavin...@arm.com>
> > > > > > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > > > > > > Reviewed-by: Steve Capper <steve.cap...@arm.com>
> > > > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
> > > > > > > > > > ---
> > > > > > > > > > lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> > > > > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > The read of ht->tail needs to be atomic; a non-atomic read would not be correct.
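For reference, the change under discussion boils down to the wait loop in update_tail() in rte_ring_c11_mem.h. Below is a condensed sketch reconstructed from the source lines visible in the disassembly later in this thread; the struct layout and the helpers are simplified stand-ins, not the verbatim DPDK code.

    #include <stdint.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Simplified stand-in for the DPDK ring head/tail structure. */
    struct rte_ring_headtail {
            volatile uint32_t head;
            volatile uint32_t tail;
    };

    static inline void rte_pause(void) { } /* empty on some archs, see above */

    static inline void
    update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
                uint32_t new_val, int single)
    {
            if (!single) {
                    /*
                     * Before the patch: a plain read, which today happens
                     * to compile to a single 32-bit load but is formally a
                     * data race (undefined behaviour) under the C11 model:
                     *
                     *     while (unlikely(ht->tail != old_val))
                     *             rte_pause();
                     *
                     * After the patch: an explicit relaxed atomic load,
                     * which documents and guarantees the atomicity.
                     */
                    while (unlikely(old_val !=
                                    __atomic_load_n(&ht->tail,
                                                    __ATOMIC_RELAXED)))
                            rte_pause();
            }
            /* Release store pairs with the load-acquire on the reader side. */
            __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
    }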
> > > > > > > That's a 32-bit value load. AFAIK, on all CPUs that we support it is an atomic operation.
> > > > > > > [Ola] But the fact that the ordinary C load is translated to an atomic load for the target architecture is incidental.
> > > > > > > If the design requires an atomic load (which is the case here), we should use an atomic load at the language level. Then we can be sure it will always be translated to an atomic load for the target in question, or compilation will fail. We don't have to depend on assumptions.
> > > > > >
> > > > > > We all know that 32-bit loads/stores on the CPUs we support are atomic. If that weren't the case, DPDK would be broken in dozens of places. So what is the point of pretending that "it might not be atomic" when we do know for sure that it is?
> > > > > > I do understand that you want to use atomic_load(relaxed) here for consistency and to conform with the C11 memory model, and I don't see any harm in that.
> > > > >
> > > > > We can continue to discuss the topic, it is a good discussion. But, as far as this patch is concerned, can I consider this as us having a consensus? The file rte_ring_c11_mem.h is specifically for the C11 memory model, and I also do not see any harm in having code that completely conforms to the C11 memory model.
> > > >
> > > > Have you guys checked the output assembly with and without the atomic load? There is an extra "add" instruction, at least with the code I have checked. I think the compiler is not smart enough to understand it is dead code for arm64.
> > > >
> > > > ➜ [~] $ aarch64-linux-gnu-gcc -v
> > > > Using built-in specs.
> > > > COLLECT_GCC=aarch64-linux-gnu-gcc
> > > > COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
> > > > Target: aarch64-linux-gnu
> > > > Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure --prefix=/usr --program-prefix=aarch64-linux-gnu- --with-local-prefix=/usr/aarch64-linux-gnu --with-sysroot=/usr/aarch64-linux-gnu --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib --libexecdir=/usr/lib --target=aarch64-linux-gnu --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls --enable-languages=c,c++ --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
> > > > Thread model: posix
> > > > gcc version 8.2.0 (GCC)
> > > >
> > > > # build setup
> > > > make -j 8 config T=arm64-armv8a-linuxapp-gcc CROSS=aarch64-linux-gnu-
> > > > make -j 8 test-build CROSS=aarch64-linux-gnu-
> > > >
> > > > # generate asm
> > > > aarch64-linux-gnu-gdb -batch -ex 'file build/app/test' -ex 'disassemble /rs bucket_enqueue_single'
> > > >
> > > > I have uploaded the generated files for your convenience.
> > > >
> > > > with_atomic_load.txt (includes patches 1, 2, 3)
> > > > -----------------------
> > > > https://pastebin.com/SQ6w1yRu
> > > >
> > > > without_atomic_load.txt (includes patches 2, 3)
> > > > -----------------------
> > > > https://pastebin.com/BpvnD0CA
> > > >
> > > > without_atomic
> > > > --------------
> > > > 23          if (!single)
> > > >    0x000000000068d290 <+240>:  85 00 00 35  cbnz  w5, 0x68d2a0 <bucket_enqueue_single+256>
> > > >    0x000000000068d294 <+244>:  82 04 40 b9  ldr   w2, [x4, #4]
> > > >    0x000000000068d298 <+248>:  5f 00 01 6b  cmp   w2, w1
> > > >    0x000000000068d29c <+252>:  21 01 00 54  b.ne  0x68d2c0 <bucket_enqueue_single+288>  // b.any
> > > >
> > > > 24              while (unlikely(ht->tail != old_val))
> > > > 25                      rte_pause();
> > > >
> > > > with_atomic
> > > > -----------
> > > > 23          if (!single)
> > > >    0x000000000068ceb0 <+240>:  00 10 04 91  add   x0, x0, #0x104
> > > >    0x000000000068ceb4 <+244>:  84 00 00 35  cbnz  w4, 0x68cec4 <bucket_enqueue_single+260>
> > > >    0x000000000068ceb8 <+248>:  02 00 40 b9  ldr   w2, [x0]
> > > >    0x000000000068cebc <+252>:  3f 00 02 6b  cmp   w1, w2
> > > >    0x000000000068cec0 <+256>:  01 09 00 54  b.ne  0x68cfe0 <bucket_enqueue_single+544>  // b.any
> > > >
> > > > 24              while (unlikely(old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED)))
> > > >
> > > > I don't want to block this series of patches due to this patch. Can we respin the series with patches 2 and 3, and wait for patch 1 to conclude?
> > > >
> > > > Thoughts?
> > > >
> > > > > > But the argument that we shouldn't assume 32-bit load/store ops are atomic sounds a bit flaky to me.
> > > > > > Konstantin
> > > > > >
> > > > > > > > But there are no memory ordering requirements (with regard to other loads and/or stores by this thread), so relaxed memory order is sufficient.
> > > > > > > > Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist etc); it has to be done as specified in the source code, which is also what we need here.
> > > > > > >
> > > > > > > I think Jerin's point is that rte_pause() acts here as a compiler barrier too, so no need to worry that the compiler would optimise out the loop.
> > > > > > > [Ola] Sorry, missed that. But the barrier behaviour of rte_pause() is not part of C11; it is essentially a hand-made feature to support the legacy multithreaded memory model (which uses explicit HW and compiler barriers). I'd prefer code using the C11 memory model not to depend on such legacy features.
> > > > > > >
> > > > > > > Konstantin
> > > > > > >
> > > > > > > > One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting with getting the C11 code right when people are less excited about sending a release out?
> > > > > > > >
> > > > > > > > We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is separated cleanly into C11-specific files anyway.
> > > > > > > >
> > > > > > > > I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved), and we require an atomic load (any write to ht->tail must also be atomic).
> > > > > > > >
> > > > > > > > Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things. There's a reason why C11/C++11 moved away from the old ways.
> > > > > > > >
> > > > > > > > > >                 __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > > > > > > --
> > > > > > > > > > 2.7.4
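As a footnote to the disassembly comparison above, the extra ADD can be reproduced in isolation. This is a minimal standalone sketch (the struct and function names are invented for the example); compiled for AArch64 with GCC 8.2 at -O2, it is expected to show the same LDR vs. ADD + LDR difference reported in this thread.

    #include <stdint.h>

    struct headtail {
            uint32_t head;
            uint32_t tail;  /* non-zero offset from the struct base */
    };

    /* Plain load: the offset folds into the addressing mode,
     * e.g. "ldr w0, [x0, #4]". */
    uint32_t plain_read(const struct headtail *ht)
    {
            return ht->tail;
    }

    /* Relaxed atomic load: GCC 8.2's expansion reportedly uses a bare
     * base register, so "add x0, x0, #4" is emitted before
     * "ldr w0, [x0]", matching the with_atomic disassembly above. */
    uint32_t atomic_read(const struct headtail *ht)
    {
            return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
    }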