On 06/10/2018, 09:42, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Fri, 5 Oct 2018 20:34:15 +0000
    > From: Ola Liljedahl <ola.liljed...@arm.com>
    > To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>, Jerin Jacob
    >  <jerin.ja...@caviumnetworks.com>
    > CC: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Gavin Hu (Arm
    >  Technology China)" <gavin...@arm.com>, "dev@dpdk.org" <dev@dpdk.org>,
    >  Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org"
    >  <sta...@dpdk.org>
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > user-agent: Microsoft-MacOutlook/10.10.0.180812
    > 
    > External Email
    > 
    > On 05/10/2018, 22:29, "Honnappa Nagarahalli" 
<honnappa.nagaraha...@arm.com> wrote:
    > 
    >     >
    >     > I doubt it is possible to benchmark with such a precision so to see 
the
    >     > potential difference of one ADD instruction.
    >     > Just changes in function alignment can affect performance by 
percents. And
    >     > the natural variation when not using a 100% deterministic system is 
going to
    >     > be a lot larger than one cycle per ring buffer operation.
    >     >
    >     > Some of the other patches are also for correctness (e.g. 
load-acquire of tail)
    >     The discussion is about this patch alone. Other patches are already 
Acked.
    > So the benchmarking then makes zero sense.
    
    Why ?
Because the noise in benchmarking (due to e.g. non-deterministic systems, 
potential changes in function and loop alignment) will be much larger than 1 
cycle additional overhead per ring buffer operation. What will benchmarking 
tell us about the performance impact of this change that adds one ALU operation?
    
    
    > 
    > 
    >     > so while performance measurements may be interesting, we can't skip 
a bug
    >     > fix just because it proves to decrease performance.
    >     IMO, this patch is not a bug fix - in terms of it fixing any failures 
with the current code.
    > It's a fix for correctness. Per the C++11 (and probably C11 as well due 
to the shared memory model), we have undefined behaviour here. If the compiler 
detects UB, it is allowed to do anything. Current compilers might not exploit 
this but future compilers could.
    
    All I am saying this, The code is not same and compiler(the very latest
    gcc 8.2) is not smart enough understand it is a dead code.
What code is dead? The ADD instruction has a purpose, it is adding an offset 
(from ring buffer start to the tail field) to a base pointer. It's is merely 
(most likely) not the optimal code sequence for any ARM processor.

 I think,
    The moment any __builtin_gcc comes the compiler add predefined template
    which has additional "add" instruction.
I suspect the add instruction is because this atomic_load operates on a struct 
member at a non-zero offset and GCC's "template(s)" for atomic operations don't 
support register + immediate offset (because basically on AArch64/A64 ISA, all 
atomic operations except atomic_load(RELAXED) and atomic_store(RELAXED) only 
support the addressing mode base register without offset). I would be surprised 
if this minor instance of non-optimal code generation couldn't be corrected in 
the compiler.

    I think this specific case,
    we ALL know that, 
    a) ht->tail will be 32 bit for life long of DPDK, it will be atomic in
    all DPDK supported processors
    b) The rte_pause() down which never make and compiler reordering etc.
For 32-bit ARM and 64-bit POWER (ppc64), the rte_pause() implementation looks 
like this (DPDK 18.08-rc0):
static inline void rte_pause(void)
{
}

How does calling this function prevent compiler optimisations e.g. of the loop 
or of surrounding memory accesses?
Is rte_pause() supposed to behave like some kind of compiler barrier? I can't 
see anything in the DPDK documentation for rte_pause() that claims this.


    so why to loose one cycle at worst case? It is easy loose one cycle and it 
very
    difficult to get one back in fastpath.
I suggest you read up on why undefined behaviour is bad in C. Have a chat with 
Andrew Pinski.

Not depending on the compiler memory barrier in rte_pause() would allow the 
compiler to make optimisations (e.g. schedule loads earlier) that actually 
increase performance. Since the atomic load of ht->tail here has relaxed MO, 
the compiler is allowed hoist later loads (and stores) ahead of it (and also 
push down and/or merge with stores after the load of ht->tail). But the 
compiler (memory) barrier in (that supposedly is part of) rte_pause() prevents 
such optimisations (well some accesses could be pushed down between the atomic 
load and the compiler barrier but not further than that).

A C compiler that supports C11 and beyond implements the C11 memory model. The 
compiler understands the memory model and can optimise memory accesses 
according to the semantics of the model and the ordering directives in the 
code. Atomic operations using ATOMIC_SEQ_CST, ATOMIC_ACQUIRE, ATOMIC_RELEASE 
(and ATOMIC_ACQ_REL, ignoring ATOMIC_CONSUME here) each allow and disallow 
certain kinds of movements and other optimisations of memory accesses (loads 
and stores, I assume prefetches are also included). Atomic operations with 
ATOMIC_RELAXED don't impose any ordering constraints so give maximum 
flexibility to the compiler. Using a compiler memory barrier (e.g. asm volatile 
("":::"memory")) is a much more brutal way of constraining the compiler.
    
    
    > 
    > 
    > 
    >     >
    >     > -- Ola
    >     >
    >     > On 05/10/2018, 22:06, "Honnappa Nagarahalli"
    >     > <honnappa.nagaraha...@arm.com> wrote:
    >     >
    >     >     Hi Jerin,
    >     >           Thank you for generating the disassembly, that is really 
helpful. I
    >     > agree with you that we have the option of moving parts 2 and 3 
forward. I
    >     > will let Gavin take a decision.
    >     >
    >     >     I suggest that we run benchmarks on this patch alone and in 
combination
    >     > with other patches in the series. We have few Arm machines and we 
will run
    >     > on all of them along with x86. We take a decision based on that.
    >     >
    >     >     Would that be a way to move forward? I think this should 
address both
    >     > your and Ola's concerns.
    >     >
    >     >     I am open for other suggestions as well.
    >     >
    >     >     Thank you,
    >     >     Honnappa
    >     >
    >     >     >
    >     >     > So you don't want to write the proper C11 code because the 
compiler
    >     >     > generates one extra instruction that way?
    >     >     > You don't even know if that one extra instruction has any 
measurable
    >     >     > impact on performance. E.g. it could be issued the cycle 
before together
    >     >     > with other instructions.
    >     >     >
    >     >     > We can complain to the compiler writers that the code 
generation for
    >     >     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least 
on
    >     >     > ARM/A64). I think the problem is that the __atomic builtins 
only accept
    >     > a
    >     >     > base address without any offset and this is possibly because 
e.g.
    >     > load/store
    >     >     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release 
(STLR)
    >     > only
    >     >     > accept a base register with no offset. So any offset has to 
be added
    >     > before
    >     >     > the actual "atomic" instruction, LDR in this case.
    >     >     >
    >     >     >
    >     >     > -- Ola
    >     >     >
    >     >     >
    >     >     > On 05/10/2018, 19:07, "Jerin Jacob" 
<jerin.ja...@caviumnetworks.com>
    >     >     > wrote:
    >     >     >
    >     >     >     -----Original Message-----
    >     >     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
    >     >     >     > From: Honnappa Nagarahalli 
<honnappa.nagaraha...@arm.com>
    >     >     >     > To: "Ananyev, Konstantin" 
<konstantin.anan...@intel.com>, Ola
    >     >     > Liljedahl
    >     >     >     >  <ola.liljed...@arm.com>, "Gavin Hu (Arm Technology 
China)"
    >     >     >     >  <gavin...@arm.com>, Jerin Jacob
    >     > <jerin.ja...@caviumnetworks.com>
    >     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
    >     >     > <steve.cap...@arm.com>, nd
    >     >     >     >  <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
    >     >     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using 
atomic load
    >     >     >     >
    >     >     >     > > >         > Hi Jerin,
    >     >     >     > > >         >
    >     >     >     > > >         > Thanks for your review, inline comments 
from our
    >     > internal
    >     >     >     > > discussions.
    >     >     >     > > >         >
    >     >     >     > > >         > BR. Gavin
    >     >     >     > > >         >
    >     >     >     > > >         > > -----Original Message-----
    >     >     >     > > >         > > From: Jerin Jacob 
<jerin.ja...@caviumnetworks.com>
    >     >     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 
PM
    >     >     >     > > >         > > To: Gavin Hu (Arm Technology China)
    >     > <gavin...@arm.com>
    >     >     >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
    >     >     >     > > >         > > <honnappa.nagaraha...@arm.com>; Steve 
Capper
    >     >     >     > > >         > > <steve.cap...@arm.com>; Ola Liljedahl
    >     >     > <ola.liljed...@arm.com>;
    >     >     >     > > nd
    >     >     >     > > >         > > <n...@arm.com>; sta...@dpdk.org
    >     >     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read 
tail using atomic
    >     > load
    >     >     >     > > >         > >
    >     >     >     > > >         > > -----Original Message-----
    >     >     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
    >     >     >     > > >         > > > From: Gavin Hu <gavin...@arm.com>
    >     >     >     > > >         > > > To: dev@dpdk.org
    >     >     >     > > >         > > > CC: gavin...@arm.com,
    >     > honnappa.nagaraha...@arm.com,
    >     >     >     > > >         > > > steve.cap...@arm.com,  
ola.liljed...@arm.com,
    >     >     >     > > >         > > > jerin.ja...@caviumnetworks.com, 
n...@arm.com,
    >     >     >     > > sta...@dpdk.org
    >     >     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read 
tail using atomic
    >     > load
    >     >     >     > > >         > > > X-Mailer: git-send-email 2.7.4
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > External Email
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > In update_tail, read ht->tail using
    >     >     > __atomic_load.Although the
    >     >     >     > > >         > > > compiler currently seems to be doing 
the right thing
    >     > even
    >     >     > without
    >     >     >     > > >         > > > _atomic_load, we don't want to give 
the compiler
    >     >     > freedom to
    >     >     >     > > optimise
    >     >     >     > > >         > > > what should be an atomic load, it 
should not be
    >     > arbitarily
    >     >     > moved
    >     >     >     > > >         > > > around.
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce 
C11 memory model
    >     >     > barrier
    >     >     >     > > option")
    >     >     >     > > >         > > > Cc: sta...@dpdk.org
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > Signed-off-by: Gavin Hu 
<gavin...@arm.com>
    >     >     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
    >     >     >     > > <honnappa.nagaraha...@arm.com>
    >     >     >     > > >         > > > Reviewed-by: Steve Capper 
<steve.cap...@arm.com>
    >     >     >     > > >         > > > Reviewed-by: Ola Liljedahl 
<ola.liljed...@arm.com>
    >     >     >     > > >         > > > ---
    >     >     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 
3 ++-
    >     >     >     > > >         > > >  1 file changed, 2 insertions(+), 1 
deletion(-)
    >     >     >     > > >         > > >
    >     >     >     > > >         > The read of ht->tail needs to be atomic, 
a non-atomic
    >     > read
    >     >     > would not
    >     >     >     > > be correct.
    >     >     >     > > >
    >     >     >     > > >         That's a 32bit value load.
    >     >     >     > > >         AFAIK on all CPUs that we support it is an 
atomic operation.
    >     >     >     > > >     [Ola] But that the ordinary C load is 
translated to an atomic
    >     > load
    >     >     > for the
    >     >     >     > > target architecture is incidental.
    >     >     >     > > >
    >     >     >     > > >     If the design requires an atomic load (which is 
the case here),
    >     > we
    >     >     >     > > > should use an atomic load on the language level. 
Then we can
    >     > be
    >     >     > sure it will
    >     >     >     > > always be translated to an atomic load for the target 
in question
    >     > or
    >     >     >     > > compilation will fail. We don't have to depend on 
assumptions.
    >     >     >     > >
    >     >     >     > > We all know that 32bit load/store on cpu we support - 
are atomic.
    >     >     >     > > If it wouldn't be the case - DPDK would be broken in 
dozen places.
    >     >     >     > > So what the point to pretend that "it might be not 
atomic" if we
    >     > do
    >     >     > know for
    >     >     >     > > sure that it is?
    >     >     >     > > I do understand that you want to use 
atomic_load(relaxed) here
    >     > for
    >     >     >     > > consistency, and to conform with C11 mem-model and I 
don't see
    >     > any
    >     >     > harm in
    >     >     >     > > that.
    >     >     >     > We can continue to discuss the topic, it is a good 
discussion. But, as
    >     > far
    >     >     > this patch is concerned, can I consider this as us having a 
consensus?
    >     > The
    >     >     > file rte_ring_c11_mem.h is specifically for C11 memory model 
and I also
    >     > do
    >     >     > not see any harm in having code that completely conforms to 
C11
    >     > memory
    >     >     > model.
    >     >     >
    >     >     >     Have you guys checked the output assembly with and 
without atomic
    >     >     > load?
    >     >     >     There is an extra "add" instruction with at least the 
code I have
    >     > checked.
    >     >     >     I think, compiler is not smart enough to understand it is 
a dead code
    >     > for
    >     >     >     arm64.
    >     >     >
    >     >     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
    >     >     >     Using built-in specs.
    >     >     >     COLLECT_GCC=aarch64-linux-gnu-gcc
    >     >     >     
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
    >     >     > wrapper
    >     >     >     Target: aarch64-linux-gnu
    >     >     >     Configured with: 
/build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
    >     >     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
    >     >     >     --with-local-prefix=/usr/aarch64-linux-gnu
    >     >     >     --with-sysroot=/usr/aarch64-linux-gnu
    >     >     >     --with-build-sysroot=/usr/aarch64-linux-gnu 
--libdir=/usr/lib
    >     >     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
    >     >     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu 
--disable-nls
    >     >     >     --enable-languages=c,c++ --enable-shared 
--enable-threads=posix
    >     >     >     --with-system-zlib --with-isl --enable-__cxa_atexit
    >     >     >     --disable-libunwind-exceptions --enable-clocale=gnu
    >     >     >     --disable-libstdcxx-pch --disable-libssp 
--enable-gnu-unique-object
    >     >     >     --enable-linker-build-id --enable-lto --enable-plugin
    >     >     >     --enable-install-libiberty --with-linker-hash-style=gnu
    >     >     >     --enable-gnu-indirect-function --disable-multilib 
--disable-werror
    >     >     >     --enable-checking=release
    >     >     >     Thread model: posix
    >     >     >     gcc version 8.2.0 (GCC)
    >     >     >
    >     >     >
    >     >     >     # build setup
    >     >     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  
CROSS=aarch64-linux-
    >     > gnu-
    >     >     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
    >     >     >
    >     >     >     # generate asm
    >     >     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' 
-ex
    >     > 'disassemble /rs
    >     >     > bucket_enqueue_single'
    >     >     >
    >     >     >     I have uploaded generated file for your convenience
    >     >     >     with_atomic_load.txt(includes patch 1,2,3)
    >     >     >     -----------------------
    >     >     >     https://pastebin.com/SQ6w1yRu
    >     >     >
    >     >     >     without_atomic_load.txt(includes patch 2,3)
    >     >     >     -----------------------
    >     >     >     https://pastebin.com/BpvnD0CA
    >     >     >
    >     >     >
    >     >     >     without_atomic
    >     >     >     -------------
    >     >     >     23              if (!single)
    >     >     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    
w5, 0x68d2a0
    >     >     > <bucket_enqueue_single+256>
    >     >     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     
w2, [x4, #4]
    >     >     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     
w2, w1
    >     >     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    
0x68d2c0
    >     >     > <bucket_enqueue_single+288>  // b.any
    >     >     >
    >     >     >     24                      while (unlikely(ht->tail != 
old_val))
    >     >     >     25                              rte_pause();
    >     >     >
    >     >     >
    >     >     >     with_atomic
    >     >     >     -----------
    >     >     >     23              if (!single)
    >     >     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     
x0, x0, #0x104
    >     >     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    
w4, 0x68cec4
    >     >     > <bucket_enqueue_single+260>
    >     >     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     
w2, [x0]
    >     >     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     
w1, w2
    >     >     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    
0x68cfe0
    >     >     > <bucket_enqueue_single+544>  // b.any
    >     >     >
    >     >     >     24                      while (unlikely(old_val != 
__atomic_load_n(&ht->tail,
    >     >     > __ATOMIC_RELAXED)))
    >     >     >
    >     >     >
    >     >     >     I don't want to block this series of patches due this 
patch. Can we
    >     > make
    >     >     >     re spin one series with 2 and 3 patches. And Wait for 
patch 1 to
    >     > conclude?
    >     >     >
    >     >     >     Thoughts?
    >     >     >
    >     >     >
    >     >     >
    >     >     >
    >     >     >     >
    >     >     >     > > But argument that we shouldn't assume 32bit 
load/store ops as
    >     >     > atomic
    >     >     >     > > sounds a bit flaky to me.
    >     >     >     > > Konstantin
    >     >     >     > >
    >     >     >     > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >         > But there are no memory ordering 
requirements (with
    >     >     >     > > >         > regards to other loads and/or stores by 
this thread) so
    >     >     > relaxed
    >     >     >     > > memory order is sufficient.
    >     >     >     > > >         > Another aspect of using __atomic_load_n() 
is that the
    >     >     >     > > > compiler cannot "optimise" this load (e.g. combine, 
hoist etc), it
    >     > has
    >     >     > to be
    >     >     >     > > done as
    >     >     >     > > >         > specified in the source code which is 
also what we need
    >     > here.
    >     >     >     > > >
    >     >     >     > > >         I think Jerin points that rte_pause() acts 
here as compiler
    >     >     > barrier too,
    >     >     >     > > >         so no need to worry that compiler would 
optimize out the
    >     > loop.
    >     >     >     > > >     [Ola] Sorry missed that. But the barrier 
behaviour of
    >     > rte_pause()
    >     >     >     > > > is not part of C11, is it essentially a hand-made 
feature to
    >     > support
    >     >     >     > > > the legacy multithreaded memory model (which uses 
explicit
    >     > HW
    >     >     > and
    >     >     >     > > compiler barriers). I'd prefer code using the C11 
memory model
    >     > not to
    >     >     >     > > depend on such legacy features.
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >         Konstantin
    >     >     >     > > >
    >     >     >     > > >         >
    >     >     >     > > >         > One point worth mentioning though is that 
this change is
    >     > for
    >     >     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. 
It may be
    >     > worth
    >     >     > persisting
    >     >     >     > > >         > with getting the C11 code right when 
people are less
    >     > excited
    >     >     > about
    >     >     >     > > sending a release out?
    >     >     >     > > >         >
    >     >     >     > > >         > We can explain that for C11 we would 
prefer to do loads
    >     > and
    >     >     > stores
    >     >     >     > > as per the C11 memory model. In the case of rte_ring, 
the code is
    >     >     >     > > >         > separated cleanly into C11 specific files 
anyway.
    >     >     >     > > >         >
    >     >     >     > > >         > I think reading ht->tail using 
__atomic_load_n() is the
    >     > most
    >     >     >     > > appropriate way. We show that ht->tail is used for
    >     > synchronization,
    >     >     > we
    >     >     >     > > >         > acknowledge that ht->tail may be written 
by other
    >     > threads
    >     >     >     > > > without any other kind of synchronization (e.g. no 
lock involved)
    >     >     > and we
    >     >     >     > > require
    >     >     >     > > >         > an atomic load (any write to ht->tail 
must also be atomic).
    >     >     >     > > >         >
    >     >     >     > > >         > Using volatile and explicit compiler (or 
processor)
    >     > memory
    >     >     > barriers
    >     >     >     > > (fences) is the legacy pre-C11 way of accomplishing 
these things.
    >     >     >     > > > There's
    >     >     >     > > >         > a reason why C11/C++11 moved away from 
the old ways.
    >     >     >     > > >         > > >
    >     >     >     > > >         > > >         __atomic_store_n(&ht->tail, 
new_val,
    >     >     > __ATOMIC_RELEASE);
    >     >     >     > > >         > > > --
    >     >     >     > > >         > > > 2.7.4
    >     >     >     > > >         > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     >
    >     >     >
    >     >
    >     >
    > 
    > 
    > 
    

Reply via email to