-----Original Message-----
> Date: Sat, 6 Oct 2018 19:44:35 +0000
> From: Ola Liljedahl <[email protected]>
> To: Jerin Jacob <[email protected]>, "[email protected]"
> <[email protected]>
> CC: Honnappa Nagarahalli <[email protected]>, "Ananyev,
> Konstantin" <[email protected]>, "Gavin Hu (Arm Technology
> China)" <[email protected]>, Steve Capper <[email protected]>, nd
> <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.10.0.180812
>
>
> On 06/10/2018, 09:42, "Jerin Jacob" <[email protected]> wrote:
>
> -----Original Message-----
> > Date: Fri, 5 Oct 2018 20:34:15 +0000
> > From: Ola Liljedahl <[email protected]>
> > To: Honnappa Nagarahalli <[email protected]>, Jerin Jacob
> > <[email protected]>
> > CC: "Ananyev, Konstantin" <[email protected]>, "Gavin Hu (Arm
> > Technology China)" <[email protected]>, "[email protected]" <[email protected]>,
> > Steve Capper <[email protected]>, nd <[email protected]>,
> "[email protected]"
> > <[email protected]>
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > user-agent: Microsoft-MacOutlook/10.10.0.180812
> >
> > On 05/10/2018, 22:29, "Honnappa Nagarahalli"
> <[email protected]> wrote:
> >
> > >
> > > I doubt it is possible to benchmark with such precision as to see the
> > > potential difference of one ADD instruction.
> > > Just changes in function alignment can affect performance by several
> > > percent. And the natural variation when not using a 100% deterministic
> > > system is going to be a lot larger than one cycle per ring buffer
> > > operation.
> > >
> > > Some of the other patches are also for correctness (e.g.
> load-acquire of tail)
> > The discussion is about this patch alone. Other patches are already
> Acked.
> > So the benchmarking then makes zero sense.
>
> Why?
> Because the noise in benchmarking (due to e.g. non-deterministic systems,
> potential changes in function and loop alignment) will be much larger than 1
> cycle additional overhead per ring buffer operation. What will benchmarking
> tell us about the performance impact of this change that adds one ALU
> operation?
Yes, there will be noise in benchmarking. That is the reason why I checked
only the generated assembly code of this patch, and found the LDR vs LDR + ADD
case. How much overhead it adds depends entirely on the microarchitecture:
how many issue ports it has, constraints on executing LD/ST on specific issue
ports, etc.
In any case, LDR alone will be better than LDR + ADD on any
microarchitecture.
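
To make the comparison concrete, a minimal standalone sketch (the struct
layout and function names here are illustrative, not the exact rte_ring
definitions):

    #include <stdint.h>

    struct headtail {
        uint32_t head;
        uint32_t tail;   /* lives at a non-zero offset from the base */
    };

    /* Plain C load: the offset folds into the addressing mode,
     * e.g.  ldr w0, [x0, #4]
     */
    uint32_t read_tail_plain(struct headtail *ht)
    {
        return ht->tail;
    }

    /* Relaxed atomic load: gcc 8.2 materialises the address first,
     * e.g.  add x0, x0, #4 ; ldr w0, [x0]
     * as seen in the disassembly further down in this thread.
     */
    uint32_t read_tail_atomic(struct headtail *ht)
    {
        return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
    }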
>
>
> >
> >
> > > so while performance measurements may be interesting, we can't
> skip a bug
> > > fix just because it proves to decrease performance.
> > IMO, this patch is not a bug fix - in terms of it fixing any
> failures with the current code.
> > It's a fix for correctness. Per the C++11 standard (and probably C11 as well, due
> to the shared memory model), we have undefined behaviour here. If the
> compiler detects UB, it is allowed to do anything. Current compilers might
> not exploit this but future compilers could.
>
> All I am saying is this: the code is not the same, and the compiler (the very
> latest gcc 8.2) is not smart enough to understand that it is dead code.
> What code is dead? The ADD instruction has a purpose, it is adding an offset
> (from ring buffer start to the tail field) to a base pointer. It is merely
> (most likely) not the optimal code sequence for any ARM processor.
>
> I think
> the moment any gcc __atomic builtin is used, the compiler emits a predefined
> template which has the additional "add" instruction.
> I suspect the add instruction is because this atomic_load operates on a
> struct member at a non-zero offset and GCC's "template(s)" for atomic
> operations don't support register + immediate offset (because basically on
> AArch64/A64 ISA, all atomic operations except atomic_load(RELAXED) and
> atomic_store(RELAXED) only support the addressing mode base register without
> offset). I would be surprised if this minor instance of non-optimal code
> generation couldn't be corrected in the compiler.
>
> I think in this specific case,
> we ALL know that:
> a) ht->tail will be 32 bit for the lifetime of DPDK, and it will be atomic on
> all DPDK-supported processors
> b) the rte_pause() further down prevents any compiler reordering etc.
> For 32-bit ARM and 64-bit POWER (ppc64), the rte_pause() implementation looks
> like this (DPDK 18.08-rc0):
> static inline void rte_pause(void)
> {
> }
>
> How does calling this function prevent compiler optimisations e.g. of the
> loop or of surrounding memory accesses?
> Is rte_pause() supposed to behave like some kind of compiler barrier? I can't
> see anything in the DPDK documentation for rte_pause() that claims this.
How about fixing rte_pause() then?
Meaning, issuing power-saving instructions on the archs where it is currently
missing.
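
A minimal sketch of what such a fix could look like for arm64 (the choice of
the YIELD hint instruction is an assumption on my part; the "memory" clobber
also keeps it an explicit compiler barrier):

    static inline void rte_pause(void)
    {
    #if defined(__aarch64__)
        /* power-saving hint to the core, plus a compiler barrier */
        asm volatile("yield" ::: "memory");
    #else
        /* at minimum, a compiler barrier */
        asm volatile("" ::: "memory");
    #endif
    }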
>
>
> so why lose one cycle in the worst case? It is easy to lose one cycle and
> very difficult to get one back in the fastpath.
> I suggest you read up on why undefined behaviour is bad in C. Have a chat
> with Andrew Pinski.
>
> Not depending on the compiler memory barrier in rte_pause() would allow the
> compiler to make optimisations (e.g. schedule loads earlier) that actually
> increase performance. Since the atomic load of ht->tail here has relaxed MO,
> the compiler is allowed hoist later loads (and stores) ahead of it (and also
> push down and/or merge with stores after the load of ht->tail). But the
> compiler (memory) barrier in (that supposedly is part of) rte_pause()
> prevents such optimisations (well some accesses could be pushed down between
> the atomic load and the compiler barrier but not further than that).
>
> A C compiler that supports C11 and beyond implements the C11 memory model.
> The compiler understands the memory model and can optimise memory accesses
> according to the semantics of the model and the ordering directives in the
> code. Atomic operations using ATOMIC_SEQ_CST, ATOMIC_ACQUIRE, ATOMIC_RELEASE
> (and ATOMIC_ACQ_REL, ignoring ATOMIC_CONSUME here) each allow and disallow
> certain kinds of movements and other optimisations of memory accesses (loads
> and stores, I assume prefetches are also included). Atomic operations with
> ATOMIC_RELAXED don't impose any ordering constraints so give maximum
> flexibility to the compiler. Using a compiler memory barrier (e.g. asm
> volatile ("":::"memory")) is a much more brutal way of constraining the
> compiler.
In the arm64 case, it will have the ATOMIC_RELAXED load followed by the asm
volatile ("":::"memory") of rte_pause().
I wouldn't have any issue if the generated code were the same as or better than
the existing case. But that is not the case, right?
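
For reference, the two variants of the wait loop being compared, as a sketch
reconstructed from the source lines shown in the disassembly below (the
surrounding update_tail() code is omitted):

    /* existing code: plain read; relies on rte_pause() acting as a
     * compiler barrier to force a fresh load every iteration */
    while (unlikely(ht->tail != old_val))
        rte_pause();

    /* this patch: explicit relaxed atomic load */
    while (unlikely(old_val != __atomic_load_n(&ht->tail,
                    __ATOMIC_RELAXED)))
        rte_pause();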
>
>
> >
> >
> >
> > >
> > > -- Ola
> > >
> > > On 05/10/2018, 22:06, "Honnappa Nagarahalli"
> > > <[email protected]> wrote:
> > >
> > > Hi Jerin,
> > > Thank you for generating the disassembly, that is
> really helpful. I
> > > agree with you that we have the option of moving parts 2 and 3
> forward. I
> > > will let Gavin take a decision.
> > >
> > > I suggest that we run benchmarks on this patch alone and in combination
> > > with other patches in the series. We have a few Arm machines and we will
> > > run on all of them along with x86. We will take a decision based on that.
> > >
> > > Would that be a way to move forward? I think this should
> address both
> > > your and Ola's concerns.
> > >
> > > I am open for other suggestions as well.
> > >
> > > Thank you,
> > > Honnappa
> > >
> > > >
> > > > So you don't want to write the proper C11 code because the
> compiler
> > > > generates one extra instruction that way?
> > > > You don't even know if that one extra instruction has any
> measurable
> > > > impact on performance. E.g. it could be issued the cycle
> before together
> > > > with other instructions.
> > > >
> > > > We can complain to the compiler writers that the code generation for
> > > > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
> > > > ARM/A64). I think the problem is that the __atomic builtins only accept
> > > > a base address without any offset, and this is possibly because e.g.
> > > > load/store exclusive (LDX/STX), load-acquire (LDAR) and store-release
> > > > (STLR) only accept a base register with no offset. So any offset has to
> > > > be added before the actual "atomic" instruction, LDR in this case.
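> > > >
> > > > To illustrate in A64 (a sketch, with an illustrative offset):
> > > >
> > > >     ldr  w1, [x0, #4]    // plain load: base + immediate offset is fine
> > > >     add  x0, x0, #4      // LDAR accepts only a bare base register,
> > > >     ldar w1, [x0]        // so any offset must be added beforehand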
> > > >
> > > >
> > > > -- Ola
> > > >
> > > >
> > > > On 05/10/2018, 19:07, "Jerin Jacob"
> <[email protected]>
> > > > wrote:
> > > >
> > > > -----Original Message-----
> > > > > Date: Fri, 5 Oct 2018 15:11:44 +0000
> > > > > From: Honnappa Nagarahalli
> <[email protected]>
> > > > > To: "Ananyev, Konstantin"
> <[email protected]>, Ola
> > > > Liljedahl
> > > > > <[email protected]>, "Gavin Hu (Arm Technology
> China)"
> > > > > <[email protected]>, Jerin Jacob
> > > <[email protected]>
> > > > > CC: "[email protected]" <[email protected]>, Steve Capper
> > > > <[email protected]>, nd
> > > > > <[email protected]>, "[email protected]" <[email protected]>
> > > > > Subject: RE: [PATCH v3 1/3] ring: read tail using
> atomic load
> > > > >
> > > > > > > > Hi Jerin,
> > > > > > > >
> > > > > > > > Thanks for your review, inline comments
> from our
> > > internal
> > > > > > discussions.
> > > > > > > >
> > > > > > > > BR. Gavin
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob
> <[email protected]>
> > > > > > > > > Sent: Saturday, September 29, 2018
> 6:49 PM
> > > > > > > > > To: Gavin Hu (Arm Technology China)
> > > <[email protected]>
> > > > > > > > > Cc: [email protected]; Honnappa Nagarahalli
> > > > > > > > > <[email protected]>; Steve
> Capper
> > > > > > > > > <[email protected]>; Ola Liljedahl
> > > > <[email protected]>;
> > > > > > nd
> > > > > > > > > <[email protected]>; [email protected]
> > > > > > > > > Subject: Re: [PATCH v3 1/3] ring:
> read tail using atomic
> > > load
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > > Date: Mon, 17 Sep 2018 16:17:22
> +0800
> > > > > > > > > > From: Gavin Hu <[email protected]>
> > > > > > > > > > To: [email protected]
> > > > > > > > > > CC: [email protected],
> > > [email protected],
> > > > > > > > > > [email protected],
> [email protected],
> > > > > > > > > > [email protected],
> [email protected],
> > > > > > [email protected]
> > > > > > > > > > Subject: [PATCH v3 1/3] ring: read
> tail using atomic
> > > load
> > > > > > > > > > X-Mailer: git-send-email 2.7.4
> > > > > > > > > >
> > > > > > > > > > > In update_tail, read ht->tail using __atomic_load. Although
> > > > > > > > > > > the compiler currently seems to be doing the right thing even
> > > > > > > > > > > without __atomic_load, we don't want to give the compiler
> > > > > > > > > > > freedom to optimise what should be an atomic load; it should
> > > > > > > > > > > not be arbitrarily moved around.
> > > > > > > > > >
> > > > > > > > > > Fixes: 39368ebfc6 ("ring: introduce
> C11 memory model
> > > > barrier
> > > > > > option")
> > > > > > > > > > Cc: [email protected]
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Gavin Hu
> <[email protected]>
> > > > > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > <[email protected]>
> > > > > > > > > > Reviewed-by: Steve Capper
> <[email protected]>
> > > > > > > > > > Reviewed-by: Ola Liljedahl
> <[email protected]>
> > > > > > > > > > ---
> > > > > > > > > > lib/librte_ring/rte_ring_c11_mem.h
> | 3 ++-
> > > > > > > > > > 1 file changed, 2 insertions(+), 1
> deletion(-)
> > > > > > > > > >
> > > > > > > > The read of ht->tail needs to be
> atomic, a non-atomic
> > > read
> > > > would not
> > > > > > be correct.
> > > > > > >
> > > > > > > That's a 32bit value load.
> > > > > > > AFAIK on all CPUs that we support it is
> an atomic operation.
> > > > > > > [Ola] But that the ordinary C load is
> translated to an atomic
> > > load
> > > > for the
> > > > > > target architecture is incidental.
> > > > > > >
> > > > > > > If the design requires an atomic load (which
> is the case here),
> > > we
> > > > > > > should use an atomic load on the language level.
> Then we can
> > > be
> > > > sure it will
> > > > > > always be translated to an atomic load for the
> target in question
> > > or
> > > > > > compilation will fail. We don't have to depend on
> assumptions.
> > > > > >
> > > > > > We all know that 32-bit load/store on the CPUs we support are
> > > > > > atomic.
> > > > > > If that weren't the case, DPDK would be broken in dozens of places.
> > > > > > So what is the point of pretending that "it might not be atomic" if
> > > > > > we do know for sure that it is?
> > > > > > I do understand that you want to use
> atomic_load(relaxed) here
> > > for
> > > > > > consistency, and to conform with C11 mem-model and
> I don't see
> > > any
> > > > harm in
> > > > > > that.
> > > > > We can continue to discuss the topic, it is a good discussion. But,
> > > > > as far as this patch is concerned, can I consider this as us having a
> > > > > consensus? The file rte_ring_c11_mem.h is specifically for the C11
> > > > > memory model, and I also do not see any harm in having code that
> > > > > completely conforms to the C11 memory model.
> > > >
> > > > Have you guys checked the output assembly with and without the atomic
> > > > load?
> > > > There is an extra "add" instruction with at least the code I have
> > > > checked.
> > > > I think the compiler is not smart enough to understand that it is dead
> > > > code for arm64.
> > > >
> > > > ➜ [~] $ aarch64-linux-gnu-gcc -v
> > > > Using built-in specs.
> > > > COLLECT_GCC=aarch64-linux-gnu-gcc
> > > > COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
> > > > Target: aarch64-linux-gnu
> > > > Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
> > > > --prefix=/usr --program-prefix=aarch64-linux-gnu-
> > > > --with-local-prefix=/usr/aarch64-linux-gnu
> > > > --with-sysroot=/usr/aarch64-linux-gnu
> > > > --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
> > > > --libexecdir=/usr/lib --target=aarch64-linux-gnu
> > > > --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
> > > > --enable-languages=c,c++ --enable-shared --enable-threads=posix
> > > > --with-system-zlib --with-isl --enable-__cxa_atexit
> > > > --disable-libunwind-exceptions --enable-clocale=gnu
> > > > --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
> > > > --enable-linker-build-id --enable-lto --enable-plugin
> > > > --enable-install-libiberty --with-linker-hash-style=gnu
> > > > --enable-gnu-indirect-function --disable-multilib --disable-werror
> > > > --enable-checking=release
> > > > Thread model: posix
> > > > gcc version 8.2.0 (GCC)
> > > >
> > > >
> > > > # build setup
> > > > make -j 8 config T=arm64-armv8a-linuxapp-gcc CROSS=aarch64-linux-gnu-
> > > > make -j 8 test-build CROSS=aarch64-linux-gnu-
> > > >
> > > > # generate asm
> > > > aarch64-linux-gnu-gdb -batch -ex 'file build/app/test' \
> > > >     -ex 'disassemble /rs bucket_enqueue_single'
> > > >
> > > > I have uploaded generated file for your convenience
> > > > with_atomic_load.txt(includes patch 1,2,3)
> > > > -----------------------
> > > > https://pastebin.com/SQ6w1yRu
> > > >
> > > > without_atomic_load.txt(includes patch 2,3)
> > > > -----------------------
> > > > https://pastebin.com/BpvnD0CA
> > > >
> > > >
> > > > without_atomic
> > > > --------------
> > > > 23              if (!single)
> > > >    0x000000000068d290 <+240>:  85 00 00 35  cbnz w5, 0x68d2a0 <bucket_enqueue_single+256>
> > > >    0x000000000068d294 <+244>:  82 04 40 b9  ldr w2, [x4, #4]
> > > >    0x000000000068d298 <+248>:  5f 00 01 6b  cmp w2, w1
> > > >    0x000000000068d29c <+252>:  21 01 00 54  b.ne 0x68d2c0 <bucket_enqueue_single+288>  // b.any
> > > >
> > > > 24              while (unlikely(ht->tail != old_val))
> > > > 25                      rte_pause();
> > > >
> > > >
> > > > with_atomic
> > > > -----------
> > > > 23              if (!single)
> > > >    0x000000000068ceb0 <+240>:  00 10 04 91  add x0, x0, #0x104
> > > >    0x000000000068ceb4 <+244>:  84 00 00 35  cbnz w4, 0x68cec4 <bucket_enqueue_single+260>
> > > >    0x000000000068ceb8 <+248>:  02 00 40 b9  ldr w2, [x0]
> > > >    0x000000000068cebc <+252>:  3f 00 02 6b  cmp w1, w2
> > > >    0x000000000068cec0 <+256>:  01 09 00 54  b.ne 0x68cfe0 <bucket_enqueue_single+544>  // b.any
> > > >
> > > > 24              while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > > >                         __ATOMIC_RELAXED)))
> > > >
> > > >
> > > > I don't want to block this series of patches due to this patch. Can we
> > > > re-spin one series with patches 2 and 3, and wait for patch 1 to
> > > > conclude?
> > > >
> > > > Thoughts?
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > > > But the argument that we shouldn't assume 32-bit load/store ops are
> > > > > > atomic sounds a bit flaky to me.
> > > > > > Konstantin
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > But there are no memory ordering
> requirements (with
> > > > > > > > regards to other loads and/or stores by
> this thread) so
> > > > relaxed
> > > > > > memory order is sufficient.
> > > > > > > > Another aspect of using
> __atomic_load_n() is that the
> > > > > > > compiler cannot "optimise" this load (e.g.
> combine, hoist etc), it
> > > has
> > > > to be
> > > > > > done as
> > > > > > > > specified in the source code which is
> also what we need
> > > here.
> > > > > > >
> > > > > > > I think Jerin's point is that rte_pause() acts here as a compiler
> > > > > > > barrier too, so no need to worry that the compiler would optimize
> > > > > > > out the loop.
> > > > > > > [Ola] Sorry, missed that. But the barrier behaviour of rte_pause()
> > > > > > > is not part of C11; it is essentially a hand-made feature to
> > > > > > > support the legacy multithreaded memory model (which uses explicit
> > > > > > > HW and compiler barriers). I'd prefer code using the C11 memory
> > > > > > > model not to depend on such legacy features.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Konstantin
> > > > > > >
> > > > > > > >
> > > > > > > > One point worth mentioning though is
> that this change is
> > > for
> > > > > > > the rte_ring_c11_mem.h file, not the legacy ring.
> It may be
> > > worth
> > > > persisting
> > > > > > > > with getting the C11 code right when
> people are less
> > > excited
> > > > about
> > > > > > sending a release out?
> > > > > > > >
> > > > > > > > We can explain that for C11 we would
> prefer to do loads
> > > and
> > > > stores
> > > > > > as per the C11 memory model. In the case of
> rte_ring, the code is
> > > > > > > > separated cleanly into C11 specific
> files anyway.
> > > > > > > >
> > > > > > > > I think reading ht->tail using
> __atomic_load_n() is the
> > > most
> > > > > > appropriate way. We show that ht->tail is used for
> > > synchronization,
> > > > we
> > > > > > > > acknowledge that ht->tail may be
> written by other
> > > threads
> > > > > > > without any other kind of synchronization (e.g.
> no lock involved)
> > > > and we
> > > > > > require
> > > > > > > > an atomic load (any write to ht->tail
> must also be atomic).
> > > > > > > >
> > > > > > > > Using volatile and explicit compiler
> (or processor)
> > > memory
> > > > barriers
> > > > > > (fences) is the legacy pre-C11 way of accomplishing
> these things.
> > > > > > > There's
> > > > > > > > a reason why C11/C++11 moved away from
> the old ways.
> > > > > > > > > >
> > > > > > > > > > __atomic_store_n(&ht->tail,
> new_val,
> > > > __ATOMIC_RELEASE);
> > > > > > > > > > --
> > > > > > > > > > 2.7.4
> > > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> > >
> >
> >
> >
>
>