-----Original Message-----
> Date: Sat, 6 Oct 2018 19:44:35 +0000
> From: Ola Liljedahl <ola.liljed...@arm.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>, "dev@dpdk.org" <dev@dpdk.org>
> CC: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.10.0.180812
>
> On 06/10/2018, 09:42, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:
>
> -----Original Message-----
> > Date: Fri, 5 Oct 2018 20:34:15 +0000
> > From: Ola Liljedahl <ola.liljed...@arm.com>
> > To: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>, Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > CC: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" <gavin...@arm.com>, "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > user-agent: Microsoft-MacOutlook/10.10.0.180812
> >
> > External Email
> >
> > On 05/10/2018, 22:29, "Honnappa Nagarahalli" <honnappa.nagaraha...@arm.com> wrote:
> >
> > > I doubt it is possible to benchmark with such precision so as to see the potential difference of one ADD instruction.
> > > Just changes in function alignment can affect performance by percents. And the natural variation when not using a 100% deterministic system is going to be a lot larger than one cycle per ring buffer operation.
> > >
> > > Some of the other patches are also for correctness (e.g. load-acquire of tail).
> >
> > The discussion is about this patch alone. Other patches are already Acked.
>
> So the benchmarking then makes zero sense.
>
> Why?
>
> Because the noise in benchmarking (due to e.g. non-deterministic systems, potential changes in function and loop alignment) will be much larger than one cycle of additional overhead per ring buffer operation. What will benchmarking tell us about the performance impact of this change that adds one ALU operation?
Yes, there will be noise in benchmarking. That is the reason why I checked only the generated assembly code of this patch, and found the LDR vs. LDR + ADD case. How much overhead it adds depends entirely on the microarchitecture: how many issue ports it has, constraints on executing LD/ST on specific issue ports, etc. In any case, plain LDR will be better than LDR + ADD on any microarchitecture.

> > > so while performance measurements may be interesting, we can't skip a bug fix just because it proves to decrease performance.
> >
> > IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.
>
> It's a fix for correctness. Per C++11 (and probably C11 as well, due to the shared memory model), we have undefined behaviour here. If the compiler detects UB, it is allowed to do anything. Current compilers might not exploit this, but future compilers could.
>
> > All I am saying is this: the code is not the same, and the compiler (the very latest gcc 8.2) is not smart enough to understand that it is dead code.
>
> What code is dead? The ADD instruction has a purpose: it is adding an offset (from the ring buffer start to the tail field) to a base pointer. It is merely (most likely) not the optimal code sequence for any ARM processor.
>
> > I think the moment any gcc __builtin comes in, the compiler adds a predefined template which has the additional "add" instruction.
>
> I suspect the add instruction is there because this atomic_load operates on a struct member at a non-zero offset and GCC's "template(s)" for atomic operations don't support register + immediate offset (because basically in the AArch64/A64 ISA, all atomic operations except atomic_load(RELAXED) and atomic_store(RELAXED) only support the base-register-without-offset addressing mode). I would be surprised if this minor instance of non-optimal code generation couldn't be corrected in the compiler.
>
> > I think in this specific case we ALL know that:
> > a) ht->tail will be 32 bit for the whole life of DPDK, and a 32-bit load will be atomic on all DPDK-supported processors;
> > b) the rte_pause() below it never permits compiler reordering, etc.
>
> For 32-bit ARM and 64-bit POWER (ppc64), the rte_pause() implementation looks like this (DPDK 18.08-rc0):
>
> static inline void rte_pause(void)
> {
> }
>
> How does calling this function prevent compiler optimisations, e.g. of the loop or of surrounding memory accesses? Is rte_pause() supposed to behave like some kind of compiler barrier? I can't see anything in the DPDK documentation for rte_pause() that claims this.

How about fixing rte_pause() then? Meaning: issue power-saving instructions on the archs where they are missing.

> > so why lose one cycle in the worst case? It is easy to lose one cycle and very difficult to get one back in the fast path.
>
> I suggest you read up on why undefined behaviour is bad in C. Have a chat with Andrew Pinski.
>
> Not depending on the compiler memory barrier in rte_pause() would allow the compiler to make optimisations (e.g. schedule loads earlier) that actually increase performance. Since the atomic load of ht->tail here has relaxed MO, the compiler is allowed to hoist later loads (and stores) ahead of it (and also to push down and/or merge with stores after the load of ht->tail). But the compiler (memory) barrier in (what is supposedly part of) rte_pause() prevents such optimisations (well, some accesses could be pushed down between the atomic load and the compiler barrier, but not further than that).
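To make the "fix rte_pause()" idea concrete, here is a minimal sketch of what per-arch implementations could look like. The suffixed function names are invented for illustration and this is not the actual DPDK source; the point is only that a spin-wait hint with a "memory" clobber would give both the power saving and the compiler-barrier behaviour discussed above.

    /* Sketch of per-arch rte_pause() variants; the _generic/_arch names
     * are illustrative assumptions, not actual DPDK identifiers. */

    /* Generic fallback, as quoted above from DPDK 18.08-rc0: an empty
     * function - neither a power-saving hint nor a compiler barrier. */
    static inline void rte_pause_generic(void)
    {
    }

    #if defined(__aarch64__)
    /* AArch64: YIELD is a spin-wait hint; the "memory" clobber also makes
     * this a compiler barrier, so a wait loop around it cannot be
     * optimised away even with a plain (non-atomic) load of ht->tail. */
    static inline void rte_pause_arch(void)
    {
            asm volatile("yield" ::: "memory");
    }
    #elif defined(__x86_64__) || defined(__i386__)
    #include <emmintrin.h>
    /* x86: the PAUSE instruction reduces power consumption and avoids
     * memory-order mis-speculation penalties in spin-wait loops. */
    static inline void rte_pause_arch(void)
    {
            _mm_pause();
    }
    #endif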
> A C compiler that supports C11 and beyond implements the C11 memory model. The compiler understands the memory model and can optimise memory accesses according to the semantics of the model and the ordering directives in the code. Atomic operations using ATOMIC_SEQ_CST, ATOMIC_ACQUIRE, ATOMIC_RELEASE (and ATOMIC_ACQ_REL; ignoring ATOMIC_CONSUME here) each allow and disallow certain kinds of movements and other optimisations of memory accesses (loads and stores; I assume prefetches are also included). Atomic operations with ATOMIC_RELAXED don't impose any ordering constraints, so they give maximum flexibility to the compiler. Using a compiler memory barrier (e.g. asm volatile ("":::"memory")) is a much more brutal way of constraining the compiler.

In the arm64 case, it will have __ATOMIC_RELAXED followed by the asm volatile ("":::"memory") of rte_pause(). I wouldn't have any issue if the generated code were the same as or better than the existing case, but that is not the case, right?

> > > -- Ola
> > >
> > > On 05/10/2018, 22:06, "Honnappa Nagarahalli" <honnappa.nagaraha...@arm.com> wrote:
> > >
> > > Hi Jerin,
> > > Thank you for generating the disassembly, that is really helpful. I agree with you that we have the option of moving parts 2 and 3 forward. I will let Gavin take the decision.
> > >
> > > I suggest that we run benchmarks on this patch alone and in combination with the other patches in the series. We have a few Arm machines and we will run on all of them along with x86. We can take a decision based on that.
> > >
> > > Would that be a way to move forward? I think this should address both your and Ola's concerns.
> > >
> > > I am open to other suggestions as well.
> > >
> > > Thank you,
> > > Honnappa
> > >
> > > > So you don't want to write the proper C11 code because the compiler generates one extra instruction that way?
> > > > You don't even know if that one extra instruction has any measurable impact on performance. E.g. it could be issued the cycle before, together with other instructions.
> > > >
> > > > We can complain to the compiler writers that the code generation for __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on ARM/A64). I think the problem is that the __atomic builtins only accept a base address without any offset, and this is possibly because e.g. load/store exclusive (LDX/STX), load-acquire (LDAR) and store-release (STLR) only accept a base register with no offset. So any offset has to be added before the actual "atomic" instruction, LDR in this case.
> > > > -- Ola
> > > >
> > > > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:
> > > >
> > > > -----Original Message-----
> > > > > Date: Fri, 5 Oct 2018 15:11:44 +0000
> > > > > From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > To: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, Ola Liljedahl <ola.liljed...@arm.com>, "Gavin Hu (Arm Technology China)" <gavin...@arm.com>, Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > > > > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> > > > > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> > > > >
> > > > > > > > Hi Jerin,
> > > > > > > >
> > > > > > > > Thanks for your review; inline comments from our internal discussions.
> > > > > > > >
> > > > > > > > BR. Gavin
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > > > > > > > > Sent: Saturday, September 29, 2018 6:49 PM
> > > > > > > > > To: Gavin Hu (Arm Technology China) <gavin...@arm.com>
> > > > > > > > > Cc: dev@dpdk.org; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Steve Capper <steve.cap...@arm.com>; Ola Liljedahl <ola.liljed...@arm.com>; nd <n...@arm.com>; sta...@dpdk.org
> > > > > > > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > > > > > > > > > From: Gavin Hu <gavin...@arm.com>
> > > > > > > > > > To: dev@dpdk.org
> > > > > > > > > > CC: gavin...@arm.com, honnappa.nagaraha...@arm.com, steve.cap...@arm.com, ola.liljed...@arm.com, jerin.ja...@caviumnetworks.com, n...@arm.com, sta...@dpdk.org
> > > > > > > > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > > > > > X-Mailer: git-send-email 2.7.4
> > > > > > > > > >
> > > > > > > > > > External Email
> > > > > > > > > >
> > > > > > > > > > In update_tail, read ht->tail using __atomic_load. Although the compiler currently seems to be doing the right thing even without __atomic_load, we don't want to give the compiler the freedom to optimise what should be an atomic load; it should not be arbitrarily moved around.
> > > > > > > > > >
> > > > > > > > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > > > > > > > > > Cc: sta...@dpdk.org
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Gavin Hu <gavin...@arm.com>
> > > > > > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > > > > > > Reviewed-by: Steve Capper <steve.cap...@arm.com>
> > > > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
> > > > > > > > > > ---
> > > > > > > > > > lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> > > > > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > The read of ht->tail needs to be atomic; a non-atomic read would not be correct.
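For reference, the change under discussion boils down to the wait loop in update_tail() in rte_ring_c11_mem.h. Below is a condensed sketch reconstructed from the source lines visible in the disassembly later in this thread; the struct layout and the helpers are simplified stand-ins, not the verbatim DPDK code.

    #include <stdint.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Simplified stand-in for the DPDK ring head/tail structure. */
    struct rte_ring_headtail {
            volatile uint32_t head;
            volatile uint32_t tail;
    };

    static inline void rte_pause(void) { } /* empty on some archs, see above */

    static inline void
    update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
                uint32_t new_val, int single)
    {
            if (!single) {
                    /*
                     * Before the patch: a plain read, which today happens
                     * to compile to a single 32-bit load but is formally a
                     * data race (undefined behaviour) under the C11 model:
                     *
                     *     while (unlikely(ht->tail != old_val))
                     *             rte_pause();
                     *
                     * After the patch: an explicit relaxed atomic load,
                     * which documents and guarantees the atomicity.
                     */
                    while (unlikely(old_val !=
                                    __atomic_load_n(&ht->tail,
                                                    __ATOMIC_RELAXED)))
                            rte_pause();
            }
            /* Release store pairs with the load-acquire on the reader side. */
            __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
    }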
> > > > > > > That's a 32-bit value load. AFAIK, on all CPUs that we support it is an atomic operation.
> > > > > > > [Ola] But the fact that the ordinary C load is translated to an atomic load for the target architecture is incidental.
> > > > > > > If the design requires an atomic load (which is the case here), we should use an atomic load at the language level. Then we can be sure it will always be translated to an atomic load for the target in question, or compilation will fail. We don't have to depend on assumptions.
> > > > > >
> > > > > > We all know that 32-bit loads/stores on the CPUs we support are atomic. If that weren't the case, DPDK would be broken in dozens of places. So what is the point of pretending that "it might not be atomic" when we do know for sure that it is?
> > > > > > I do understand that you want to use atomic_load(relaxed) here for consistency and to conform with the C11 memory model, and I don't see any harm in that.
> > > > >
> > > > > We can continue to discuss the topic, it is a good discussion. But, as far as this patch is concerned, can I consider this as us having a consensus? The file rte_ring_c11_mem.h is specifically for the C11 memory model, and I also do not see any harm in having code that completely conforms to the C11 memory model.
> > > >
> > > > Have you guys checked the output assembly with and without the atomic load? There is an extra "add" instruction, at least with the code I have checked. I think the compiler is not smart enough to understand it is dead code for arm64.
> > > >
> > > > ➜ [~] $ aarch64-linux-gnu-gcc -v
> > > > Using built-in specs.
> > > > COLLECT_GCC=aarch64-linux-gnu-gcc
> > > > COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
> > > > Target: aarch64-linux-gnu
> > > > Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure --prefix=/usr --program-prefix=aarch64-linux-gnu- --with-local-prefix=/usr/aarch64-linux-gnu --with-sysroot=/usr/aarch64-linux-gnu --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib --libexecdir=/usr/lib --target=aarch64-linux-gnu --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls --enable-languages=c,c++ --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
> > > > Thread model: posix
> > > > gcc version 8.2.0 (GCC)
> > > >
> > > > # build setup
> > > > make -j 8 config T=arm64-armv8a-linuxapp-gcc CROSS=aarch64-linux-gnu-
> > > > make -j 8 test-build CROSS=aarch64-linux-gnu-
> > > >
> > > > # generate asm
> > > > aarch64-linux-gnu-gdb -batch -ex 'file build/app/test' -ex 'disassemble /rs bucket_enqueue_single'
> > > >
> > > > I have uploaded the generated files for your convenience.
> > > >
> > > > with_atomic_load.txt (includes patches 1, 2, 3)
> > > > -----------------------
> > > > https://pastebin.com/SQ6w1yRu
> > > >
> > > > without_atomic_load.txt (includes patches 2, 3)
> > > > -----------------------
> > > > https://pastebin.com/BpvnD0CA
> > > >
> > > > without_atomic
> > > > --------------
> > > > 23          if (!single)
> > > >    0x000000000068d290 <+240>:  85 00 00 35  cbnz  w5, 0x68d2a0 <bucket_enqueue_single+256>
> > > >    0x000000000068d294 <+244>:  82 04 40 b9  ldr   w2, [x4, #4]
> > > >    0x000000000068d298 <+248>:  5f 00 01 6b  cmp   w2, w1
> > > >    0x000000000068d29c <+252>:  21 01 00 54  b.ne  0x68d2c0 <bucket_enqueue_single+288>  // b.any
> > > >
> > > > 24              while (unlikely(ht->tail != old_val))
> > > > 25                      rte_pause();
> > > >
> > > > with_atomic
> > > > -----------
> > > > 23          if (!single)
> > > >    0x000000000068ceb0 <+240>:  00 10 04 91  add   x0, x0, #0x104
> > > >    0x000000000068ceb4 <+244>:  84 00 00 35  cbnz  w4, 0x68cec4 <bucket_enqueue_single+260>
> > > >    0x000000000068ceb8 <+248>:  02 00 40 b9  ldr   w2, [x0]
> > > >    0x000000000068cebc <+252>:  3f 00 02 6b  cmp   w1, w2
> > > >    0x000000000068cec0 <+256>:  01 09 00 54  b.ne  0x68cfe0 <bucket_enqueue_single+544>  // b.any
> > > >
> > > > 24              while (unlikely(old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED)))
> > > >
> > > > I don't want to block this series of patches due to this patch. Can we respin the series with patches 2 and 3, and wait for patch 1 to conclude?
> > > >
> > > > Thoughts?
> > > >
> > > > > > But the argument that we shouldn't assume 32-bit load/store ops are atomic sounds a bit flaky to me.
> > > > > > Konstantin
> > > > > >
> > > > > > > > But there are no memory ordering requirements (with regard to other loads and/or stores by this thread), so relaxed memory order is sufficient.
> > > > > > > > Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist etc); it has to be done as specified in the source code, which is also what we need here.
> > > > > > >
> > > > > > > I think Jerin's point is that rte_pause() acts here as a compiler barrier too, so no need to worry that the compiler would optimise out the loop.
> > > > > > > [Ola] Sorry, missed that. But the barrier behaviour of rte_pause() is not part of C11; it is essentially a hand-made feature to support the legacy multithreaded memory model (which uses explicit HW and compiler barriers). I'd prefer code using the C11 memory model not to depend on such legacy features.
> > > > > > >
> > > > > > > Konstantin
> > > > > > >
> > > > > > > > One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting with getting the C11 code right when people are less excited about sending a release out?
> > > > > > > >
> > > > > > > > We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is separated cleanly into C11-specific files anyway.
> > > > > > > >
> > > > > > > > I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved), and we require an atomic load (any write to ht->tail must also be atomic).
> > > > > > > >
> > > > > > > > Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things. There's a reason why C11/C++11 moved away from the old ways.
> > > > > > > >
> > > > > > > > > >                 __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > > > > > > --
> > > > > > > > > > 2.7.4
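As a footnote to the disassembly comparison above, the extra ADD can be reproduced in isolation. This is a minimal standalone sketch (the struct and function names are invented for the example); compiled for AArch64 with GCC 8.2 at -O2, it is expected to show the same LDR vs. ADD + LDR difference reported in this thread.

    #include <stdint.h>

    struct headtail {
            uint32_t head;
            uint32_t tail;  /* non-zero offset from the struct base */
    };

    /* Plain load: the offset folds into the addressing mode,
     * e.g. "ldr w0, [x0, #4]". */
    uint32_t plain_read(const struct headtail *ht)
    {
            return ht->tail;
    }

    /* Relaxed atomic load: GCC 8.2's expansion reportedly uses a bare
     * base register, so "add x0, x0, #4" is emitted before
     * "ldr w0, [x0]", matching the with_atomic disassembly above. */
    uint32_t atomic_read(const struct headtail *ht)
    {
            return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
    }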