> I doubt it is possible to benchmark with such precision as to see the
> potential difference of one ADD instruction.
> Just changes in function alignment can affect performance by several percent,
> and the natural variation when not using a 100% deterministic system is going
> to be a lot larger than one cycle per ring buffer operation.
>
> Some of the other patches are also for correctness (e.g. load-acquire of tail)

The discussion is about this patch alone. The other patches are already Acked.
> so while performance measurements may be interesting, we can't skip a bug
> fix just because it proves to decrease performance.

IMO, this patch is not a bug fix - in the sense that it does not fix any
failure seen with the current code.

> -- Ola
>
> On 05/10/2018, 22:06, "Honnappa Nagarahalli"
> <honnappa.nagaraha...@arm.com> wrote:
>
> Hi Jerin,
> Thank you for generating the disassembly, that is really helpful. I agree
> with you that we have the option of moving parts 2 and 3 forward. I will
> let Gavin take a decision.
>
> I suggest that we run benchmarks on this patch alone and in combination
> with the other patches in the series. We have a few Arm machines and we
> will run on all of them along with x86, and take a decision based on that.
>
> Would that be a way to move forward? I think this should address both
> your and Ola's concerns.
>
> I am open to other suggestions as well.
>
> Thank you,
> Honnappa
>
> > So you don't want to write the proper C11 code because the compiler
> > generates one extra instruction that way?
> > You don't even know if that one extra instruction has any measurable
> > impact on performance. E.g. it could be issued the cycle before, together
> > with other instructions.
> >
> > We can complain to the compiler writers that the code generation for
> > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on ARM/A64).
> > I think the problem is that the __atomic builtins only accept a base
> > address without any offset, and this is possibly because e.g. load/store
> > exclusive (LDX/STX), load-acquire (LDAR) and store-release (STLR) only
> > accept a base register with no offset. So any offset has to be added
> > before the actual "atomic" instruction, LDR in this case.
> >
> > -- Ola
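
For illustration, Ola's codegen point can be reproduced with a minimal
standalone example. The struct and function names below are hypothetical
stand-ins for the ring's head/tail structure rather than the actual DPDK
definitions, and the assembly mentioned in the comments is what the GCC 8.2
AArch64 build discussed later in the thread was observed to emit, not a
guarantee:

    #include <stdint.h>

    /* Hypothetical stand-in for the ring head/tail structure: 'tail' sits
     * at a non-zero offset from the pointer that gets passed around. */
    struct headtail {
        uint32_t head;
        uint32_t tail;
    };

    uint32_t read_tail_plain(struct headtail *ht)
    {
        /* The member offset can be folded into the load itself,
         * e.g. "ldr w0, [x0, #4]". */
        return ht->tail;
    }

    uint32_t read_tail_atomic(struct headtail *ht)
    {
        /* The builtin only takes a bare pointer, so &ht->tail is
         * materialised first; in the disassembly later in the thread this
         * shows up as a separate "add" followed by a zero-offset "ldr". */
        return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
    }

In both cases the load itself is a single 32-bit ldr; the difference is only
in how the address is formed.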
> > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.ja...@caviumnetworks.com>
> > wrote:
> >
> > -----Original Message-----
> > > Date: Fri, 5 Oct 2018 15:11:44 +0000
> > > From: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > To: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, Ola Liljedahl
> > >  <ola.liljed...@arm.com>, "Gavin Hu (Arm Technology China)"
> > >  <gavin...@arm.com>, Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <steve.cap...@arm.com>,
> > >  nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> > > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> > >
> > > > > > Hi Jerin,
> > > > > > Thanks for your review, inline comments from our internal
> > > > > > discussions.
> > > > > >
> > > > > > BR. Gavin
> > > > > >
> > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> > > > > > > Sent: Saturday, September 29, 2018 6:49 PM
> > > > > > > To: Gavin Hu (Arm Technology China) <gavin...@arm.com>
> > > > > > > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > > > > > >  <honnappa.nagaraha...@arm.com>; Steve Capper
> > > > > > >  <steve.cap...@arm.com>; Ola Liljedahl <ola.liljed...@arm.com>;
> > > > > > >  nd <n...@arm.com>; sta...@dpdk.org
> > > > > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > > > > > > > From: Gavin Hu <gavin...@arm.com>
> > > > > > > > To: dev@dpdk.org
> > > > > > > > CC: gavin...@arm.com, honnappa.nagaraha...@arm.com,
> > > > > > > >  steve.cap...@arm.com, ola.liljed...@arm.com,
> > > > > > > >  jerin.ja...@caviumnetworks.com, n...@arm.com, sta...@dpdk.org
> > > > > > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > > > X-Mailer: git-send-email 2.7.4
> > > > > > > >
> > > > > > > > External Email
> > > > > > > >
> > > > > > > > In update_tail, read ht->tail using __atomic_load. Although
> > > > > > > > the compiler currently seems to be doing the right thing even
> > > > > > > > without __atomic_load, we don't want to give the compiler
> > > > > > > > freedom to optimise what should be an atomic load; it should
> > > > > > > > not be arbitrarily moved around.
> > > > > > > >
> > > > > > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > > > > > > > Cc: sta...@dpdk.org
> > > > > > > >
> > > > > > > > Signed-off-by: Gavin Hu <gavin...@arm.com>
> > > > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> > > > > > > > Reviewed-by: Steve Capper <steve.cap...@arm.com>
> > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljed...@arm.com>
> > > > > > > > ---
> > > > > > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> > > > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > The read of ht->tail needs to be atomic, a non-atomic read would
> > > > > > not be correct.
> > > > >
> > > > > That's a 32bit value load.
> > > > > AFAIK on all CPUs that we support it is an atomic operation.
> > > > >
> > > > > [Ola] But that the ordinary C load is translated to an atomic load
> > > > > for the target architecture is incidental.
> > > > > If the design requires an atomic load (which is the case here), we
> > > > > should use an atomic load at the language level. Then we can be sure
> > > > > it will always be translated to an atomic load for the target in
> > > > > question, or compilation will fail. We don't have to depend on
> > > > > assumptions.
> > > >
> > > > We all know that 32bit load/store on the CPUs we support are atomic.
> > > > If that weren't the case, DPDK would be broken in dozens of places.
> > > > So what is the point of pretending that "it might not be atomic" when
> > > > we know for sure that it is?
> > > > I do understand that you want to use atomic_load(relaxed) here for
> > > > consistency, and to conform with the C11 memory model, and I don't
> > > > see any harm in that.
> > >
> > > We can continue to discuss the topic, it is a good discussion. But, as
> > > far as this patch is concerned, can I consider this as us having a
> > > consensus? The file rte_ring_c11_mem.h is specifically for the C11
> > > memory model and I also do not see any harm in having code that
> > > completely conforms to the C11 memory model.
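
As a side note on the "we know 32-bit loads are atomic" point, the assumption
can also be made checkable at build time instead of being left implicit. This
is only an illustration of Ola's "we don't have to depend on assumptions"
argument, not something this patch does or needs:

    #include <stdint.h>

    /* If a relaxed 32-bit atomic access ever stopped being a plain
     * lock-free load/store on some supported target, this would fail at
     * build time instead of silently falling back to a libatomic call. */
    _Static_assert(__atomic_always_lock_free(sizeof(uint32_t), 0),
                   "32-bit loads/stores must be lock-free on this target");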
> > Have you guys checked the output assembly with and without the atomic
> > load? There is an extra "add" instruction, at least with the code I have
> > checked. I think the compiler is not smart enough to understand it is
> > dead code for arm64.
> >
> > ➜ [~] $ aarch64-linux-gnu-gcc -v
> > Using built-in specs.
> > COLLECT_GCC=aarch64-linux-gnu-gcc
> > COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
> > Target: aarch64-linux-gnu
> > Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
> > --prefix=/usr --program-prefix=aarch64-linux-gnu-
> > --with-local-prefix=/usr/aarch64-linux-gnu
> > --with-sysroot=/usr/aarch64-linux-gnu
> > --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
> > --libexecdir=/usr/lib --target=aarch64-linux-gnu
> > --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
> > --enable-languages=c,c++ --enable-shared --enable-threads=posix
> > --with-system-zlib --with-isl --enable-__cxa_atexit
> > --disable-libunwind-exceptions --enable-clocale=gnu
> > --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
> > --enable-linker-build-id --enable-lto --enable-plugin
> > --enable-install-libiberty --with-linker-hash-style=gnu
> > --enable-gnu-indirect-function --disable-multilib --disable-werror
> > --enable-checking=release
> > Thread model: posix
> > gcc version 8.2.0 (GCC)
> >
> > # build setup
> > make -j 8 config T=arm64-armv8a-linuxapp-gcc CROSS=aarch64-linux-gnu-
> > make -j 8 test-build CROSS=aarch64-linux-gnu-
> >
> > # generate asm
> > aarch64-linux-gnu-gdb -batch -ex 'file build/app/test' -ex 'disassemble /rs bucket_enqueue_single'
> >
> > I have uploaded the generated files for your convenience.
> >
> > with_atomic_load.txt (includes patches 1, 2 and 3)
> > -----------------------
> > https://pastebin.com/SQ6w1yRu
> >
> > without_atomic_load.txt (includes patches 2 and 3)
> > -----------------------
> > https://pastebin.com/BpvnD0CA
> >
> > without_atomic
> > --------------
> > 23              if (!single)
> >    0x000000000068d290 <+240>:  85 00 00 35  cbnz  w5, 0x68d2a0 <bucket_enqueue_single+256>
> >    0x000000000068d294 <+244>:  82 04 40 b9  ldr   w2, [x4, #4]
> >    0x000000000068d298 <+248>:  5f 00 01 6b  cmp   w2, w1
> >    0x000000000068d29c <+252>:  21 01 00 54  b.ne  0x68d2c0 <bucket_enqueue_single+288>  // b.any
> >
> > 24              while (unlikely(ht->tail != old_val))
> > 25                      rte_pause();
> >
> > with_atomic
> > -----------
> > 23              if (!single)
> >    0x000000000068ceb0 <+240>:  00 10 04 91  add   x0, x0, #0x104
> >    0x000000000068ceb4 <+244>:  84 00 00 35  cbnz  w4, 0x68cec4 <bucket_enqueue_single+260>
> >    0x000000000068ceb8 <+248>:  02 00 40 b9  ldr   w2, [x0]
> >    0x000000000068cebc <+252>:  3f 00 02 6b  cmp   w1, w2
> >    0x000000000068cec0 <+256>:  01 09 00 54  b.ne  0x68cfe0 <bucket_enqueue_single+544>  // b.any
> >
> > 24              while (unlikely(old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED)))
> >
> > I don't want to block this series of patches due to this patch. Can we
> > re-spin the series with patches 2 and 3, and wait for patch 1 to
> > conclude?
> >
> > Thoughts?
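
To tie the disassembly back to the source, here is a hedged sketch of the wait
loop in update_tail() before and after this patch, with simplified types and
without unlikely()/rte_pause(); the authoritative code is in
lib/librte_ring/rte_ring_c11_mem.h:

    #include <stdint.h>

    /* Simplified stand-in for the ring head/tail structure. */
    struct headtail {
        uint32_t head;
        uint32_t tail;
    };

    /* Before the patch: a plain read of ht->tail. Current compilers emit a
     * single atomic 32-bit load for this, but the language itself does not
     * require them to. */
    void wait_tail_plain(struct headtail *ht, uint32_t old_val)
    {
        while (ht->tail != old_val)
            ; /* rte_pause() in the real code */
    }

    /* After the patch: the load is explicitly atomic with relaxed ordering,
     * so it cannot be torn and, as argued in the thread, the compiler is
     * not free to move or merge it, while no ordering beyond atomicity is
     * requested. */
    void wait_tail_atomic(struct headtail *ht, uint32_t old_val)
    {
        while (old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED))
            ; /* rte_pause() in the real code */
    }

Either way the tail is read with a single 32-bit ldr on AArch64; the add seen
above only materialises &ht->tail.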
> > > > But the argument that we shouldn't assume 32bit load/store ops are
> > > > atomic sounds a bit flaky to me.
> > > > Konstantin
> > > >
> > > > > > But there are no memory ordering requirements (with regards to
> > > > > > other loads and/or stores by this thread), so relaxed memory
> > > > > > order is sufficient.
> > > > > > Another aspect of using __atomic_load_n() is that the compiler
> > > > > > cannot "optimise" this load (e.g. combine, hoist etc); it has to
> > > > > > be done as specified in the source code, which is also what we
> > > > > > need here.
> > > > >
> > > > > I think Jerin's point is that rte_pause() acts here as a compiler
> > > > > barrier too, so no need to worry that the compiler would optimize
> > > > > out the loop.
> > > > >
> > > > > [Ola] Sorry, missed that. But the barrier behaviour of rte_pause()
> > > > > is not part of C11; it is essentially a hand-made feature to
> > > > > support the legacy multithreaded memory model (which uses explicit
> > > > > HW and compiler barriers). I'd prefer code using the C11 memory
> > > > > model not to depend on such legacy features.
> > > > >
> > > > > Konstantin
> > > > >
> > > > > > One point worth mentioning though is that this change is for the
> > > > > > rte_ring_c11_mem.h file, not the legacy ring. It may be worth
> > > > > > persisting with getting the C11 code right when people are less
> > > > > > excited about sending a release out?
> > > > > >
> > > > > > We can explain that for C11 we would prefer to do loads and
> > > > > > stores as per the C11 memory model. In the case of rte_ring, the
> > > > > > code is separated cleanly into C11-specific files anyway.
> > > > > >
> > > > > > I think reading ht->tail using __atomic_load_n() is the most
> > > > > > appropriate way. We show that ht->tail is used for
> > > > > > synchronization, we acknowledge that ht->tail may be written by
> > > > > > other threads without any other kind of synchronization (e.g. no
> > > > > > lock involved) and we require an atomic load (any write to
> > > > > > ht->tail must also be atomic).
> > > > > >
> > > > > > Using volatile and explicit compiler (or processor) memory
> > > > > > barriers (fences) is the legacy pre-C11 way of accomplishing
> > > > > > these things. There's a reason why C11/C++11 moved away from the
> > > > > > old ways.
> > > > > >
> > > > > > > > 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > > > > --
> > > > > > > > 2.7.4
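
Putting the two ends of the discussion together, the hand-off pattern at stake
looks roughly like this in C11 terms. The names are simplified stand-ins, and
the consumer-side load-acquire corresponds to the "load-acquire of tail"
change handled by the other patches in the series, so treat this as a sketch
rather than the actual rte_ring code:

    #include <stdint.h>

    struct headtail {
        uint32_t head;
        uint32_t tail;
    };

    /* Producer side (sketch of update_tail): wait for earlier producers
     * with a relaxed load, then publish the new tail with a release store
     * so that the slot contents written before it become visible to
     * whoever observes the new tail value. */
    void producer_update_tail(struct headtail *ht, uint32_t old_val,
                              uint32_t new_val, int single)
    {
        if (!single)
            while (old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED))
                ; /* rte_pause() in the real code */

        __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
    }

    /* Consumer side: a load-acquire of the producer's tail pairs with the
     * release store above. */
    uint32_t consumer_read_tail(struct headtail *prod_ht)
    {
        return __atomic_load_n(&prod_ht->tail, __ATOMIC_ACQUIRE);
    }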