-----Original Message----- > Date: Mon, 8 Oct 2018 10:25:45 +0000 > From: Ola Liljedahl <ola.liljed...@arm.com> > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, > "sta...@dpdk.org" <sta...@dpdk.org> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote: > > -----Original Message----- > > Date: Mon, 8 Oct 2018 09:22:05 +0000 > > From: Ola Liljedahl <ola.liljed...@arm.com> > > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" > > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd > <n...@arm.com>, > > "sta...@dpdk.org" <sta...@dpdk.org> > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load > > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > > External Email > > > > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> > wrote: > > > > -----Original Message----- > > > Date: Sun, 7 Oct 2018 20:44:54 +0000 > > > From: Ola Liljedahl <ola.liljed...@arm.com> > > > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > > > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > > > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" > > > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd > <n...@arm.com>, > > > "sta...@dpdk.org" <sta...@dpdk.org> > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load > > > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > > > > > > > Could you please fix the email client for inline reply. > > Sorry that doesn't seem to be possible with Outlook for Mac 16 or > Office365. The official Office365/Outlook > > documentation doesn't match the actual user interface... > > > > > > > > https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html > > > > > > > > > > On 07/10/2018, 06:03, "Jerin Jacob" > <jerin.ja...@caviumnetworks.com> wrote: > > > > > > In arm64 case, it will have ATOMIC_RELAXED followed by asm > volatile ("":::"memory") of rte_pause(). > > > I would n't have any issue, if the generated code code is > same or better than the exiting case. but it not the case, Right? > > > The existing case is actually not interesting (IMO) as it exposes > undefined behaviour which allows the compiler to do anything. But you seem to > be satisfied with "works for me, right here right now". I think the cost of > avoiding undefined behaviour is acceptable (actually I don't think it even > will be noticeable). > > > > I am not convinced because of use of volatile in head and tail > indexes. > > For me that brings the defined behavior. > > As long as you don't mix in C11 atomic accesses (just use "plain" > accesses to volatile objects), > > it is AFAIK defined behaviour (but not necessarily using atomic loads > and stores). But I quoted > > the C11 spec where it explicitly mentions that mixing atomic and > non-atomic accesses to the same > > object is undefined behaviour. Don't argue with me, argue with the C11 > spec. > > If you want to disobey the spec, this should at least be called out for > in the code with a comment. > > That's boils down only one question, should we follow C11 spec? Why not > only take load > acquire and store release semantics only just like Linux kernel and > FreeBSD. > And introduce even more undefined behaviour?
Yes. The all world(Linux and Freebsd) is running with undefined behavior and still it runs. > > Does not look like C11 memory model is super efficient in term of gcc > implementation. > You are making a chicken out of a feather. > > I think this "problem" with one additional ADD instruction will only concern > __atomic_load_n(__ATOMIC_RELAXED) and __atomic_store_n(__ATOMIC_RELAXED) > because the compiler separates the address generation (add offset of struct > member) from the load or store itself. For other atomic operations and memory > orderings (e.g. __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction > will be included anyway (as long as we access a non-first struct member) > because e.g. LDAR only accepts a base register with no offset. > > I suggest minimising the imposed memory orderings can have a much larger > (positive) effect on performance compared to avoiding one ADD instruction > (memory accesses are much slower than CPU ALU instructions). > Using C11 memory model and identifying exactly which objects are used for > synchronisation and whether (any) updates to shared memory are acquired or > released (no updates to shared memory means relaxed order can be used) will > provide maximum freedom to the compiler and hardware to get the best result. No more comments on this. It is not data driven. > > The FreeBSD and DPDK ring buffers show some fundamental misunderstandings > here. Instead excessive orderings and explicit barriers have been used as > band-aids, with unknown effects on performance. > > > > > > > > That the reason why I shared > > the generated assembly code. If you think other way, Pick any > compiler > > and see generated output. > > This is what one compiler for one architecture generates today. These > things change. Other things > > that used to work or worked for some specific architecture has stopped > working in newer versions of > > the compiler. > > > > > > And > > > > Freebsd implementation of ring buffer(Which DPDK derived from), > Don't have > > such logic, See > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108 > > It looks like FreeBSD uses some kind of C11 atomic memory > model-inspired API although I don't see > > exactly how e.g. atomic_store_rel_int() is implemented. The code also > mixes in explicit barriers > > so definitively not pure C11 memory model usage. And finally, it > doesn't establish the proper > > load-acquire/store-release relationships (e.g. store-release cons_tail > requires a load-acquire cons_tail, > > same for prod_tail). > > > > "* multi-producer safe lock-free ring buffer enqueue" > > The comment is also wrong. This design is not lock-free, how could it > be when there is spinning > > (waiting) for other threads in the code? If a thread must wait for > other threads, then by definition > > the design is blocking. > > > > So you are saying that because FreeBSD is doing it wrong, DPDK can also > do it wrong? > > > > > > See below too. > > > > > > > > Skipping the compiler memory barrier in rte_pause() potentially > allows for optimisations that provide much more benefit, e.g. hiding some > cache miss latency for later loads. The DPDK ring buffer implementation is > defined so to enable inlining of enqueue/dequeue functions into the caller, > any code could immediately follow these calls. > > > > > > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x > > > Programming languages — C > > > > > > 5.1.2.4 > > > 4 Two expression evaluations conflict if one of them modifies a > memory location and the other one reads or modifies the same memory location. > > > > > > 25 The execution of a program contains a data race if it contains > two conflicting actions in different threads, at least one of which is not > atomic, and neither happens before the other. Any such data race results in > undefined behavior. > > > > IMO, Both condition will satisfy if the variable is volatile and > 32bit read will atomic > > for 32b and 64b machines. If not, the problem persist for generic > case > > as well(lib/librte_ring/rte_ring_generic.h) > > The read from a volatile object is not an atomic access per the C11 > spec. It just happens to > > be translated to an instruction (on x86-64 and AArch64/A64) that > implements an atomic load. > > I don't think any compiler would change this code generation and > suddenly generate some > > non-atomic load instruction for a program that *only* uses volatile to > do "atomic" accesses. > > But a future compiler could detect the mix of atomic and non-atomic > accesses and mark this > > expression as causing undefined behaviour and that would have > consequences for code generation. > > > > > > I agree with you on C11 memory model semantics usage. The reason > why I > > propose name for the file as rte_ring_c11_mem.h as DPDK it self did > not > > had definitions for load acquire and store release semantics. > > I was looking for taking load acquire and store release semantics > > from C11 instead of creating new API like Linux kernel for > FreeBSD(APIs > > like atomic_load_acq_32(), atomic_store_rel_32()). If the file > name is your > > concern then we could create new abstractions as well. That would > help > > exiting KNI problem as well. > > I appreciate your embrace of the C11 memory model. I think it is better > for describing > > (both to the compiler and to humans) which and how objects are used for > synchronisation. > > > > However, I don't think an API as you suggest (and others have suggested > before, e.g. as > > done in ODP) is a good idea. There is an infinite amount of possible > base types, an > > increasing number of operations and a bunch of different memory > orderings, a "complete" > > API would be very large and difficult to test, and most members of the > API would never be used. > > GCC and Clang both support the __atomic intrinsics. This API avoids the > problems I > > described above. Or we could use the official C11 syntax (stdatomic.h). > But then we > > have the problem with using pre-C11 compilers... > > I have no objection, if everyone agrees to move C11 memory model > with __atomic intrinsics. But if we need to keep both have then > atomic_load_acq_32() kind of API make sense. > > > > > > > > > > > > I think, currently it mixed usage because, the same variable > declaration > > used for C11 vs non C11 usage.Ideally we wont need "volatile" for > C11 > > case. Either we need to change only to C11 mode OR have APIs for > > atomic_load_acq_() and atomic_store_rel_() to allow both models like > > Linux kernel and FreeBSD. > > > > > > > > -- Ola > > > > > > > > > > > > > > >