-----Original Message----- > Date: Mon, 8 Oct 2018 11:21:42 +0000 > From: Ola Liljedahl <ola.liljed...@arm.com> > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>, > "sta...@dpdk.org" <sta...@dpdk.org> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > On 08/10/2018, 12:47, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote: > > -----Original Message----- > > Date: Mon, 8 Oct 2018 10:25:45 +0000 > > From: Ola Liljedahl <ola.liljed...@arm.com> > > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" > > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd > <n...@arm.com>, > > "sta...@dpdk.org" <sta...@dpdk.org> > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load > > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > > > > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> > wrote: > > > > -----Original Message----- > > > Date: Mon, 8 Oct 2018 09:22:05 +0000 > > > From: Ola Liljedahl <ola.liljed...@arm.com> > > > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > > > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > > > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)" > > > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd > <n...@arm.com>, > > > "sta...@dpdk.org" <sta...@dpdk.org> > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load > > > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > > > > External Email > > > > > > On 08/10/2018, 08:06, "Jerin Jacob" > <jerin.ja...@caviumnetworks.com> wrote: > > > > > > -----Original Message----- > > > > Date: Sun, 7 Oct 2018 20:44:54 +0000 > > > > From: Ola Liljedahl <ola.liljed...@arm.com> > > > > To: Jerin Jacob <jerin.ja...@caviumnetworks.com> > > > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli > > > > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin" > > > > <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology > China)" > > > > <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, > nd <n...@arm.com>, > > > > "sta...@dpdk.org" <sta...@dpdk.org> > > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic > load > > > > user-agent: Microsoft-MacOutlook/10.11.0.180909 > > > > > > > > > > > > > Could you please fix the email client for inline reply. > > > Sorry that doesn't seem to be possible with Outlook for Mac 16 or > Office365. The official Office365/Outlook > > > documentation doesn't match the actual user interface... > > > > > > > > > > > > > https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html > > > > > > > > > > > > > > On 07/10/2018, 06:03, "Jerin Jacob" > <jerin.ja...@caviumnetworks.com> wrote: > > > > > > > > In arm64 case, it will have ATOMIC_RELAXED followed by > asm volatile ("":::"memory") of rte_pause(). > > > > I would n't have any issue, if the generated code code > is same or better than the exiting case. but it not the case, Right? > > > > The existing case is actually not interesting (IMO) as it > exposes undefined behaviour which allows the compiler to do anything. But you > seem to be satisfied with "works for me, right here right now". I think the > cost of avoiding undefined behaviour is acceptable (actually I don't think it > even will be noticeable). > > > > > > I am not convinced because of use of volatile in head and > tail indexes. > > > For me that brings the defined behavior. > > > As long as you don't mix in C11 atomic accesses (just use "plain" > accesses to volatile objects), > > > it is AFAIK defined behaviour (but not necessarily using atomic > loads and stores). But I quoted > > > the C11 spec where it explicitly mentions that mixing atomic and > non-atomic accesses to the same > > > object is undefined behaviour. Don't argue with me, argue with > the C11 spec. > > > If you want to disobey the spec, this should at least be called > out for in the code with a comment. > > > > That's boils down only one question, should we follow C11 spec? Why > not only take load > > acquire and store release semantics only just like Linux kernel and > FreeBSD. > > And introduce even more undefined behaviour? > > > Yes. The all world(Linux and Freebsd) is running with undefined behavior > and still it runs. > Are you saying that Linux kernel is using GCC __atomic or C11 > _Atomic/stdatomic.h features?
No. Since your email client is removing the context, I will clarify. I asked: That's boils down to only one question, should we follow C11 spec? Why not only take load acquire and store release semantics only just like Linux kernel and FreeBSD. You replied: And introduce even more undefined behavior? I don't know how that creates more undefined behavior. So replied in the context of your reply that, according to your view even Linux is running with undefined behavior. > I can't see any traces of this in e.g. Linux 4.6. > > It seems like you don't understand. The undefined behaviour comes from mixing > non-atomic and atomic accesses > in the C11 source code. It doesn't have anything to do about if a read or > write of a volatile object is translated > to an instruction that may or may not provide an atomic load or store for > some specific architecture that > you are compiling for. > > Since the Linux kernel (AFAIK) doesn't use C11 _Atomic datatypes or GCC > __atomic builtins, there is no > undefined behaviour per the C11 standard. But it is relying on implementation > specific behaviour (which > I grant is not so likely to change due to backwards compatibility > requirements). > > > > > > > > Does not look like C11 memory model is super efficient in term of > gcc > > implementation. > > You are making a chicken out of a feather. > > > > I think this "problem" with one additional ADD instruction will only > concern __atomic_load_n(__ATOMIC_RELAXED) and > __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates the address > generation (add offset of struct member) from the load or store itself. For > other atomic operations and memory orderings (e.g. > __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will be included > anyway (as long as we access a non-first struct member) because e.g. LDAR > only accepts a base register with no offset. > > > > I suggest minimising the imposed memory orderings can have a much > larger (positive) effect on performance compared to avoiding one ADD > instruction (memory accesses are much slower than CPU ALU instructions). > > Using C11 memory model and identifying exactly which objects are used > for synchronisation and whether (any) updates to shared memory are acquired > or released (no updates to shared memory means relaxed order can be used) > will provide maximum freedom to the compiler and hardware to get the best > result. > > No more comments on this. It is not data driven. > Now this is something that would be interesting to benchmark. I don't know to > what > extent current compilers actually utilise the freedoms to optimise memory > accesses > according to e.g. acquire and release ordering of surrounding atomic > operations. > But I would like to know and this could possibly also lead to suggestions to > compiler > developers. I don't think there are too many multithreaded benchmarks and > even fewer > which don't use locks and other synchronisation mechanisms from OS's and > libraries. > But I have some multithreaded applications and benchmarks. > > > > > > > > The FreeBSD and DPDK ring buffers show some fundamental > misunderstandings here. Instead excessive orderings and explicit barriers > have been used as band-aids, with unknown effects on performance. > > > > > > > > > > > > > That the reason why I shared > > > the generated assembly code. If you think other way, Pick any > compiler > > > and see generated output. > > > This is what one compiler for one architecture generates today. > These things change. Other things > > > that used to work or worked for some specific architecture has > stopped working in newer versions of > > > the compiler. > > > > > > > > > And > > > > > > Freebsd implementation of ring buffer(Which DPDK derived > from), Don't have > > > such logic, See > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108 > > > It looks like FreeBSD uses some kind of C11 atomic memory > model-inspired API although I don't see > > > exactly how e.g. atomic_store_rel_int() is implemented. The code > also mixes in explicit barriers > > > so definitively not pure C11 memory model usage. And finally, it > doesn't establish the proper > > > load-acquire/store-release relationships (e.g. store-release > cons_tail requires a load-acquire cons_tail, > > > same for prod_tail). > > > > > > "* multi-producer safe lock-free ring buffer enqueue" > > > The comment is also wrong. This design is not lock-free, how > could it be when there is spinning > > > (waiting) for other threads in the code? If a thread must wait > for other threads, then by definition > > > the design is blocking. > > > > > > So you are saying that because FreeBSD is doing it wrong, DPDK > can also do it wrong? > > > > > > > > > See below too. > > > > > > > > > > > Skipping the compiler memory barrier in rte_pause() > potentially allows for optimisations that provide much more benefit, e.g. > hiding some cache miss latency for later loads. The DPDK ring buffer > implementation is defined so to enable inlining of enqueue/dequeue functions > into the caller, any code could immediately follow these calls. > > > > > > > > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x > > > > Programming languages — C > > > > > > > > 5.1.2.4 > > > > 4 Two expression evaluations conflict if one of them > modifies a memory location and the other one reads or modifies the same > memory location. > > > > > > > > 25 The execution of a program contains a data race if it > contains two conflicting actions in different threads, at least one of which > is not atomic, and neither happens before the other. Any such data race > results in undefined behavior. > > > > > > IMO, Both condition will satisfy if the variable is volatile > and 32bit read will atomic > > > for 32b and 64b machines. If not, the problem persist for > generic case > > > as well(lib/librte_ring/rte_ring_generic.h) > > > The read from a volatile object is not an atomic access per the > C11 spec. It just happens to > > > be translated to an instruction (on x86-64 and AArch64/A64) that > implements an atomic load. > > > I don't think any compiler would change this code generation and > suddenly generate some > > > non-atomic load instruction for a program that *only* uses > volatile to do "atomic" accesses. > > > But a future compiler could detect the mix of atomic and > non-atomic accesses and mark this > > > expression as causing undefined behaviour and that would have > consequences for code generation. > > > > > > > > > I agree with you on C11 memory model semantics usage. The > reason why I > > > propose name for the file as rte_ring_c11_mem.h as DPDK it > self did not > > > had definitions for load acquire and store release semantics. > > > I was looking for taking load acquire and store release > semantics > > > from C11 instead of creating new API like Linux kernel for > FreeBSD(APIs > > > like atomic_load_acq_32(), atomic_store_rel_32()). If the > file name is your > > > concern then we could create new abstractions as well. That > would help > > > exiting KNI problem as well. > > > I appreciate your embrace of the C11 memory model. I think it is > better for describing > > > (both to the compiler and to humans) which and how objects are > used for synchronisation. > > > > > > However, I don't think an API as you suggest (and others have > suggested before, e.g. as > > > done in ODP) is a good idea. There is an infinite amount of > possible base types, an > > > increasing number of operations and a bunch of different memory > orderings, a "complete" > > > API would be very large and difficult to test, and most members > of the API would never be used. > > > GCC and Clang both support the __atomic intrinsics. This API > avoids the problems I > > > described above. Or we could use the official C11 syntax > (stdatomic.h). But then we > > > have the problem with using pre-C11 compilers... > > > > I have no objection, if everyone agrees to move C11 memory model > > with __atomic intrinsics. But if we need to keep both have then > > atomic_load_acq_32() kind of API make sense. > > > > > > > > > > > > > > > > > > > I think, currently it mixed usage because, the same variable > declaration > > > used for C11 vs non C11 usage.Ideally we wont need "volatile" > for C11 > > > case. Either we need to change only to C11 mode OR have APIs > for > > > atomic_load_acq_() and atomic_store_rel_() to allow both > models like > > > Linux kernel and FreeBSD. > > > > > > > > > > > -- Ola > > > > > > > > > > > > > > > > > > > > > > > >