-----Original Message-----
> Date: Mon, 8 Oct 2018 11:21:42 +0000
> From: Ola Liljedahl <ola.liljed...@arm.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin"
>  <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)"
>  <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd <n...@arm.com>,
>  "sta...@dpdk.org" <sta...@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.11.0.180909
> 
> 
> On 08/10/2018, 12:47, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:
> 
>     -----Original Message-----
>     > Date: Mon, 8 Oct 2018 10:25:45 +0000
>     > From: Ola Liljedahl <ola.liljed...@arm.com>
>     > To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >  <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin"
>     >  <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)"
>     >  <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd 
> <n...@arm.com>,
>     >  "sta...@dpdk.org" <sta...@dpdk.org>
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >
>     >
>     > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> 
> wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
>     >     > From: Ola Liljedahl <ola.liljed...@arm.com>
>     >     > To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
>     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >  <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin"
>     >     >  <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology China)"
>     >     >  <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, nd 
> <n...@arm.com>,
>     >     >  "sta...@dpdk.org" <sta...@dpdk.org>
>     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >
>     >     > External Email
>     >     >
>     >     > On 08/10/2018, 08:06, "Jerin Jacob" 
> <jerin.ja...@caviumnetworks.com> wrote:
>     >     >
>     >     >     -----Original Message-----
>     >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     >     >     > From: Ola Liljedahl <ola.liljed...@arm.com>
>     >     >     > To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
>     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >     >  <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin"
>     >     >     >  <konstantin.anan...@intel.com>, "Gavin Hu (Arm Technology 
> China)"
>     >     >     >  <gavin...@arm.com>, Steve Capper <steve.cap...@arm.com>, 
> nd <n...@arm.com>,
>     >     >     >  "sta...@dpdk.org" <sta...@dpdk.org>
>     >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic 
> load
>     >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >     >
>     >     >
>     >     >
>     >     >     Could you please fix the email client for inline reply.
>     >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or 
> Office365. The official Office365/Outlook
>     >     > documentation doesn't match the actual user interface...
>     >     >
>     >     >
>     >     >
>     >     >     
> https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
>     >     >
>     >     >
>     >     >     >
>     >     >     > On 07/10/2018, 06:03, "Jerin Jacob" 
> <jerin.ja...@caviumnetworks.com> wrote:
>     >     >     >
>     >     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by 
> asm volatile ("":::"memory") of rte_pause().
>     >     >     >     I would n't have any issue, if the generated code code 
> is same or better than the exiting case. but it not the case, Right?
>     >     >     > The existing case is actually not interesting (IMO) as it 
> exposes undefined behaviour which allows the compiler to do anything. But you 
> seem to be satisfied with "works for me, right here right now". I think the 
> cost of avoiding undefined behaviour is acceptable (actually I don't think it 
> even will be noticeable).
>     >     >
>     >     >     I am not convinced because of use of volatile in head and 
> tail indexes.
>     >     >     For me that brings the defined behavior.
>     >     > As long as you don't mix in C11 atomic accesses (just use "plain" 
> accesses to volatile objects),
>     >     > it is AFAIK defined behaviour (but not necessarily using atomic 
> loads and stores). But I quoted
>     >     > the C11 spec where it explicitly mentions that mixing atomic and 
> non-atomic accesses to the same
>     >     > object is undefined behaviour. Don't argue with me, argue with 
> the C11 spec.
>     >     > If you want to disobey the spec, this should at least be called 
> out for in the code with a comment.
>     >
>     >     That's boils down only one question, should we follow C11 spec? Why 
> not only take load
>     >     acquire and store release semantics only just like Linux kernel and 
> FreeBSD.
>     > And introduce even more undefined behaviour?
> 
> 
>     Yes. The all world(Linux and Freebsd) is running with undefined behavior 
> and still it runs.
> Are you saying that Linux kernel is using GCC __atomic or C11 
> _Atomic/stdatomic.h features?

No.

Since your email client is removing the context, I will clarify.

I asked:
That's boils down to only one question, should we follow C11 spec? Why not
only take load acquire and store release semantics only just like Linux kernel 
and FreeBSD.

You replied:
And introduce even more undefined behavior?

I don't know how that creates more undefined behavior. So replied in the
context of your reply that, according to your view even Linux is running
with undefined behavior.



> I can't see any traces of this in e.g. Linux 4.6.
> 
> It seems like you don't understand. The undefined behaviour comes from mixing 
> non-atomic and atomic accesses
> in the C11 source code. It doesn't have anything to do about if a read or 
> write of a volatile object is translated
> to an instruction that may or may not provide an atomic load or store for 
> some specific architecture that
> you are compiling for.
> 
> Since the Linux kernel (AFAIK) doesn't use C11 _Atomic datatypes or GCC 
> __atomic builtins, there is no
> undefined behaviour per the C11 standard. But it is relying on implementation 
> specific behaviour (which
> I grant is not so likely to change due to backwards compatibility 
> requirements).
> 
> 
> 
> 
>     >
>     >     Does not look like C11 memory model is super efficient in term of 
> gcc
>     >     implementation.
>     > You are making a chicken out of a feather.
>     >
>     > I think this "problem" with one additional ADD instruction will only 
> concern __atomic_load_n(__ATOMIC_RELAXED) and 
> __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates the address 
> generation (add offset of struct member) from the load or store itself. For 
> other atomic operations and memory orderings (e.g. 
> __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will be included 
> anyway (as long as we access a non-first struct member) because e.g. LDAR 
> only accepts a base register with no offset.
>     >
>     > I suggest minimising the imposed memory orderings can have a much 
> larger (positive) effect on performance compared to avoiding one ADD 
> instruction (memory accesses are much slower than CPU ALU instructions).
>     > Using C11 memory model and identifying exactly which objects are used 
> for synchronisation and whether (any) updates to shared memory are acquired 
> or released (no updates to shared memory means relaxed order can be used) 
> will provide maximum freedom to the compiler and hardware to get the best 
> result.
> 
>     No more comments on this. It is not data driven.
> Now this is something that would be interesting to benchmark. I don't know to 
> what
> extent current compilers actually utilise the freedoms to optimise memory 
> accesses
> according to e.g. acquire and release ordering of surrounding atomic 
> operations.
> But I would like to know and this could possibly also lead to suggestions to 
> compiler
> developers. I don't think there are too many multithreaded benchmarks and 
> even fewer
> which don't use locks and other synchronisation mechanisms from OS's and 
> libraries.
> But I have some multithreaded applications and benchmarks.
> 
> 
> 
> 
>     >
>     > The FreeBSD and DPDK ring buffers show some fundamental 
> misunderstandings here. Instead excessive orderings and explicit barriers 
> have been used as band-aids, with unknown effects on performance.
>     >
>     >
>     >     >
>     >     >
>     >     >     That the reason why I shared
>     >     >     the generated assembly code. If you think other way, Pick any 
> compiler
>     >     >     and see generated output.
>     >     > This is what one compiler for one architecture generates today. 
> These things change. Other things
>     >     > that used to work or worked for some specific architecture has 
> stopped working in newer versions of
>     >     > the compiler.
>     >     >
>     >     >
>     >     >     And
>     >     >
>     >     >     Freebsd implementation of ring buffer(Which DPDK derived 
> from), Don't have
>     >     >     such logic, See 
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
>     >     > It looks like FreeBSD uses some kind of C11 atomic memory 
> model-inspired API although I don't see
>     >     > exactly how e.g. atomic_store_rel_int() is implemented. The code 
> also mixes in explicit barriers
>     >     > so definitively not pure C11 memory model usage. And finally, it 
> doesn't establish the proper
>     >     > load-acquire/store-release relationships (e.g. store-release 
> cons_tail requires a load-acquire cons_tail,
>     >     > same for prod_tail).
>     >     >
>     >     > "* multi-producer safe lock-free ring buffer enqueue"
>     >     > The comment is also wrong. This design is not lock-free, how 
> could it be when there is spinning
>     >     > (waiting) for other threads in the code? If a thread must wait 
> for other threads, then by definition
>     >     > the design is blocking.
>     >     >
>     >     > So you are saying that because FreeBSD is doing it wrong, DPDK 
> can also do it wrong?
>     >     >
>     >     >
>     >     >     See below too.
>     >     >
>     >     >     >
>     >     >     > Skipping the compiler memory barrier in rte_pause() 
> potentially allows for optimisations that provide much more benefit, e.g. 
> hiding some cache miss latency for later loads. The DPDK ring buffer 
> implementation is defined so to enable inlining of enqueue/dequeue functions 
> into the caller, any code could immediately follow these calls.
>     >     >     >
>     >     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
>     >     >     > Programming languages — C
>     >     >     >
>     >     >     > 5.1.2.4
>     >     >     > 4 Two expression evaluations conflict if one of them 
> modifies a memory location and the other one reads or modifies the same 
> memory location.
>     >     >     >
>     >     >     > 25 The execution of a program contains a data race if it 
> contains two conflicting actions in different threads, at least one of which 
> is not atomic, and neither happens before the other. Any such data race 
> results in undefined behavior.
>     >     >
>     >     >     IMO, Both condition will satisfy if the variable is volatile 
> and 32bit read will atomic
>     >     >     for 32b and 64b machines. If not, the problem persist for 
> generic case
>     >     >     as well(lib/librte_ring/rte_ring_generic.h)
>     >     > The read from a volatile object is not an atomic access per the 
> C11 spec. It just happens to
>     >     > be translated to an instruction (on x86-64 and AArch64/A64) that 
> implements an atomic load.
>     >     > I don't think any compiler would change this code generation and 
> suddenly generate some
>     >     > non-atomic load instruction for a program that *only* uses 
> volatile to do "atomic" accesses.
>     >     > But a future compiler could detect the mix of atomic and 
> non-atomic accesses and mark this
>     >     > expression as causing undefined behaviour and that would have 
> consequences for code generation.
>     >     >
>     >     >
>     >     >     I agree with you on C11 memory model semantics usage. The 
> reason why I
>     >     >     propose name for the file as rte_ring_c11_mem.h as DPDK it 
> self did not
>     >     >     had definitions for load acquire and store release semantics.
>     >     >     I was looking for taking load acquire and store release 
> semantics
>     >     >     from C11 instead of creating new API like Linux kernel for 
> FreeBSD(APIs
>     >     >     like  atomic_load_acq_32(), atomic_store_rel_32()). If the 
> file name is your
>     >     >     concern then we could create new abstractions as well. That 
> would help
>     >     >     exiting KNI problem as well.
>     >     > I appreciate your embrace of the C11 memory model. I think it is 
> better for describing
>     >     > (both to the compiler and to humans) which and how objects are 
> used for synchronisation.
>     >     >
>     >     > However, I don't think an API as you suggest (and others have 
> suggested before, e.g. as
>     >     > done in ODP) is a good idea. There is an infinite amount of 
> possible base types, an
>     >     > increasing number of operations and a bunch of different memory 
> orderings, a "complete"
>     >     > API would be very large and difficult to test, and most members 
> of the API would never be used.
>     >     > GCC and Clang both support the __atomic intrinsics. This API 
> avoids the problems I
>     >     > described above. Or we could use the official C11 syntax 
> (stdatomic.h). But then we
>     >     > have the problem with using pre-C11 compilers...
>     >
>     >     I have no objection, if everyone agrees to move C11 memory model
>     >     with __atomic intrinsics. But if we need to keep both have then
>     >     atomic_load_acq_32() kind of API make sense.
>     >
>     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >     I think, currently it mixed usage because, the same variable 
> declaration
>     >     >     used for C11 vs non C11 usage.Ideally we wont need "volatile" 
> for C11
>     >     >     case. Either we need to change only to C11 mode OR have APIs 
> for
>     >     >     atomic_load_acq_() and atomic_store_rel_() to allow both 
> models like
>     >     >     Linux kernel and FreeBSD.
>     >     >
>     >     >     >
>     >     >     > -- Ola
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >
>     >
> 
> 

Reply via email to