Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load

Ola Liljedahl Mon, 08 Oct 2018 02:22:54 -0700

On 08/10/2018, 08:06, "Jerin Jacob" <[email protected]> wrote:


    -----Original Message-----
    > Date: Sun, 7 Oct 2018 20:44:54 +0000
    > From: Ola Liljedahl <[email protected]>
    > To: Jerin Jacob <[email protected]>
    > CC: "[email protected]" <[email protected]>, Honnappa Nagarahalli
    >  <[email protected]>, "Ananyev, Konstantin"
    >  <[email protected]>, "Gavin Hu (Arm Technology China)"
    >  <[email protected]>, Steve Capper <[email protected]>, nd 
<[email protected]>,
    >  "[email protected]" <[email protected]>
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > user-agent: Microsoft-MacOutlook/10.11.0.180909
    > 
    
    
    Could you please fix the email client for inline reply.
Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. 
The official Office365/Outlook
documentation doesn't match the actual user interface...


    
    https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
    
    
    > 
    > On 07/10/2018, 06:03, "Jerin Jacob" <[email protected]> 
wrote:
    > 
    >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile 
("":::"memory") of rte_pause().
    >     I would n't have any issue, if the generated code code is same or 
better than the exiting case. but it not the case, Right?
    > The existing case is actually not interesting (IMO) as it exposes 
undefined behaviour which allows the compiler to do anything. But you seem to 
be satisfied with "works for me, right here right now". I think the cost of 
avoiding undefined behaviour is acceptable (actually I don't think it even will 
be noticeable).
    
    I am not convinced because of use of volatile in head and tail indexes.
    For me that brings the defined behavior.
As long as you don't mix in C11 atomic accesses (just use "plain" accesses to 
volatile objects),
it is AFAIK defined behaviour (but not necessarily using atomic loads and 
stores). But I quoted
the C11 spec where it explicitly mentions that mixing atomic and non-atomic 
accesses to the same
object is undefined behaviour. Don't argue with me, argue with the C11 spec.
If you want to disobey the spec, this should at least be called out for in the 
code with a comment.


    That the reason why I shared
    the generated assembly code. If you think other way, Pick any compiler
    and see generated output.
This is what one compiler for one architecture generates today. These things 
change. Other things
that used to work or worked for some specific architecture has stopped working 
in newer versions of
the compiler.

    
    And
    
    Freebsd implementation of ring buffer(Which DPDK derived from), Don't have
    such logic, See 
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API 
although I don't see
exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in 
explicit barriers
so definitively not pure C11 memory model usage. And finally, it doesn't 
establish the proper
load-acquire/store-release relationships (e.g. store-release cons_tail requires 
a load-acquire cons_tail,
same for prod_tail).

"* multi-producer safe lock-free ring buffer enqueue"
The comment is also wrong. This design is not lock-free, how could it be when 
there is spinning
(waiting) for other threads in the code? If a thread must wait for other 
threads, then by definition
the design is blocking.

So you are saying that because FreeBSD is doing it wrong, DPDK can also do it 
wrong?

    
    See below too.
    
    > 
    > Skipping the compiler memory barrier in rte_pause() potentially allows 
for optimisations that provide much more benefit, e.g. hiding some cache miss 
latency for later loads. The DPDK ring buffer implementation is defined so to 
enable inlining of enqueue/dequeue functions into the caller, any code could 
immediately follow these calls.
    > 
    > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
    > Programming languages — C
    > 
    > 5.1.2.4
    > 4 Two expression evaluations conflict if one of them modifies a memory 
location and the other one reads or modifies the same memory location.
    > 
    > 25 The execution of a program contains a data race if it contains two 
conflicting actions in different threads, at least one of which is not atomic, 
and neither happens before the other. Any such data race results in undefined 
behavior.
    
    IMO, Both condition will satisfy if the variable is volatile and 32bit read 
will atomic
    for 32b and 64b machines. If not, the problem persist for generic case
    as well(lib/librte_ring/rte_ring_generic.h)
The read from a volatile object is not an atomic access per the C11 spec. It 
just happens to
be translated to an instruction (on x86-64 and AArch64/A64) that implements an 
atomic load.
I don't think any compiler would change this code generation and suddenly 
generate some
non-atomic load instruction for a program that *only* uses volatile to do 
"atomic" accesses.
But a future compiler could detect the mix of atomic and non-atomic accesses 
and mark this
expression as causing undefined behaviour and that would have consequences for 
code generation.
    
    
    I agree with you on C11 memory model semantics usage. The reason why I
    propose name for the file as rte_ring_c11_mem.h as DPDK it self did not
    had definitions for load acquire and store release semantics.
    I was looking for taking load acquire and store release semantics
    from C11 instead of creating new API like Linux kernel for FreeBSD(APIs
    like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
    concern then we could create new abstractions as well. That would help
    exiting KNI problem as well.
I appreciate your embrace of the C11 memory model. I think it is better for 
describing
(both to the compiler and to humans) which and how objects are used for 
synchronisation.

However, I don't think an API as you suggest (and others have suggested before, 
e.g. as
done in ODP) is a good idea. There is an infinite amount of possible base 
types, an
increasing number of operations and a bunch of different memory orderings, a 
"complete"
API would be very large and difficult to test, and most members of the API 
would never be used.

GCC and Clang both support the __atomic intrinsics. This API avoids the 
problems I
described above. Or we could use the official C11 syntax (stdatomic.h). But 
then we
have the problem with using pre-C11 compilers...



    
    I think, currently it mixed usage because, the same variable declaration
    used for C11 vs non C11 usage.Ideally we wont need "volatile" for C11
    case. Either we need to change only to C11 mode OR have APIs for 
    atomic_load_acq_() and atomic_store_rel_() to allow both models like
    Linux kernel and FreeBSD.
    
    > 
    > -- Ola
    > 
    > 
    >

Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load

Reply via email to