https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #7 from Matthew Wahab <matthew.wahab at arm dot com> ---
(In reply to torvald from comment #5)
> (In reply to Matthew Wahab from comment #0)
> > The __sync builtins are implemented by expanding them to the equivalent
> > __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers.
> > This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow
> > some data references to move across the operation (see
> > https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html)
> 
> I don't think that's a precise characterization.  The difference discussed
> in that thread is that seq-cst fences have a stronger effect than seq-cst
> stores.  This is not really a question of MEMMODEL_SEQ_CST but what you
> apply it to.

Ok, my point was just that an __sync operation has a stronger barrier that an
__atomic operation.

> This test case is problematic in that it has a data race on bar unless bar
> is never modified by another thread, in which case it would be fine to load
> bar before the fetch_add.  It also loads bar irrespective of which value the
> fetch_add actually returns.  I'm aware that the __sync builtins offer no
> MEMMODEL_RELAXED, and so lots of code simply has a plain memory access.  
> 
> Nonetheless, I think you should also look at a test case that is properly
> synchronized, data-race-free C11 code with the builtins, and see whether
> that is compiled differently (ie, either use a relaxed MO load for bar or
> make the load of bar conditional on the outcome of the fetch_add).
> 
> To be on the safe side, you can also use the cppmem tool to verify what the
> C11 model requires the compiled code to guarantee.  Here's what you probably
> want to achieve:

I agree that this wouldn't affect valid C11 code (because of data-races) but my
understanding is that __sync builtins don't require a C11 model. The problem is
that the __syncs are being implemented using atomic operations that do assume a
C11 model and its notion of program validity. This looks like it would lead to
differences in behaviour when code, using only the __sync builtins and knowing
nothing of C11, is moved between targets with different memory models.


> > ----
> > produces
> > ----
> > T1:
> >     adrp    x0, .LANCHOR0
> >     add     x0, x0, :lo12:.LANCHOR0
> > .L2:
> >     ldaxr   w1, [x0]       ; load-acquire (__sync_fetch_and_add)
> >     add     w1, w1, 1
> >     stlxr   w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
> >     cbnz    w2, .L2
> >     ldr     w0, [x0, 4]    ; load (return bar)
> >     ret
> > ----
> > With this code, the load can be executed ahead of the store-release which
> > ends the __sync_fetch_and_add. A correct implementation should emit a dmb
> > instruction after the cbnz.
> 
> I'm not sufficiently familiar with ARM to assess whether that's correct.  Is
> this ARMv8?  It seems to be consistent with the mapping to machine code that
> GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html),
> which does not include a DMB.  However, it also shows no difference between
> an acq_rel CAS and a seq_cst CAS.  Could it be that the CAS sequence itself
> (ie, the machine code for it) prevents the reordering you are concerned
> about at the HW level?

Yes, its ARMv8. It's consistent with the code expected for the atomics used to
implement the __sync_fetch_and_add. I believe that acquire-release will
normally be treated as seq-cst for the aarch64 back-end (I'd have to
double-check though).

Reply via email to