https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #7 from Matthew Wahab <matthew.wahab at arm dot com> --- (In reply to torvald from comment #5) > (In reply to Matthew Wahab from comment #0) > > The __sync builtins are implemented by expanding them to the equivalent > > __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers. > > This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow > > some data references to move across the operation (see > > https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) > > I don't think that's a precise characterization. The difference discussed > in that thread is that seq-cst fences have a stronger effect than seq-cst > stores. This is not really a question of MEMMODEL_SEQ_CST but what you > apply it to. Ok, my point was just that an __sync operation has a stronger barrier that an __atomic operation. > This test case is problematic in that it has a data race on bar unless bar > is never modified by another thread, in which case it would be fine to load > bar before the fetch_add. It also loads bar irrespective of which value the > fetch_add actually returns. I'm aware that the __sync builtins offer no > MEMMODEL_RELAXED, and so lots of code simply has a plain memory access. > > Nonetheless, I think you should also look at a test case that is properly > synchronized, data-race-free C11 code with the builtins, and see whether > that is compiled differently (ie, either use a relaxed MO load for bar or > make the load of bar conditional on the outcome of the fetch_add). > > To be on the safe side, you can also use the cppmem tool to verify what the > C11 model requires the compiled code to guarantee. Here's what you probably > want to achieve: I agree that this wouldn't affect valid C11 code (because of data-races) but my understanding is that __sync builtins don't require a C11 model. The problem is that the __syncs are being implemented using atomic operations that do assume a C11 model and its notion of program validity. This looks like it would lead to differences in behaviour when code, using only the __sync builtins and knowing nothing of C11, is moved between targets with different memory models. > > ---- > > produces > > ---- > > T1: > > adrp x0, .LANCHOR0 > > add x0, x0, :lo12:.LANCHOR0 > > .L2: > > ldaxr w1, [x0] ; load-acquire (__sync_fetch_and_add) > > add w1, w1, 1 > > stlxr w2, w1, [x0] ; store-release (__sync_fetch_and_add) > > cbnz w2, .L2 > > ldr w0, [x0, 4] ; load (return bar) > > ret > > ---- > > With this code, the load can be executed ahead of the store-release which > > ends the __sync_fetch_and_add. A correct implementation should emit a dmb > > instruction after the cbnz. > > I'm not sufficiently familiar with ARM to assess whether that's correct. Is > this ARMv8? It seems to be consistent with the mapping to machine code that > GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), > which does not include a DMB. However, it also shows no difference between > an acq_rel CAS and a seq_cst CAS. Could it be that the CAS sequence itself > (ie, the machine code for it) prevents the reordering you are concerned > about at the HW level? Yes, its ARMv8. It's consistent with the code expected for the atomics used to implement the __sync_fetch_and_add. I believe that acquire-release will normally be treated as seq-cst for the aarch64 back-end (I'd have to double-check though).