On Thu, 2017-02-02 at 14:48 +0000, Ramana Radhakrishnan wrote:
> On 30/01/17 18:54, Torvald Riegel wrote:
> > This patch fixes the __atomic builtins to not implement supposedly
> > lock-free atomic loads based on just a compare-and-swap operation.
> >
> > If there is no hardware-backed atomic load for a certain memory
> > location, the current implementation can implement the load with a CAS
> > while claiming that the access is lock-free.  This is a bug in the cases
> > of volatile atomic loads and atomic loads to read-only-mapped memory; it
> > also creates a lot of contention in case of concurrent atomic loads,
> > which results in at least counter-intuitive performance because most
> > users probably understand "lock-free" to mean hardware-backed (and thus
> > "fast") instead of just in the progress-criteria sense.
> >
> > This patch implements option 3b of the choices described here:
> > https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> 
> 
> Will Deacon pointed me at this thread asking if something similar could 
> be done on ARM.

It would be nice if someone more familiar with ARM could double-check
that ARM is not affected.  I guess it isn't, but that's based on my
reading of the machine descriptions, which I had never looked at before
working on this patch...

The patch introduces can_atomic_load_p, which checks whether optabs (and
thus the machine descriptions) provide an atomic load pattern for the
particular mode.
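
For reference, the check is essentially of this shape (a sketch only;
the exact code in the patch may differ, e.g. in how sub-word modes are
handled):

/* Sketch of can_atomic_load_p; details in the actual patch may differ.
   Return true iff the target can implement an atomic load of MODE
   without falling back to a compare-and-swap.  */
bool
can_atomic_load_p (machine_mode mode)
{
  /* Is there an atomic_load<mode> pattern in the machine description?  */
  if (direct_optab_handler (atomic_load_optab, mode) != CODE_FOR_nothing)
    return true;

  /* Loads no wider than a word are assumed to be atomic.  */
  return GET_MODE_PRECISION (mode) <= BITS_PER_WORD;
}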

IIRC, arm and aarch64 had atomic load patterns defined whenever they had
a CAS defined.  It would be nice if you could double-check that.  If
that's the case, nothing should change with my patch because
can_atomic_load_p would always return true if a CAS could be issued.
If that's not the case, arm/aarch64 could be in the same boat as x86_64
with cmpxchg16b; then, using Option 3b might be a useful (intermediate)
solution for ARM as well (OTOH, compatibility with existing code and
__sync builtins might not be as much of a factor as it is on x86_64).

The atomic load patterns you have could still be wrong, for example by
generating an LDXP/STXP loop or something else that can store on an
atomic load.  In that case, the problem is similar to not having a
custom load pattern, and the second case in the previous paragraph
applies.
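
To make that concrete, here is a (hypothetical) user-level example of
why a store in the load path is a bug; the 16-byte type and the
read-only mapping are just for illustration:

#include <string.h>
#include <sys/mman.h>

/* If the __atomic_load below were expanded to an LDXP/STXP (or CAS)
   loop, the "load" would try to write to the page and fault, because
   the mapping has been made read-only.  A genuine atomic load must not
   store.  */
int
main (void)
{
  __int128 initial = 42, v;
  void *p = mmap (NULL, 4096, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  memcpy (p, &initial, sizeof initial);
  mprotect (p, 4096, PROT_READ);  /* Now read-only.  */

  __atomic_load ((__int128 *) p, &v, __ATOMIC_SEQ_CST);
  return v == 42 ? 0 : 1;
}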

> On armv8-a we can implement an atomic load of 16 bytes using an LDXP / 
> STXP loop as a 16-byte load isn't single-copy atomic. On armv8.1-a we do 
> have a CAS on 16 bytes.
> 
> This was discussed last year here.
> 
> https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html
> 
> and the consensus seemed to be that we couldn't really do a CAS on 
> readonly data as we would have no way of avoiding a trap.

Yes, and I agree that not doing a CAS (or anything that can store on
atomic load) is the right thing to do.

BTW, on the ARM targets, is it possible that the __sync builtins would
use an LDXP/STXP loop for 16-byte accesses while the __atomic loads
would fall back to using libatomic and locks?  Or do you simply not
provide 16-byte __sync builtins?
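
To spell out what I mean (purely illustrative; whether aarch64 actually
inlines the __sync variant is exactly my question):

/* On armv8-a, is the __sync CAS below inlined as an LDXP/STXP loop
   while the __atomic load becomes a call into libatomic (which may
   take a lock)?  If so, the two would not actually synchronize with
   each other.  */
unsigned __int128
cas16 (unsigned __int128 *p, unsigned __int128 o, unsigned __int128 n)
{
  return __sync_val_compare_and_swap (p, o, n);
}

unsigned __int128
load16 (unsigned __int128 *p)
{
  unsigned __int128 v;
  __atomic_load (p, &v, __ATOMIC_SEQ_CST);
  return v;
}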

> I'm sure I'm missing some piece of the puzzle, so I'd be interested 
> in how you avoid this in this implementation on x86_64 with what I can 
> see is a CAS.

If we were starting from scratch, we wouldn't use cmpxchg16b at all when
there is no 16-byte atomic load.  However, past releases did use it, and
there might be code out there that has cmpxchg16b inlined.  As discussed
in the thread you cited, changing GCC 7 to call libatomic *and* have
libatomic use locks would be an effective ABI break.  Therefore, my
patch (and Option 3b) makes a compromise: GCC delegates to libatomic,
but libatomic gets custom code so that it keeps using cmpxchg16b.  That
means the actual synchronization doesn't change, but we constrain the
problem to libatomic and prevent application code generated with GCC 7
from being directly affected.
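
In other words, the 16-byte load in libatomic keeps being implemented on
top of cmpxchg16b; very roughly (hypothetical function name, and the
real libatomic code is structured differently):

/* Hypothetical sketch of the idea; the real libatomic code is organized
   differently.  The "load" is done with a compare-and-swap that writes
   back the value it observed, so it stays interoperable with older code
   that has cmpxchg16b inlined.  The downsides for read-only and
   volatile objects remain, but they are now confined to libatomic.  */
__int128
load_16_via_cas (__int128 *ptr)
{
  __int128 expected = 0;
  /* If *ptr != 0, the CAS fails and stores the current value into
     EXPECTED; if *ptr == 0, it swaps in 0 again.  Either way EXPECTED
     ends up holding the current value, at the cost of a store.  */
  __atomic_compare_exchange_n (ptr, &expected, 0, false,
                               __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  return expected;
}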

Does this clarify the x86_64 situation?
