If using -mcx16 (directly or indirectly) today, then cmpxchg16b is used to implement atomic loads too. I consider this a bug because it can result in a store being issued (e.g., when loading from a read-only page, or when trying to do a volatile atomic load), and because it can increase contention (which makes atomic loads perform very differently than HW load instructions would). See the thread "GCC libatomic ABI specification draft" for more background.
It's not quite obvious how to fix this. We can't just ignore -mcx16, and only honoring it if enabled directly doesn't seem like a useful approach either. We could try to deprecate -mcx16, but that doesn't give us a solution for the options that imply -mcx16 (we can't deprecate all of them).

* Option 1: Never issue cmpxchg16b and instead call out to libatomic, for both the __sync and __atomic builtins. However, that would create an additional dependency on libatomic, and would be incompatible when combined with old code that uses cmpxchg16b directly.

* Option 2: Alter the meaning of -mcx16 to *also* assert that SSE loads can be used for 16-byte atomic loads. We could then use SSE loads to implement loads in __atomic*, and keep __sync* unchanged. This would not break combining new code with old code, and would keep __sync compatible with __atomic. However, I do not have enough information about when SSE loads are atomic; this information would have to come from the x86 vendors, because we would depend on it (and the ABI we set up would depend on it). In the best case, all processors that support cmpxchg16b would also have atomic SSE loads. However, if this implication does not hold for a significant set of processors, then this change would alter when -mcx16 is safe to use; it may also be a problem for other options implying -mcx16.

* Option 3a: -mcx16 continues to only mean that cmpxchg16b is available, and we keep the __sync builtins unchanged. This doesn't break valid uses of __sync* (eg, if they didn't need atomic loads at all). We change __atomic for 16 bytes to not use cmpxchg16b but to instead call out to libatomic; libatomic would continue to use cmpxchg16b internally. We retain compatibility between __atomic and __sync. We do not change __atomic_*_lock_free.
This does not fix the load-via-cmpxchg bug, but it makes sure that we reroute through libatomic early for the __atomic builtins, so that it becomes easier in the future to do something like Option 2 or Option 3c. Until then, nothing would really change.

* Option 3b: Like Option 3a, except that __atomic_*_lock_free returns false for 16 bytes. The benefit over 3a is that this stops advertising "fast" atomics when that is arguably not the case, because the loads are slowed down by contention (I assume many more users read "lock-free" as "fast" than think about progress conditions). The potential downside is that programs may exist that assert(__atomic_always_lock_free(16, 0)); these assertions would fail, although the rest of the program would continue to work.

* Option 3c: Like Option 3b, but libatomic would not use cmpxchg16b internally and would instead fall back to locks for 16-byte atomics. This fixes the load-via-cmpxchg bug, but it breaks compatibility between old and new __atomic-using code, and between __sync and new __atomic.

* Option 4: Introduce a -mload16atomic option or similar that asserts that true 16-byte atomic loads are supported by the hardware (eg, through SSE). Enable this option for all processors where we know this to be true. Don't change __sync. Change __atomic to use the 16-byte atomic loads if available, and otherwise continue to use cmpxchg16b. Return false from __atomic_*_lock_free(16, ...) if 16-byte atomic loads are not available.

I think I prefer Option 3b as the short-term solution. It does not break programs (except for the __atomic_always_lock_free assertion scenario, but those are likely not to work as intended anyway, given that the atomics will be lock-free but not "fast"). It makes programs aware that the atomics will not be fast when they are indeed not fast (ie, when implementing loads through cmpxchg).
It also enables us to fix more programs earlier than Option 4 would, in the sense that Option 3b requires only a change to libatomic and no recompilation. The usage of __atomic should be less frequent today than in a year, I believe, and so this can make a difference. It introduces function call overhead, but that isn't much of a problem compared to atomic loads suffering from contention.

I'm worried that Option 4 would not be possible until some time in the future when we have actually gotten confirmation from the HW vendors about 16-byte atomic loads. The additional risk is that we may never get such confirmation (eg, because they do not want to constrain future HW), or that this actually holds for just a few processors. Option 3b does not preclude us from applying Option 4 selectively in the future (ie, for processors that do have true 16-byte atomic loads).

Thoughts? We're awfully close to the end of stage 3, but I'd prefer if we could do something for this release. I worry that dragging this out to the next release in a year will just make things worse, given that the __atomic builtins are still a relatively recent feature. To what extent could we still do some of these options (early) in stage 4?