If using -mcx16 (directly or indirectly) today, then cmpxchg16b is used to implement atomic loads too. I consider this a bug because it can result in a store being issued (e.g., when loading from a read-only page, or when trying to do a volatile atomic load), and because it can increase contention (which makes atomic loads perform very differently than HW load instructions would). See the thread "GCC libatomic ABI specification draft" for more background.
It's not quite obvious how to fix this. We can't just ignore -mcx16, and only honoring it if enabled directly doesn't seem like a useful approach either. We could try to deprecate -mcx16, but that doesn't give us a solution for the options that imply -mcx16 (we can't deprecate all of them).

* Option 1: Never issue cmpxchg16b and instead call out to libatomic, for both the __sync and __atomic builtins. However, that would create an additional dependency on libatomic, and would be incompatible when combined with old code that uses cmpxchg16b directly.

* Option 2: Alter the meaning of -mcx16 to *also* assert that SSE loads can be used for 16-byte atomic loads. We could then use SSE loads to implement loads in __atomic*, and keep __sync* unchanged. This would not break combining new code with old code, and would keep __sync compatible with __atomic. However, I do not have enough information about when SSE loads are atomic; this information would have to come from the x86 vendors, because we would depend on it (and the ABI we set up would depend on it). In the best case, all processors that support cmpxchg16b would also have atomic SSE loads. However, if this implication does not hold for a significant set of processors, then this change would alter when -mcx16 is safe to use; it may also be a problem for other options implying -mcx16.

* Option 3a: -mcx16 continues to only mean that cmpxchg16b is available, and we keep the __sync builtins unchanged. This doesn't break valid uses of __sync* (eg, if they didn't need atomic loads at all). We change __atomic for 16 bytes to not use cmpxchg16b but to instead call out to libatomic; libatomic would continue to use cmpxchg16b internally. We retain compatibility between __atomic and __sync. We do not change __atomic_*_lock_free.
This does not fix the load-via-cmpxchg bug, but it makes sure that we reroute through libatomic early for the __atomic builtins, so that it becomes easier in the future to do something like Option 2 or Option 3c. Until then, nothing would really change.

* Option 3b: Like Option 3a, except that __atomic_*_lock_free returns false for 16 bytes. The benefit over 3a is that this stops advertising "fast" atomics when that is arguably not the case, because the loads are slowed down by contention (I assume many more users read "lock-free" as "fast" than think about progress conditions). The potential downside is that programs may exist that assert(__atomic_always_lock_free(16, 0)); these assertions would fail, although the rest of the program would continue to work.

* Option 3c: Like Option 3b, but libatomic would not use cmpxchg16b internally and would instead fall back to locks for 16-byte atomics. This fixes the load-via-cmpxchg bug, but it breaks compatibility between old and new __atomic-using code, and between __sync and new __atomic.

* Option 4: Introduce a -mload16atomic option or similar that asserts that true 16-byte atomic loads are supported by the hardware (eg, through SSE). Enable this option for all processors where we know this to be true. Don't change __sync. Change __atomic to use the 16-byte atomic loads if available, and otherwise continue to use cmpxchg16b. Return false from __atomic_*_lock_free(16, ...) if 16-byte atomic loads are not available.

I think I prefer Option 3b as the short-term solution. It does not break programs (except for the __atomic_always_lock_free assertion scenario, but those are likely not to work as intended anyway, given that the atomics will be lock-free but not "fast"). It makes programs aware that the atomics will not be fast when they are indeed not fast (ie, when implementing loads through cmpxchg).
It also enables us to fix more programs earlier than Option 4 would, in the sense that Option 3b requires only a change to libatomic and no recompilation. The usage of __atomic should be less frequent today than in a year, I believe, and so this can make a difference. It introduces function call overhead, but that isn't much of a problem compared to atomic loads suffering from contention.

I'm worried that Option 4 would not be possible until some time in the future when we have actually gotten confirmation from the HW vendors about 16-byte atomic loads. The additional risk is that we may never get such confirmation (eg, because they do not want to constrain future HW), or that this actually holds for just a few processors. Option 3b does not preclude us from applying Option 4 selectively in the future (ie, for processors that do have true 16-byte atomic loads).

Thoughts? We're awfully close to the end of stage 3, but I'd prefer if we could do something for this release. I worry that dragging this out to the next release in a year will just make things worse, given that the __atomic builtins are still a relatively recent feature. To what extent could we still do some of these options (early) in stage 4?