https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #37 from torvald at gcc dot gnu.org --- (In reply to James Greenhalgh from comment #35) > (In reply to torvald from comment #32) > > (In reply to James Greenhalgh from comment #28) > > > This also gives us an easier route to fixing any issues with the > > > acquire/release __sync primitives (__sync_lock_test_and_set and > > > __sync_lock_release) if we decide that these also need to be stronger than > > > their C++11 equivalents. > > > > I don't think we have another case of different __sync vs. __atomics > > semantics in case of __sync_lock_test_and_set. The current specification > > makes it clear that this is an acquire barrier, and how it describes the > > semantics (ie, loads and stores that are program-order before the acquire op > > can move to after it) , this seems to be consistent with the effects C11 > > specifies for acquire MO (with perhaps the distinction that C11 is clear > > that acquire needs to be paired with some release op to create an ordering > > constraint). > > I think that the question is which parts of a RMW operation with > MEMMODEL_ACQUIRE semantics is ordered. My understanding is that in C++11 > MEMMODEL_ACQUIRE only applies to the "load" half of the operation. So an > observer to: > > atomic_flag_test_and_set_explicit(foo, memory_order_acquire) > atomic_store_exlicit (bar, 1, memory_model_relaxed) That depends on your observer. When reasoning about C11, please let's use complete examples (and ideally, run them through cppmem). For example, if the observation is done by a pair of relaxed-MO loads, bar==1 && foo==0 can be observed: int main() { atomic_int foo = 0; atomic_int bar = 0; {{{ { r1 = cas_strong(foo, 0, 1); bar.store(1, memory_order_relaxed); } ||| { r2 = bar.load(memory_order_relaxed).readsvalue(1); r3 = foo.load(memory_order_relaxed).readsvalue(0); } }}}; return 0; } > Is permitted to observe a write to bar before a write to foo (but not before > the read from foo). How does one observe a load in C11 (ie, so that "before the read from foo" is defined)? You can constrain the reads-from of a load, but the load itself is not observable as an effect. I think I get what you're trying to say regarding the "load half" but I would phrase it differently: If there could be another release store to foo that the CAS/TAS reads-from, then moving the store to bar before the load would risk creating a cycle between inter-thread happens-before and sequenced-before-created intra-thread happens-before. (Note that the later isn't visible to other threads though, except through additional inter-thread synchronization.) Perhaps one could also say that moving the store to bar before the store to foo could result in the CAS/TAS not being atomic -- but this is just reliably observable if the foo observation is a CAS/TAS by itself (so it's totally ordered with the other CAS), or if there is a inter-thread happens-before between the observation of bar and the store to bar (which means using a release store for bar). I'm discussing these aspects because I think it matters which observations are actually possible in a reliable way. It matters for C11, and it may matter for the __sync semantics as well because while the __sync functions promise TSO, we don't really do so for all (nonatomic) loads or stores anywhere else in the program... IOW, can we actually come up with reliable and correct counter-examples (either using the C11 or __sync interfaces) for the guarantees we think ARM might not provide? > My reading of the Itanium ABI is that the acquire barrier applies to the > entire operation (Andrew, I think you copied these over exactly backwards in > comment 34 ;) ): > > "Disallows the movement of memory references to visible data from > after the intrinsic (in program order) to before the intrinsic (this > behavior is desirable at lock-acquire operations, hence the name)." > > The definition of __sync_lock_test_and_set is: > > "Behavior: > • Atomically store the supplied value in *ptr and return the old value > of *ptr. (i.e.) > { tmp = *ptr; *ptr = value; return tmp; } > • Acquire barrier." Hmm. I don't think this list is meant to be a sequence; __sync_lock_release has two items, first the assignment to the memory location, and *second* a release barrier. If this were supposed to be a sequence, it wouldn't be correct for lock releases. It rather seems that, somehow, the whole intrinsic is supposed to be a barrier, and be atomic. Similarly, the __sync RMW ops just have a single full barrier listed under behavior. > So by the strict letter of the specification, no memory references to > visible data should be allowed to move from after the entire body of the > intrinsic to before it. That is to say in: > > __sync_lock_test_and_set (foo, 1) > bar = 1 > > an observer should not be able to observe the write to bar before the write > to foo. This is a difference from the C++11 semantics. Can you clarify how this observer would look like? I think we should look at both code examples that just use __sync for concurrent accesses, and examples that also use normal memory accesses as if those would be guaranteed to be atomic. None of the __sync docs nor the psABI guarantee any of the latter AFAIK. I don't think it would be unreasonable to argue that __sync doesn't make promises about non-DRF normal accesses, and so, strictly speaking, maybe programs can't in fact rely on any of the behaviors that are complicated to implement for ARM. That's why I'd like to distinguish those two cases. Currently, I couldn't say with confidence which semantics __sync really should have, and which semantics we can actually promise in practice. It feels as if we need to have a more thorough model / understanding and more detailed info about the trade-offs before we can make a decision. It feels kind of risky to just make a quick decision here that seems alright but which we don't truly understand -- OTOH, we want to deprecate __sync eventually, so maybe just going the super-conservative route is most practical. > I'm not worried about __sync_lock_release, I think the documentation is > strong enough and unambiguous. Are you aware that the GCC's __sync disallow store-store reordering across __sync_lock_release, whereas the psABI docs don't?