https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #43 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
(In reply to torvald from comment #37)
> (In reply to James Greenhalgh from comment #35)
> > (In reply to torvald from comment #32)
> > > (In reply to James Greenhalgh from comment #28)
> > > > This also gives us an easier route to fixing any issues with the
> > > > acquire/release __sync primitives (__sync_lock_test_and_set and
> > > > __sync_lock_release) if we decide that these also need to be
> > > > stronger than their C++11 equivalents.
> > >
> > > I don't think we have another case of different __sync vs. __atomics
> > > semantics in the case of __sync_lock_test_and_set. The current
> > > specification makes it clear that this is an acquire barrier, and from
> > > how it describes the semantics (ie, loads and stores that are
> > > program-order before the acquire op can move to after it), this seems
> > > to be consistent with the effects C11 specifies for acquire MO (with
> > > perhaps the distinction that C11 is clear that acquire needs to be
> > > paired with some release op to create an ordering constraint).
> >
> > I think that the question is which parts of an RMW operation with
> > MEMMODEL_ACQUIRE semantics are ordered. My understanding is that in
> > C++11 MEMMODEL_ACQUIRE only applies to the "load" half of the
> > operation. So an observer to:
> >
> >   atomic_flag_test_and_set_explicit (foo, memory_order_acquire);
> >   atomic_store_explicit (bar, 1, memory_order_relaxed);
>
> That depends on your observer. When reasoning about C11, please let's use
> complete examples (and ideally, run them through cppmem). For example, if
> the observation is done by a pair of relaxed-MO loads, bar==1 && foo==0
> can be observed:
>
>   int main() {
>     atomic_int foo = 0;
>     atomic_int bar = 0;
>     {{{ {
>           r1 = cas_strong(foo, 0, 1);
>           bar.store(1, memory_order_relaxed);
>         }
>     ||| {
>           r2 = bar.load(memory_order_relaxed).readsvalue(1);
>           r3 = foo.load(memory_order_relaxed).readsvalue(0);
>         }
>     }}};
>     return 0; }

Thanks for that, and sorry again for my sloppy terminology. I did try to
write a cppmem example, but I couldn't find the syntax for a CAS. Had I
been able to, this is roughly what I would have written, and it gives the
results I had expected and was trying to express.

> > Is permitted to observe a write to bar before a write to foo (but not
> > before the read from foo).
>
> How does one observe a load in C11 (ie, so that "before the read from
> foo" is defined)? You can constrain the reads-from of a load, but the
> load itself is not observable as an effect.

Sorry, this is ARM memory model terminology leaking across to my C11
examples, but even then what I was saying doesn't make sense. To observe a
load in the ARM model:

  "A read of a location in memory is said to be observed by an observer
   when a subsequent write to the location by the same observer has no
   effect on the value returned by the read."

But you are right, this is not interesting to model.

> I think I get what you're trying to say regarding the "load half", but I
> would phrase it differently: if there could be another release store to
> foo that the CAS/TAS reads-from, then moving the store to bar before the
> load would risk creating a cycle between inter-thread happens-before and
> sequenced-before-created intra-thread happens-before. (Note that the
> latter isn't visible to other threads though, except through additional
> inter-thread synchronization.)
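(To pin down the shape I was describing: written as plain C11, rather than
cppmem, I believe it amounts to something like the program below. This is
only a hand-written sketch -- the thread structure and names are mine, and
actually running it would of course prove nothing about what the model
permits.)

  #include <stdatomic.h>
  #include <stdio.h>
  #include <threads.h>  /* C11 threads, assuming the implementation has them.  */

  atomic_flag foo = ATOMIC_FLAG_INIT;
  atomic_int bar = 0;

  /* The thread whose ordering is under discussion.  */
  int writer (void *arg)
  {
    (void) arg;
    /* Acquire MO constrains only the "load" half of the RMW.  */
    atomic_flag_test_and_set_explicit (&foo, memory_order_acquire);
    atomic_store_explicit (&bar, 1, memory_order_relaxed);
    return 0;
  }

  /* The observer: with relaxed accesses, I believe C11 allows it to see
     bar == 1 while its own TAS still reads foo as clear, i.e. the store
     to bar may become visible before the "store" half of the writer's
     TAS.  */
  int observer (void *arg)
  {
    (void) arg;
    int b = atomic_load_explicit (&bar, memory_order_relaxed);
    int f = atomic_flag_test_and_set_explicit (&foo, memory_order_relaxed);
    printf ("bar=%d foo_was_set=%d\n", b, f);
    return 0;
  }

  int main (void)
  {
    thrd_t t1, t2;
    thrd_create (&t1, writer, NULL);
    thrd_create (&t2, observer, NULL);
    thrd_join (t1, NULL);
    thrd_join (t2, NULL);
    return 0;
  }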
> Perhaps one could also say that moving the store to bar before the store
> to foo could result in the CAS/TAS not being atomic -- but this is only
> reliably observable if the foo observation is a CAS/TAS itself (so it's
> totally ordered with the other CAS), or if there is an inter-thread
> happens-before between the observation of bar and the store to bar (which
> means using a release store for bar).
>
> I'm discussing these aspects because I think it matters which observations
> are actually possible in a reliable way. It matters for C11, and it may
> matter for the __sync semantics as well, because while the __sync
> functions promise TSO, we don't really do so for all (nonatomic) loads or
> stores anywhere else in the program...
>
> IOW, can we actually come up with reliable and correct counter-examples
> (either using the C11 or __sync interfaces) for the guarantees we think
> ARM might not provide?
>
> > My reading of the Itanium ABI is that the acquire barrier applies to the
> > entire operation (Andrew, I think you copied these over exactly
> > backwards in comment 34 ;) ):
> >
> >   "Disallows the movement of memory references to visible data from
> >    after the intrinsic (in program order) to before the intrinsic (this
> >    behavior is desirable at lock-acquire operations, hence the name)."
> >
> > The definition of __sync_lock_test_and_set is:
> >
> >   "Behavior:
> >    • Atomically store the supplied value in *ptr and return the old
> >      value of *ptr. (i.e.)
> >        { tmp = *ptr; *ptr = value; return tmp; }
> >    • Acquire barrier."
>
> Hmm. I don't think this list is meant to be a sequence; __sync_lock_release
> has two items, first the assignment to the memory location, and *second* a
> release barrier. If this were supposed to be a sequence, it wouldn't be
> correct for lock releases. It rather seems that, somehow, the whole
> intrinsic is supposed to be a barrier, and be atomic. Similarly, the
> __sync RMW ops just have a single full barrier listed under behavior.

Yes, I agree, and that was the interpretation I was getting at. I believe
this is a consequence of the Itanium "cmpxchg" instruction, which implements
the intrinsic in a single instruction and thus does not need to care about
its decomposition.

> > So by the strict letter of the specification, no memory references to
> > visible data should be allowed to move from after the entire body of the
> > intrinsic to before it. That is to say, in:
> >
> >   __sync_lock_test_and_set (foo, 1);
> >   bar = 1;
> >
> > an observer should not be able to observe the write to bar before the
> > write to foo. This is a difference from the C++11 semantics.
>
> Can you clarify what this observer would look like? I think we should look
> at both code examples that just use __sync for concurrent accesses, and
> examples that also use normal memory accesses as if those were guaranteed
> to be atomic. None of the __sync docs nor the psABI guarantee any of the
> latter AFAIK. I don't think it would be unreasonable to argue that __sync
> doesn't make promises about non-DRF normal accesses, and so, strictly
> speaking, maybe programs can't in fact rely on any of the behaviors that
> are complicated to implement for ARM. That's why I'd like to distinguish
> those two cases.
Sure, it would look much like your cppmem example above, though you can add
some barriers to the observer if you want to make a stronger example:

  bar = 0; foo = 0;

  thread_a {
    __sync_lock_test_and_set (foo, 1);
    bar = 1;
  }

  thread_b {
    /* If we can see the write to bar, the write to foo must also have
       happened.  */
    if (bar)        /* Reads 1.  */
      assert (foo); /* Should never read 0.  */
  }

> Currently, I couldn't say with confidence which semantics __sync really
> should have, and which semantics we can actually promise in practice. It
> feels as if we need to have a more thorough model / understanding and more
> detailed info about the trade-offs before we can make a decision. It feels
> kind of risky to just make a quick decision here that seems alright but
> which we don't truly understand -- OTOH, we want to deprecate __sync
> eventually, so maybe just going the super-conservative route is most
> practical.

Yes, agreed.

> > I'm not worried about __sync_lock_release; I think the documentation is
> > strong enough and unambiguous.
>
> Are you aware that GCC's __sync docs disallow store-store reordering
> across __sync_lock_release, whereas the psABI docs don't?

No, I was not, and even looking exactly for the text you were referring to,
it took me three attempts to spot it. Yes, I understand now why you are
concerned about the GCC wording. Perhaps this is just an artefact of a
mistake in transcribing the psABI? AArch64 is certainly not providing that
guarantee just now.
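(For the record, my reading of the wording difference is that it would only
matter for a pattern like the sketch below -- hypothetical names, and I
have not tried to construct a complete litmus test showing a reliable
observation.)

  static int lock = 0;
  static int data = 0;

  void worker (void)
  {
    while (__sync_lock_test_and_set (&lock, 1))   /* Acquire the lock.  */
      ;
    /* ... critical section ...  */
    __sync_lock_release (&lock);   /* Stores 0 to lock; release barrier.  */

    /* A plain store that is program-order after the release.  GCC's
       documentation says only that following memory *reads* may be
       speculated to before the barrier, which suggests this later *store*
       must not become visible before the store to lock.  The psABI
       release-barrier wording only constrains accesses that are
       program-order *before* the intrinsic, so it would permit this store
       to be reordered before the store to lock -- which is also what an
       AArch64 implementation using a plain store-release (STLR) for the
       release would allow.  */
    data = 1;
  }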