https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #37 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #35)
> (In reply to torvald from comment #32)
> > (In reply to James Greenhalgh from comment #28)
> > > This also gives us an easier route to fixing any issues with the
> > > acquire/release __sync primitives (__sync_lock_test_and_set and
> > > __sync_lock_release) if we decide that these also need to be stronger than
> > > their C++11 equivalents.
> > 
> > I don't think we have another case of different __sync vs. __atomics
> > semantics in case of __sync_lock_test_and_set.  The current specification
> > makes it clear that this is an acquire barrier, and how it describes the
> > semantics (ie, loads and stores that are program-order before the acquire op
> > can move to after it) , this seems to be consistent with the effects C11
> > specifies for acquire MO (with perhaps the distinction that C11 is clear
> > that acquire needs to be paired with some release op to create an ordering
> > constraint).
> 
> I think that the question is which parts of a RMW operation with
> MEMMODEL_ACQUIRE semantics is ordered. My understanding is that in C++11
> MEMMODEL_ACQUIRE only applies to the "load" half of the operation. So an
> observer to:
> 
>   atomic_flag_test_and_set_explicit(foo, memory_order_acquire)
>   atomic_store_exlicit (bar, 1, memory_model_relaxed)

That depends on your observer.  When reasoning about C11, please let's use
complete examples (and ideally, run them through cppmem).  For example, if the
observation is done by a pair of relaxed-MO loads, bar==1 && foo==0 can be
observed:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  {{{ {
    r1 = cas_strong(foo, 0, 1);
    bar.store(1, memory_order_relaxed);
  } ||| {
    r2 = bar.load(memory_order_relaxed).readsvalue(1); 
    r3 = foo.load(memory_order_relaxed).readsvalue(0); 
  } }}};
  return 0; }

> Is permitted to observe a write to bar before a write to foo (but not before
> the read from foo).

How does one observe a load in C11 (ie, so that "before the read from foo" is
defined)?  You can constrain the reads-from of a load, but the load itself is
not observable as an effect.

I think I get what you're trying to say regarding the "load half" but I would
phrase it differently: If there could be another release store to foo that the
CAS/TAS reads-from, then moving the store to bar before the load would risk
creating a cycle between inter-thread happens-before and
sequenced-before-created intra-thread happens-before.  (Note that the later
isn't visible to other threads though, except through additional inter-thread
synchronization.)

Perhaps one could also say that moving the store to bar before the store to foo
could result in the CAS/TAS not being atomic -- but this is just reliably
observable if the foo observation is a CAS/TAS by itself (so it's totally
ordered with the other CAS), or if there is a inter-thread happens-before
between the observation of bar and the store to bar (which means using a
release store for bar).

I'm discussing these aspects because I think it matters which observations are
actually possible in a reliable way.  It matters for C11, and it may matter for
the __sync semantics as well because while the __sync functions promise TSO, we
don't really do so for all (nonatomic) loads or stores anywhere else in the
program...

IOW, can we actually come up with reliable and correct counter-examples (either
using the C11 or __sync interfaces) for the guarantees we think ARM might not
provide?

> My reading of the Itanium ABI is that the acquire barrier applies to the
> entire operation (Andrew, I think you copied these over exactly backwards in
> comment 34 ;) ):
> 
>   "Disallows the movement of memory references to visible data from
>    after the intrinsic (in program order) to before the intrinsic (this
>    behavior is desirable at lock-acquire operations, hence the name)."
> 
> The definition of __sync_lock_test_and_set is:
> 
>   "Behavior:
>    • Atomically store the supplied value in *ptr and return the old value
>      of *ptr. (i.e.)
>        { tmp = *ptr; *ptr = value; return tmp; }
>    • Acquire barrier."

Hmm.  I don't think this list is meant to be a sequence; __sync_lock_release
has two items, first the assignment to the memory location, and *second* a
release barrier.  If this were supposed to be a sequence, it wouldn't be
correct for lock releases.  It rather seems that, somehow, the whole intrinsic
is supposed to be a barrier, and be atomic.  Similarly, the __sync RMW ops just
have a single full barrier listed under behavior.

> So by the strict letter of the specification, no memory references to
> visible data should be allowed to move from after the entire body of the
> intrinsic to before it. That is to say in:
> 
>   __sync_lock_test_and_set (foo, 1)
>   bar = 1
> 
> an observer should not be able to observe the write to bar before the write
> to foo. This is a difference from the C++11 semantics.

Can you clarify how this observer would look like?  I think we should look at
both code examples that just use __sync for concurrent accesses, and examples
that also use normal memory accesses as if those would be guaranteed to be
atomic.  None of the __sync docs nor the psABI guarantee any of the latter
AFAIK.  I don't think it would be unreasonable to argue that __sync doesn't
make promises about non-DRF normal accesses, and so, strictly speaking, maybe
programs can't in fact rely on any of the behaviors that are complicated to
implement for ARM.  That's why I'd like to distinguish those two cases.

Currently, I couldn't say with confidence which semantics __sync really should
have, and which semantics we can actually promise in practice.  It feels as if
we need to have a more thorough model / understanding and more detailed info
about the trade-offs before we can make a decision.  It feels kind of risky to
just make a quick decision here that seems alright but which we don't truly
understand -- OTOH, we want to deprecate __sync eventually, so maybe just going
the super-conservative route is most practical.

> I'm not worried about __sync_lock_release, I think the documentation is
> strong enough and unambiguous.

Are you aware that the GCC's __sync disallow store-store reordering across
__sync_lock_release, whereas the psABI docs don't?

Reply via email to