https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #43 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
(In reply to torvald from comment #37)
> (In reply to James Greenhalgh from comment #35)
> > (In reply to torvald from comment #32)
> > > (In reply to James Greenhalgh from comment #28)
> > > > This also gives us an easier route to fixing any issues with the
> > > > acquire/release __sync primitives (__sync_lock_test_and_set and
> > > > __sync_lock_release) if we decide that these also need to be
> > > > stronger than their C++11 equivalents.
> > >
> > > I don't think we have another case of different __sync vs. __atomics
> > > semantics in the case of __sync_lock_test_and_set. The current
> > > specification makes it clear that this is an acquire barrier, and from
> > > how it describes the semantics (ie, loads and stores that are
> > > program-order before the acquire op can move to after it), this seems
> > > to be consistent with the effects C11 specifies for acquire MO (with
> > > perhaps the distinction that C11 is clear that acquire needs to be
> > > paired with some release op to create an ordering constraint).
> >
> > I think that the question is which parts of an RMW operation with
> > MEMMODEL_ACQUIRE semantics are ordered. My understanding is that in
> > C++11 MEMMODEL_ACQUIRE only applies to the "load" half of the
> > operation. So an observer to:
> >
> >   atomic_flag_test_and_set_explicit (foo, memory_order_acquire);
> >   atomic_store_explicit (bar, 1, memory_order_relaxed);
>
> That depends on your observer. When reasoning about C11, please let's use
> complete examples (and ideally, run them through cppmem). For example, if
> the observation is done by a pair of relaxed-MO loads, bar==1 && foo==0
> can be observed:
>
>   int main() {
>     atomic_int foo = 0;
>     atomic_int bar = 0;
>     {{{ {
>           r1 = cas_strong(foo, 0, 1);
>           bar.store(1, memory_order_relaxed);
>         }
>     ||| {
>           r2 = bar.load(memory_order_relaxed).readsvalue(1);
>           r3 = foo.load(memory_order_relaxed).readsvalue(0);
>         }
>     }}};
>     return 0; }

Thanks for that, and sorry again for my sloppy terminology. I did try to
write a cppmem example, but I couldn't find the syntax for a CAS. Had I
been able to, this is roughly what I would have written, and it gives the
results I had expected and was trying to express.

> > Is permitted to observe a write to bar before a write to foo (but not
> > before the read from foo).
>
> How does one observe a load in C11 (ie, so that "before the read from
> foo" is defined)? You can constrain the reads-from of a load, but the
> load itself is not observable as an effect.

Sorry, this is ARM memory model terminology leaking across to my C11
examples, but even then what I was saying doesn't make sense. To observe a
load in the ARM model:

  "A read of a location in memory is said to be observed by an observer
   when a subsequent write to the location by the same observer has no
   effect on the value returned by the read."

But you are right, this is not interesting to model.

> I think I get what you're trying to say regarding the "load half", but I
> would phrase it differently: if there could be another release store to
> foo that the CAS/TAS reads-from, then moving the store to bar before the
> load would risk creating a cycle between inter-thread happens-before and
> sequenced-before-created intra-thread happens-before. (Note that the
> latter isn't visible to other threads though, except through additional
> inter-thread synchronization.)
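(To pin down the shape I was describing: written as plain C11, rather than
cppmem, I believe it amounts to something like the program below. This is
only a hand-written sketch -- the thread structure and names are mine, and
actually running it would of course prove nothing about what the model
permits.)

  #include <stdatomic.h>
  #include <stdio.h>
  #include <threads.h>  /* C11 threads, assuming the implementation has them.  */

  atomic_flag foo = ATOMIC_FLAG_INIT;
  atomic_int bar = 0;

  /* The thread whose ordering is under discussion.  */
  int writer (void *arg)
  {
    (void) arg;
    /* Acquire MO constrains only the "load" half of the RMW.  */
    atomic_flag_test_and_set_explicit (&foo, memory_order_acquire);
    atomic_store_explicit (&bar, 1, memory_order_relaxed);
    return 0;
  }

  /* The observer: with relaxed accesses, I believe C11 allows it to see
     bar == 1 while its own TAS still reads foo as clear, i.e. the store
     to bar may become visible before the "store" half of the writer's
     TAS.  */
  int observer (void *arg)
  {
    (void) arg;
    int b = atomic_load_explicit (&bar, memory_order_relaxed);
    int f = atomic_flag_test_and_set_explicit (&foo, memory_order_relaxed);
    printf ("bar=%d foo_was_set=%d\n", b, f);
    return 0;
  }

  int main (void)
  {
    thrd_t t1, t2;
    thrd_create (&t1, writer, NULL);
    thrd_create (&t2, observer, NULL);
    thrd_join (t1, NULL);
    thrd_join (t2, NULL);
    return 0;
  }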
> Perhaps one could also say that moving the store to bar before the store
> to foo could result in the CAS/TAS not being atomic -- but this is only
> reliably observable if the foo observation is a CAS/TAS itself (so it's
> totally ordered with the other CAS), or if there is an inter-thread
> happens-before between the observation of bar and the store to bar (which
> means using a release store for bar).
>
> I'm discussing these aspects because I think it matters which observations
> are actually possible in a reliable way. It matters for C11, and it may
> matter for the __sync semantics as well, because while the __sync
> functions promise TSO, we don't really do so for all (nonatomic) loads or
> stores anywhere else in the program...
>
> IOW, can we actually come up with reliable and correct counter-examples
> (either using the C11 or __sync interfaces) for the guarantees we think
> ARM might not provide?
>
> > My reading of the Itanium ABI is that the acquire barrier applies to the
> > entire operation (Andrew, I think you copied these over exactly
> > backwards in comment 34 ;) ):
> >
> >   "Disallows the movement of memory references to visible data from
> >    after the intrinsic (in program order) to before the intrinsic (this
> >    behavior is desirable at lock-acquire operations, hence the name)."
> >
> > The definition of __sync_lock_test_and_set is:
> >
> >   "Behavior:
> >    • Atomically store the supplied value in *ptr and return the old
> >      value of *ptr. (i.e.)
> >        { tmp = *ptr; *ptr = value; return tmp; }
> >    • Acquire barrier."
>
> Hmm. I don't think this list is meant to be a sequence; __sync_lock_release
> has two items, first the assignment to the memory location, and *second* a
> release barrier. If this were supposed to be a sequence, it wouldn't be
> correct for lock releases. It rather seems that, somehow, the whole
> intrinsic is supposed to be a barrier, and be atomic. Similarly, the
> __sync RMW ops just have a single full barrier listed under behavior.

Yes, I agree, and that was the interpretation I was getting at. I believe
this is a consequence of the Itanium "cmpxchg" instruction, which implements
the intrinsic in a single instruction and thus does not need to care about
its decomposition.

> > So by the strict letter of the specification, no memory references to
> > visible data should be allowed to move from after the entire body of the
> > intrinsic to before it. That is to say, in:
> >
> >   __sync_lock_test_and_set (foo, 1);
> >   bar = 1;
> >
> > an observer should not be able to observe the write to bar before the
> > write to foo. This is a difference from the C++11 semantics.
>
> Can you clarify what this observer would look like? I think we should look
> at both code examples that just use __sync for concurrent accesses, and
> examples that also use normal memory accesses as if those were guaranteed
> to be atomic. None of the __sync docs nor the psABI guarantee any of the
> latter AFAIK. I don't think it would be unreasonable to argue that __sync
> doesn't make promises about non-DRF normal accesses, and so, strictly
> speaking, maybe programs can't in fact rely on any of the behaviors that
> are complicated to implement for ARM. That's why I'd like to distinguish
> those two cases.
Sure, it would look much like your cppmem example above, though you can add
some barriers to the observer if you want to make a stronger example:

  bar = 0; foo = 0;

  thread_a {
    __sync_lock_test_and_set (foo, 1);
    bar = 1;
  }

  thread_b {
    /* If we can see the write to bar, the write to foo must also have
       happened.  */
    if (bar)        /* Reads 1.  */
      assert (foo); /* Should never read 0.  */
  }

> Currently, I couldn't say with confidence which semantics __sync really
> should have, and which semantics we can actually promise in practice. It
> feels as if we need to have a more thorough model / understanding and more
> detailed info about the trade-offs before we can make a decision. It feels
> kind of risky to just make a quick decision here that seems alright but
> which we don't truly understand -- OTOH, we want to deprecate __sync
> eventually, so maybe just going the super-conservative route is most
> practical.

Yes, agreed.

> > I'm not worried about __sync_lock_release; I think the documentation is
> > strong enough and unambiguous.
>
> Are you aware that GCC's __sync docs disallow store-store reordering
> across __sync_lock_release, whereas the psABI docs don't?

No, I was not, and even looking exactly for the text you were referring to,
it took me three attempts to spot it. Yes, I understand now why you are
concerned about the GCC wording. Perhaps this is just an artefact of a
mistake in transcribing the psABI? AArch64 is certainly not providing that
guarantee just now.
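(For the record, my reading of the wording difference is that it would only
matter for a pattern like the sketch below -- hypothetical names, and I
have not tried to construct a complete litmus test showing a reliable
observation.)

  static int lock = 0;
  static int data = 0;

  void worker (void)
  {
    while (__sync_lock_test_and_set (&lock, 1))   /* Acquire the lock.  */
      ;
    /* ... critical section ...  */
    __sync_lock_release (&lock);   /* Stores 0 to lock; release barrier.  */

    /* A plain store that is program-order after the release.  GCC's
       documentation says only that following memory *reads* may be
       speculated to before the barrier, which suggests this later *store*
       must not become visible before the store to lock.  The psABI
       release-barrier wording only constrains accesses that are
       program-order *before* the intrinsic, so it would permit this store
       to be reordered before the store to lock -- which is also what an
       AArch64 implementation using a plain store-release (STLR) for the
       release would allow.  */
    data = 1;
  }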