http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48126

--- Comment #4 from Michael K. Edwards <m.k.edwards at gmail dot com> 
2011-05-24 16:38:41 UTC ---
OK, that's a clear explanation of why the DMB is necessary in the case where
both the compare and the store succeed (neither branch is taken; at a higher
semantic level, a lock is acquired, if that's what the atomic is being used
for).  For future reference, I would appreciate having those ARM ARM
(Architecture Reference Manual) quotations, along with some indication of how
load scheduling interacts with a branch past a memory barrier.

Suppose that the next instruction after label "2" is a load.  On some ARMv7
implementations, I presume that this load can be issued speculatively as early
as label "1", in the shadow of the "bne 2f" branch, which skips the trailing
dmb.  I gather that the intention is that, if this branch is not taken (and
thus we execute through the trailing dmb), the results of the speculative fetch
are discarded and the load is re-issued (on a multi-core device, with the
appropriate ordering guarantee with respect to the strex).

If this interpretation is more or less right, and the shipping silicon behaves
as intended, then the branch past the dmb may be harmless -- although I might
argue that it wastes memory bandwidth in what is usually the common case
(compare-and-swap succeeds) in exchange for a slight reduction in latency in
the other case.

Yet that's still not quite the documented semantics of the GCC compare-and-swap
primitive, which is supposed to be totally ordered whether or not the swap
succeeds.  When I write a lock-free algorithm, I may well rely on the guarantee
that the value read in the next line of code was actually fetched after the
value fetched by the ldrex.  In fact, I have code that does rely on this
guarantee; if thread A sees that thread B has altered the atomic location, then
it expects to be able to see all data that thread B wrote before issuing its
compare-and-swap.  Here's the problem case:

thread A:                 thread B:

dmb
<speculated load of Y>
                          store Y
                          dmb
                          ldrex X
                          cmp
                          bne (doesn't branch)
                          strex X
ldrex X
cmp
bne (branches)
load Y (speculated above)

Is there something I'm not seeing that prevents this?
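
For concreteness, here is roughly how the same pattern looks at the C level.
This is only a sketch: X and Y follow the diagram above, but the function
names, the 0/1/2 encoding of X, and the explicit __sync_synchronize() on the
failure path are illustrative assumptions of mine, not anything taken from GCC
or from the code under discussion.  The extra barrier is what a caller could
add if the builtin's failure path turns out not to order the following load.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared state, named after the diagram. */
static int X = 0;   /* atomic location: 0 = free, 1 = taken by B, 2 = taken by A */
static int Y = 0;   /* plain data that thread B writes before its CAS */

/* Thread B: store Y, then compare-and-swap X from 0 to 1.
 * Returns 1 if the swap succeeded. */
static int thread_b_publish(int payload)
{
    Y = payload;
    return __sync_val_compare_and_swap(&X, 0, 1) == 0;
}

/* Thread A: attempt the same CAS (0 -> 2).  If it fails because B already
 * altered X, read the data B wrote.  Returns 1 and fills *out in that case. */
static int thread_a_observe(int *out)
{
    int seen;

    assert(out != NULL);
    seen = __sync_val_compare_and_swap(&X, 0, 2);
    if (seen == 1) {
        /* Assumed workaround: an explicit full barrier on the failure path,
         * so that the load of Y cannot be satisfied by a value fetched
         * before the ldrex of X. */
        __sync_synchronize();
        *out = Y;
        return 1;
    }
    return 0;   /* A won the race (or X held some other value) */
}
```

In real use the two functions would run on different threads; calling them
back to back on one thread only exercises the control flow, not the ordering
question raised above.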
