On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:

> Let's simplify it slightly. The compiler can optimize away x=1 and x=3 as
> dead stores (even valid on atomics!), leaving us with 2 modification orders:
> 2,4 or 4,2. And what you are getting at is you don't think we should ever
> see r1==2, r2==4 and r3==4, r4==2.

Right, I agree that the compiler can optimize away both the double writes and the double reads.

> Let's say the order of the writes turns out to be 2,4... is it possible for
> both writes to be travelling around some bus and have thread 4 actually
> read the second one first, followed by the first one? It would imply a lack
> of memory coherency in the system, wouldn't it? My simple understanding is
> that the hardware gives us this sort of minimum guarantee on all shared
> memory, which means we should never see that happen.
No, it is possible, and actually likely. Basically, the issue is write buffers. The coherency mechanisms come into play at a lower level in the hierarchy (typically at the last-level cache), which is why we need fences in the first place to implement things like spin locks.

Threads running on the same CPU may share the same caches (think about a thread switch or hyper-threading). Now both processors may have a copy of the same cache line and both try to write to some location in that line. Then they'll both try to get exclusive access to the cache line. One CPU will succeed; the other will take a cache miss. However, while all this is going on, the write is just sitting in a write buffer, and any references from the same processor will simply be forwarded the value of the outstanding write.

> And if we can't see that, then I don't see how we can see your example...
> *one* of those modification orders has to be what is actually written to x,
> and reads from that memory location will not be able to see anything else.
> (i.e., if it was 1,2,3,4 then thread 4 would not be able to see r3==4,
> r4==1 thanks to memory coherency.)

No, that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to force a write followed by a load of the same location to go all the way to coherent memory, rather than have the load forwarded directly from the write buffer or L1 cache. The reason fences are expensive is exactly that they require system-wide agreement.

-Geert