On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:

> Let's simplify it slightly. The compiler can optimize away x=1 and x=3 as
> dead stores (even valid on atomics!), leaving us with 2 modification orders:
> 2,4 or 4,2. And what you are getting at is you don't think we should ever
> see r1==2, r2==4 and r3==4, r4==2.

Right, I agree that the compiler can optimize away both the double writes and the double reads.

> Let's say the order of the writes turns out to be 2,4... is it possible for
> both writes to be travelling around some bus and have thread 4 actually
> read the second one first, followed by the first one? It would imply a lack
> of memory coherency in the system, wouldn't it? My simple understanding is
> that the hardware gives us this sort of minimum guarantee on all shared
> memory, which means we should never see that happen.
No, it is possible, and actually likely. Basically, the issue is write buffers. The coherency mechanisms come into play at a lower level in the hierarchy (typically at the last-level cache), which is why we need fences in the first place to implement things like spin locks.

Threads running on the same CPU may share the same caches (think about a thread switch or hyper-threading). Now both processors may have a copy of the same cache line and both try to write to some location in that line. Then they'll both try to get exclusive access to the cache line. One CPU will succeed; the other will take a cache miss. However, while all this is going on, the write is just sitting in a write buffer, and any references from the same processor will simply be forwarded the value of the outstanding write.

> And if we can't see that, then I don't see how we can see your example...
> *one* of those modification orders has to be what is actually written to x,
> and reads from that memory location will not be able to see anything else.
> (i.e., if it was 1,2,3,4 then thread 4 would not be able to see r3==4,
> r4==1 thanks to memory coherency.)

No, that's false. Even on systems with nice memory models, such as x86 and SPARC with a TSO model, you need a fence to force a write followed by a load of the same location to go all the way to coherent memory, rather than have the load forwarded directly from the write buffer or L1 cache. The reason fences are expensive is exactly that they require system-wide agreement.

-Geert