Sorry for the long mail and for what's probably an FAQ. I did try to find an answer without bothering the list... (and showing my ignorance so much :-))
At the moment, the s390 backend treats all atomic loads as simple loads and only uses serialisation instructions for atomic stores. I just wanted to check whether this was really the right behaviour.

The architecture has strong memory-ordering semantics in which a CPU is only allowed to move a store after a later load; the other three combinations cannot happen. The current implementation seems fine from that point of view, because it means that a serialising instruction after a store is enough to prevent any reordering. However, page 5-126 of the architecture manual[*] says:

    Following is an example showing the effects of serialization. Location A initially contains FF hex.

        CPU 1               CPU 2

        MVI  A,X'00'     G  CLI  A,X'00'
        BCR  15,0           BNE  G

    The BCR 15,0 instruction executed by CPU 1 is a serializing instruction that ensures that the store by CPU 1 at location A is completed. However, CPU 2 may loop indefinitely, or until the next interruption on CPU 2, because CPU 2 may already have fetched from location A for every execution of the CLI instruction. A serializing instruction must be in the CPU-2 loop to ensure that CPU 2 will again fetch from location A.

Does the new C/C++ memory model allow that kind of infinite loop even for sequentially-consistent atomic loads? The draft text was:

    [29.3.3] There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens before” order and modification orders for all affected locations, such that each memory_order_seq_cst operation that loads a value observes either the last preceding modification according to this order S, or the result of an operation that is not memory_order_seq_cst.

but when I asked around, no one could see anything in the standard that prevents the total order from having an infinite sequence of loads between two stores. That feels like a cheat though.
:-)

Even if it isn't allowed, every CPU is going to get interrupted eventually, and I'm told that in practice all current implementations would see the store at some point. In that case it might come down to a quality-of-implementation question. Is it OK to leave out the serialisation anyway with a slightly vague guarantee like that?

Thanks,
Richard

[*] Available here FWIW:
    http://www-01.ibm.com/support/docview.wss?uid=isg2b9de5f05a9d57819852571c500428f9a
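
P.S. In case it helps to see the question in source form, I think the manual's example corresponds roughly to the C++ sketch below. This is only an illustration under my own assumptions: the names cpu1_store, cpu2_spin and run_example are made up, and the mapping of MVI/BCR and CLI/BNE onto seq_cst operations is my reading rather than anything the manual or the standard states. The question is whether cpu2_spin is guaranteed to return when the load is compiled as a plain load with no serialisation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// "CPU 1" in the manual's example: MVI A,X'00' followed by BCR 15,0.
// The seq_cst store is where the backend currently emits serialisation.
void cpu1_store(std::atomic<unsigned char>& a) {
    a.store(0x00, std::memory_order_seq_cst);
}

// "CPU 2" in the manual's example: the CLI A,X'00' / BNE G loop.
// Returns the number of iterations that still observed the old value.
unsigned long cpu2_spin(const std::atomic<unsigned char>& a) {
    unsigned long spins = 0;
    while (a.load(std::memory_order_seq_cst) != 0x00) {
        ++spins;
    }
    return spins;
}

// Hypothetical driver: location A starts at FF hex, one thread per "CPU".
// Returns the final value of A once both threads have finished.
unsigned char run_example() {
    std::atomic<unsigned char> a{0xFF};
    std::thread reader([&a] { cpu2_spin(a); });
    std::thread writer([&a] { cpu1_store(a); });
    writer.join();
    reader.join();  // only completes if the reader eventually sees the store
    unsigned char final_value = a.load(std::memory_order_seq_cst);
    assert(final_value == 0x00);
    return final_value;
}
```

Calling run_example() from main (and linking with thread support) should finish on any implementation where the reader eventually observes the store; an implementation that allowed the indefinite loop described in the manual would hang in reader.join() instead.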