On 02/10/19 16:58, Torvald Riegel wrote: > This example looks like Dekker synchronization (if I get the intent right).
It is the same pattern. However, one of the two synchronized variables is a counter rather than just a flag. > Two possible implementations of this are either (1) with all memory > accesses having seq-cst MO, or (2) with relaxed-MO accesses and seq-cst > fences on between the store and load on both ends. It's possible to mix > both, but that get's trickier I think. I'd prefer the one with just > fences, just because it's easiest, conceptually. Got it. I'd also prefer the one with just fences, because we only really control one side of the synchronization primitive (ctx_notify_me in my litmus test) and I don't like the idea of forcing seq-cst MO on the other side (bh_scheduled). The performance issue that I mentioned is that x86 doesn't have relaxed fetch and add, so you'd have a redundant fence like this: lock xaddl $2, mem1 mfence ... movl mem1, %r8 (Gory QEMU details however allow us to use relaxed load and store here, because there's only one writer). > It works if you use (1) or (2) consistently. cppmem and the Batty et al. > tech report should give you the gory details. > >> 1) understand why ATOMIC_SEQ_CST is not enough in this case. QEMU code >> seems to be making the same assumptions as Linux about the memory model, >> and this is wrong because QEMU uses C11 atomics if available. >> Fortunately, this kind of synchronization in QEMU is relatively rare and >> only this particular bit seems affected. If there is a fix which stays >> within the C11 memory model, and does not pessimize code on x86, we can >> use it[1] and document the pitfall. > > Using the fences between the store/load pairs in Dekker-like > synchronization should do that, right? It's also relatively easy to deal > with. > >> 2) if there's no way to fix the bug, qemu/atomic.h needs to switch to >> __sync_fetch_and_add and friends. And again, in this case the >> difference between the C11 and Linux/QEMU memory models must be documented. > > I surely not aware of all the constraints here, but I'd be surprised if the > C11 memory model isn't good enough for portable synchronization code (with > the exception of the consume MO minefield, perhaps). This helps a lot already; I'll work on a documentation and code patch. Thanks very much. Paolo >> int main() { >> atomic_int ctx_notify_me = 0; >> atomic_int bh_scheduled = 0; >> {{{ { >> bh_scheduled.store(1, mo_release); >> atomic_thread_fence(mo_seq_cst); >> // must be zero since the bug report shows no notification >> ctx_notify_me.load(mo_relaxed).readsvalue(0); >> } >> ||| { >> ctx_notify_me.store(2, mo_seq_cst); >> r2=bh_scheduled.load(mo_relaxed); >> } >> }}}; >> return 0; >> }