On Fri, 2012-05-18 at 08:53 +0200, Paolo Bonzini wrote:
> It depends on what semantics you attach to dma_mb. In my opinion,
> having a separate barrier for DMA is wrong, because you want the same
> semantics on all architectures.
>
> The x86 requirements are roughly as follows:
>
> 1) it never needs explicit rmb and wmb (as long as you don't use
> non-temporal stores etc.);
>
> 2) synchronized operations have an implicit mb before and after (unlike
> LL/SC on PowerPC).
>
> 3) everywhere else, you need an mb.
>
> So, on x86 you have more or less an implicit wmb after each write and an
> implicit rmb before a read. This explains why these kind of bugs are
> very hard to see on x86 (or often impossible to see).
So what you mean is that on x86, a read can pass a write (and
vice-versa)? Interesting... I didn't know that.

> Adding these in
> cpu_physical_memory_rw has the advantage that x86 performance is not
> affected, but it would not cover the case of a device model doing a DMA
> read after a DMA write. Then the device model would need to issue a
> smp_mb manually, on all architectures. I think this is too brittle.

Ok, agreed. I'll do a new patch on Monday (I'm off for the week-end)
that does that, using the existing smp_mb in cpu_physical_memory_rw().

I'm still tempted to add barriers in map and unmap as well, in the case
where they don't bounce, to provide consistent semantics here, ie, all
accesses done between the map and unmap are ordered vs. all previous and
subsequent accesses. Ok with that?

I will not add barriers to the various ld*/st* variants.

> > If not (I'm trying to figure out why exactly does x86 have a barrier in
> > the first place and when it's in order), then I might add a new barrier
> > type in qemu-barriers.h, something like dma_mb(), and define it as a nop
> > on x86, a lwsync or sync (still thinking about it) on ppc, and
> > __sync_synchronize() on unknown archs.
>
> I don't think it is correct to think of this in terms of low-level
> operations such as sync/lwsync. Rather, I think what we want is
> sequentially consistent accesses; it's heavyweight, but you cannot go
> wrong. http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html says this
> is how you do it:
>
> x86 -> Load Seq_Cst:  mov           or  mfence; mov
>        Store Seq Cst: mov; mfence   or  mov
>
> ARM -> Load Seq Cst:  ldr; dmb      or  dmb; ldr; dmb
>        Store Seq Cst: dmb; str; dmb or  dmb; str
>
> PPC -> Load Seq Cst:  sync; ld; cmp; bc; isync
>        Store Seq Cst: sync; st
>
> where cmp; bc; isync can be replaced by sync.
Hrm, the cmp/bc/isync sequence can be -very- expensive. We use a variant
of it (using twi) to enforce complete execution of reads in our readX()
accessors in the kernel, but I don't think I want to do that in qemu.

The full sync should provide all the synchronization we need; the read
trick is really only meant, in the kernel, to enforce timings (ie, for a
read followed by a delay, it makes sure that the delay only starts
"counting" after the read has completed, which is useful when talking to
real HW that might have specific timing requirements).

> and says "As far as the memory model is concerned, the ARM processor is
> broadly similar to PowerPC, differing mainly in having a DMB barrier
> (analogous to the PowerPC sync in its programmer-observable behaviour
> for normal memory) and no analogue of the PowerPC lwsync".
>
> So one of the two ARM mappings, with smp_mb instead of dmb, is what we
> want in cpu_physical_memory_rw. Device models that want to do better
> can just use cpu_physical_memory_map, or we can add a
> cpu_physical_memory_rw_direct for them.

Cheers,
Ben.