On Fri, 2012-05-18 at 10:57 +0200, Paolo Bonzini wrote:
> > I'm still tempted to add barriers in map and unmap as well in the case
> > where they don't bounce to provide consistent semantics here, ie, all
> > accesses done between the map and unmap are ordered vs all previous and
> > subsequent accesses. Ok with that ?

[...]

> > I will not add barriers to the various ld*/st* variants.
>
> In theory you would need a memory barrier before the first ld/st and one
> after the last... considering virtio uses map/unmap, what about leaving
> map/unmap and ld*_phys/st*_phys as the high performance unsafe API?
> Then you can add barriers around ld*_pci_dma/st*_pci_dma.
So no, my idea is to make anybody using ld_* and st_* (non _dma) responsible
for their own barriers. The _dma variants are implemented in terms of
cpu_physical_memory_rw, so they should inherit the barriers.

As for map/unmap, there's an inconsistency, since when it falls back to
bounce buffering it will get implicit barriers. My idea was to always put a
barrier before, see below.

> >> x86 -> Load Seq_Cst:  mov or mfence; mov
> >>        Store Seq Cst:  mov; mfence or mov
> >>
> >> ARM -> Load Seq Cst:  ldr; dmb or dmb; ldr; dmb
> >>        Store Seq Cst:  dmb; str; dmb or dmb; str
> >>
> >> PPC -> Load Seq Cst:  sync; ld; cmp; bc; isync
> >>        Store Seq Cst:  sync; st
> >>
> >> where cmp; bc; isync can be replaced by sync.
> >
> > Hrm, the cmp/bc/isync can be -very- expensive, we use a variant of that
> > using twi to enforce complete execution of reads in our readX()
> > accessors in the kernel but I don't think I want to do that in qemu.
>
> Ah, ok, thanks for explaining what cmp;bc;isync really is. :)
>
> > The full sync should provide all the synchronization we need
>
> You mean "sync; ld; sync" for load and "sync; st" for store? That would
> do, yes.

No, just "sync; ld". That should be enough. Only if the device needs
additional synchronization against the guest accessing directly mapped
device memory would it need more, and that's something the device can deal
with explicitly if it ever comes up.

IE. If I put a barrier "before" in cpu_physical_memory_rw, I ensure ordering
vs. all previous accesses. Anything using the low-level ld/st accessors is
responsible for its own barriers: virtio, for example, since it
intentionally bypasses the dma/iommu stuff.

As for map/unmap, the idea is to add a barrier in map() as well in the
non-bounce case (maybe not in unmap, to keep the "before" semantic). This
keeps the semantic of map/unmap as a "whole" being ordered, which makes
sense. IE. They are "high performance" in that there is no barrier between
individual accesses within the map/unmap sequence itself, which is good (the
device is responsible for that if needed), but ordering the whole block vs.
previous accesses makes sense.

That means that most users (like block devices) don't actually need to
bother: they use map/unmap for AIO, the barrier in map provides
synchronization with previous descriptor accesses, and the barrier in
cpu_physical_memory_rw orders the transfer vs. subsequent descriptor
updates. (Assuming the transfer contains actual CPU stores, which it can;
if it ends up being real DMA under the hood, then it's already ordered by
the host kernel driver.)

Anyway, I'll post a patch on Monday my time and we can discuss further; a
rough sketch of what I have in mind is below.

Cheers,
Ben.
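
P.S. Something like this, very roughly. It's a sketch only, not the actual
patch: the signatures are abbreviated from the current tree, the elided
bodies are marked with comments, and smp_mb() stands in for whatever
full-barrier primitive we end up agreeing on:

void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                            int len, int is_write)
{
    /* "Before" barrier: order this transfer vs. all previous accesses
     * (e.g. descriptor reads/writes done through map/unmap or the
     * ld/st accessors). */
    smp_mb();

    /* ... existing copy loop, unchanged ... */
}

void *cpu_physical_memory_map(target_phys_addr_t addr,
                              target_phys_addr_t *plen, int is_write)
{
    /* Same "before" semantic in the direct-map case, so map/unmap is
     * ordered vs. previous accesses whether or not it bounces.  No
     * barriers between individual accesses inside the mapping, and
     * none in unmap. */
    smp_mb();

    /* ... existing direct-map / bounce-buffer logic, unchanged ... */
}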