On 18/09/2013 13:56, Peter Maydell wrote:
>> > But does guest code actually care?  In many cases, I suspect that
>> > sticking a smp_rmb() in the read side of "unlocked" register accesses,
>> > and a smp_wmb() in the write side, will do just fine.  And add a
>> > compatibility property to place a device back under the BQL for guests
>> > that have problems.
> Yuck. This sounds like a recipe for spending the next five years
> debugging subtle race conditions. We need to continue to support
> the semantics that the architecture and hardware specs define
> for memory access orderings to emulated devices.
We cannot in the general case; QEMU is not a cycle-exact simulator. You need to look at the particular case.

And if you look at particular cases, you'll find many that are already broken now. For example, we already have no such guarantee for RAM BARs when running under KVM, because accesses do not go through QEMU and are not serialized by the BQL.

Or you could have a device with an MSI vector, program it to write to RAM, and poll the RAM location from the guest. Such a write would currently not be ordered with previous DMA from the device, which contradicts the PCI spec. (This is a bug and can be fixed.)

address_space_map/unmap pretty much breaks any DMA that is concurrent with control register access (e.g. via the PCI command register).

And all these cases are already there! Moving devices outside the BQL of course generates more of them. But it's not like everything is broken.

For example, ordering of memory accesses to one emulated device from one CPU is handled naturally (in either TCG or KVM mode). Ordering of accesses from a CPU with those from the QEMU data-plane code is also handled simply, with locks or memory barriers private to the device.

With multiple VCPUs operating at the same time (e.g. the send path of a network driver running on one VCPU, with the interrupts processed on another VCPU), the activities are likely not independent, and the guest is doing its own synchronization anyway. It's more likely that they use a lock, but they can even do Dekker-style synchronization using MMIO registers, and it will just work as long as the MMIO read/write ops use atomic_mb_read/atomic_mb_set (i.e. as long as the bus ordering guarantees are implemented locally to the device).

There's nothing magic, really. Both PV and real devices have been doing it forever by placing some registers in RAM instead of MMIO, and communicating synchronization points via interrupts and doorbell registers.

But above all, devices have to request BQL-free MMIO explicitly.
You do not have to use it at all; you can just use all the infrastructure to do unlocked bus-master DMA (which is already broken from the ordering point of view anyway). You can limit BQL-free MMIO to PV devices, to extremely simple devices, or to one or two highly optimized registers. There is a huge gamut of choices, and no magic, really.

Paolo