So, a follow-up... For those on the list: Anthony and I had a chat, and we agree that a better approach is to have all cpu_physical_memory_* accesses ordered in program order from the perspective of the VCPUs. Devices that have performance-critical accesses and want to do home-made ordering can use map/unmap.
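To make "ordered in program order" concrete, here is a minimal sketch of the two access styles under stated assumptions. Everything in it is hypothetical: `guest_ram`, `ordered_phys_write()`, `ordered_phys_read()` and `map_guest()` stand in for cpu_physical_memory_rw() and the map/unmap path, and the C11 `atomic_thread_fence(memory_order_seq_cst)` is just one portable way to spell a full barrier, not necessarily what QEMU would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for guest RAM; not a QEMU data structure. */
#define GUEST_RAM_SIZE 4096u
static uint8_t guest_ram[GUEST_RAM_SIZE];

/* Ordered copy accessors (hypothetical analogues of cpu_physical_memory_rw()).
 * The full fence before each copy keeps it from being reordered ahead of this
 * thread's earlier memory accesses, so a sequence of calls is ordered in
 * program order. This is a hardware-level sketch: the observing VCPU side
 * still needs its own barriers, and bytes within a single memcpy() are not
 * ordered relative to each other. */
static void ordered_phys_write(uint64_t addr, const void *buf, size_t len)
{
    assert(addr + len <= GUEST_RAM_SIZE);
    atomic_thread_fence(memory_order_seq_cst);
    memcpy(&guest_ram[addr], buf, len);
}

static void ordered_phys_read(uint64_t addr, void *buf, size_t len)
{
    assert(addr + len <= GUEST_RAM_SIZE);
    atomic_thread_fence(memory_order_seq_cst);
    memcpy(buf, &guest_ram[addr], len);
}

/* Map-style access (hypothetical analogue of map/unmap): no implicit fence.
 * A performance-critical device copies through the returned pointer and
 * issues its own fences only where its protocol needs them. */
static void *map_guest(uint64_t addr, size_t len)
{
    assert(addr + len <= GUEST_RAM_SIZE);
    return &guest_ram[addr];
}
```

In this sketch a device that writes a payload and then a "valid" flag through `ordered_phys_write()` gets the payload ordered before the flag for free, whereas a device using `map_guest()` would have to place that fence itself between the two stores.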
Now, looking at the code, however, there seems to be a lot of duplication: cpu_physical_memory_rw() is an obvious place to add a barrier, but what about all of the ldl_*, ldq_*, etc.? In fact there are about 45 different ways code can dig into guest memory; should they all be made ordered?

At this point, it might be easier to just stick a barrier in qemu_get_ram_ptr(), which seems to be called by everybody. However, that means things like cpu_physical_memory_rw() will end up hitting the barrier for every page. It's safe, but it might be a performance hit (measurable? I can give it a try; probably not). Or we can just sprinkle the barrier everywhere; mostly it's going to be in exec.c, in all the "ram" cases in ld*_* and st*_*.

Also, should I make the barrier conditional on kvm_enabled()? I.e., it's pointless in full emulation and might actually be a performance hit on something already quite slow...

Cheers,
Ben.
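The qemu_get_ram_ptr() option and the kvm_enabled() question can be sketched as follows, purely to show the cost model. All names here are hypothetical: `get_ram_ptr_ordered()` plays the role of qemu_get_ram_ptr(), `phys_rw()` and `ld32()` play cpu_physical_memory_rw() and ldl_phys(), `hw_accel_enabled` stands in for kvm_enabled(), and `barrier_count` exists only to make the per-page cost visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define RAM_PAGES 8u

static uint8_t ram[RAM_PAGES * PAGE_SIZE];   /* hypothetical guest RAM */
static bool hw_accel_enabled = true;         /* hypothetical kvm_enabled() stand-in */
static unsigned long barrier_count;          /* instrumentation for this sketch only */

/* Central lookup helper: every accessor goes through it, so one barrier here
 * covers all access paths at once. Skipped when hardware acceleration is
 * off, since the barrier is argued to be pointless in full emulation. */
static void *get_ram_ptr_ordered(uint64_t addr)
{
    assert(addr < sizeof ram);
    if (hw_accel_enabled) {
        atomic_thread_fence(memory_order_seq_cst);
        barrier_count++;
    }
    return &ram[addr];
}

/* Copy-style accessor: splits the request at page boundaries and looks up
 * each page separately, so it pays one barrier per page touched. */
static void phys_rw(uint64_t addr, uint8_t *buf, size_t len, bool is_write)
{
    while (len > 0) {
        size_t chunk = PAGE_SIZE - (size_t)(addr % PAGE_SIZE);
        if (chunk > len) {
            chunk = len;
        }
        uint8_t *p = get_ram_ptr_ordered(addr);
        if (is_write) {
            memcpy(p, buf, chunk);
        } else {
            memcpy(buf, p, chunk);
        }
        addr += chunk;
        buf += chunk;
        len -= chunk;
    }
}

/* Word-sized accessor: a single lookup, hence a single barrier. */
static uint32_t ld32(uint64_t addr)
{
    uint32_t v;
    assert(addr + sizeof v <= sizeof ram);
    memcpy(&v, get_ram_ptr_ordered(addr), sizeof v);
    return v;
}
```

With this layout, a copy that starts 8 bytes before a page boundary and spans two more full pages touches four pages and pays four barriers, while `ld32()` pays one. Sprinkling the barrier into the accessors instead would let the copy loop issue a single barrier per call, at the price of touching every ld*_*/st*_* RAM path.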