----- Original Message ----- > From: "Alex Williamson" <alex.william...@redhat.com> > To: qemu-devel@nongnu.org > Cc: pbonz...@redhat.com, "thorsten kohfeldt" <thorsten.kohfe...@gmx.de> > Sent: Friday, October 21, 2016 7:11:44 PM > Subject: [RFC PATCH] memory: Don't use memcpy for ram marked as skip_dump > > With a vfio assigned device we lay down a base MemoryRegion registered > as an IO region, giving us read & write accessors. If the region > supports mmap, we lay down a higher priority sub-region MemoryRegion > on top of the base layer initialized as a RAM pointer to the mmap. > Finally, if we have any quirks for the device (ie. address ranges that > need additional virtualization support), we put another IO sub-region > on top of the mmap MemoryRegion. When this is flattened, we now > potentially have sub-page mmap MemoryRegions exposed which cannot be > directly mapped through KVM. > > This is as expected, but a subtle detail of this is that we end up > with two different access mechanisms through QEMU. If we disable the > mmap MemoryRegion, we make use of the IO MemoryRegion and service > accesses using pread and pwrite to the vfio device file descriptor. > If the mmap MemoryRegion is enabled and we end up in one of these > sub-page gaps, QEMU handles the access as RAM, using memcpy to the > mmap. Using the mmap through QEMU is a subtle difference, but it's > fine, the problem is the memcpy. My assumption is that memcpy makes > no guarantees about access width and potentially uses all sorts of > optimized memory transfers that are not intended for talking to device > MMIO. It turns out that this has been a problem for Realtek NIC > assignment, which has such a quirk that creates a sub-page mmap > MemoryRegion access. > > My proposal to fix this is to leverage the skip_dump flag that we > already use for special handling of these device-backed MMIO ranges. > When skip_dump is set for a MemoryRegion, we mark memory access as > non-direct and automatically insert MemoryRegionOps with basic > semantics to handle accesses. Note that we only enable dword > accesses because some devices don't particularly like qword accesses > (Realtek NICs are such a device). This actually also fixes memory > inspection via the xp command in the QEMU monitor as well. > > Please comment. Is this the best way to solve this problem? Thanks
Looks good to me. Paolo > > Reported-by: Thorsten Kohfeldt <thorsten.kohfe...@gmx.de> > Signed-off-by: Alex Williamson <alex.william...@redhat.com> > --- > include/exec/memory.h | 6 ++++-- > memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 48 insertions(+), 2 deletions(-) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 10d7eac..a4c3acf 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1464,9 +1464,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t > addr); > static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write) > { > if (is_write) { > - return memory_region_is_ram(mr) && !mr->readonly; > + return memory_region_is_ram(mr) && > + !mr->readonly && !memory_region_is_skip_dump(mr); > } else { > - return memory_region_is_ram(mr) || memory_region_is_romd(mr); > + return (memory_region_is_ram(mr) && !memory_region_is_skip_dump(mr)) > || > + memory_region_is_romd(mr); > } > } > > diff --git a/memory.c b/memory.c > index 58f9269..7ed7ca9 100644 > --- a/memory.c > +++ b/memory.c > @@ -1136,6 +1136,46 @@ const MemoryRegionOps unassigned_mem_ops = { > .endianness = DEVICE_NATIVE_ENDIAN, > }; > > +static uint64_t skip_dump_mem_read(void *opaque, hwaddr addr, unsigned size) > +{ > + uint64_t val = (uint64_t)~0; > + > + switch (size) { > + case 1: > + val = *(uint8_t *)(opaque + addr); > + break; > + case 2: > + val = *(uint16_t *)(opaque + addr); > + break; > + case 4: > + val = *(uint32_t *)(opaque + addr); > + break; > + } > + > + return val; > +} > + > +static void skip_dump_mem_write(void *opaque, hwaddr addr, uint64_t data, > unsigned size) > +{ > + switch (size) { > + case 1: > + *(uint8_t *)(opaque + addr) = (uint8_t)data; > + break; > + case 2: > + *(uint16_t *)(opaque + addr) = (uint16_t)data; > + break; > + case 4: > + *(uint32_t *)(opaque + addr) = (uint32_t)data; > + break; > + } > +} > + > +const MemoryRegionOps skip_dump_mem_ops = { > + .read = skip_dump_mem_read, > + .write = skip_dump_mem_write, > + .endianness = DEVICE_NATIVE_ENDIAN, > +}; > + > bool memory_region_access_valid(MemoryRegion *mr, > hwaddr addr, > unsigned size, > @@ -1366,6 +1406,10 @@ void memory_region_init_ram_ptr(MemoryRegion *mr, > void memory_region_set_skip_dump(MemoryRegion *mr) > { > mr->skip_dump = true; > + if (mr->ram && mr->ops == &unassigned_mem_ops) { > + mr->ops = &skip_dump_mem_ops; > + mr->opaque = mr->ram_block->host; > + } > } > > void memory_region_init_alias(MemoryRegion *mr, > >