With a vfio assigned device we lay down a base MemoryRegion registered as an IO region, giving us read & write accessors. If the region supports mmap, we lay down a higher priority sub-region MemoryRegion on top of the base layer initialized as a RAM pointer to the mmap. Finally, if we have any quirks for the device (ie. address ranges that need additional virtualization support), we put another IO sub-region on top of the mmap MemoryRegion. When this is flattened, we now potentially have sub-page mmap MemoryRegions exposed which cannot be directly mapped through KVM.
This is as expected, but a subtle detail of this is that we end up with two different access mechanisms through QEMU. If we disable the mmap MemoryRegion, we make use of the IO MemoryRegion and service accesses using pread and pwrite to the vfio device file descriptor. If the mmap MemoryRegion is enabled and we end up in one of these sub-page gaps, QEMU handles the access as RAM, using memcpy to the mmap. Using the mmap through QEMU is a subtle difference, but it's fine, the problem is the memcpy. My assumption is that memcpy makes no guarantees about access width and potentially uses all sorts of optimized memory transfers that are not intended for talking to device MMIO. It turns out that this has been a problem for Realtek NIC assignment, which has such a quirk that creates a sub-page mmap MemoryRegion access. My proposal to fix this is to leverage the skip_dump flag that we already use for special handling of these device-backed MMIO ranges. When skip_dump is set for a MemoryRegion, we mark memory access as non-direct and automatically insert MemoryRegionOps with basic semantics to handle accesses. Note that we only enable dword accesses because some devices don't particularly like qword accesses (Realtek NICs are such a device). This actually also fixes memory inspection via the xp command in the QEMU monitor as well. Please comment. Is this the best way to solve this problem? Thanks Reported-by: Thorsten Kohfeldt <thorsten.kohfe...@gmx.de> Signed-off-by: Alex Williamson <alex.william...@redhat.com> --- include/exec/memory.h | 6 ++++-- memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index 10d7eac..a4c3acf 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1464,9 +1464,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr); static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write) { if (is_write) { - return memory_region_is_ram(mr) && !mr->readonly; + return memory_region_is_ram(mr) && + !mr->readonly && !memory_region_is_skip_dump(mr); } else { - return memory_region_is_ram(mr) || memory_region_is_romd(mr); + return (memory_region_is_ram(mr) && !memory_region_is_skip_dump(mr)) || + memory_region_is_romd(mr); } } diff --git a/memory.c b/memory.c index 58f9269..7ed7ca9 100644 --- a/memory.c +++ b/memory.c @@ -1136,6 +1136,46 @@ const MemoryRegionOps unassigned_mem_ops = { .endianness = DEVICE_NATIVE_ENDIAN, }; +static uint64_t skip_dump_mem_read(void *opaque, hwaddr addr, unsigned size) +{ + uint64_t val = (uint64_t)~0; + + switch (size) { + case 1: + val = *(uint8_t *)(opaque + addr); + break; + case 2: + val = *(uint16_t *)(opaque + addr); + break; + case 4: + val = *(uint32_t *)(opaque + addr); + break; + } + + return val; +} + +static void skip_dump_mem_write(void *opaque, hwaddr addr, uint64_t data, unsigned size) +{ + switch (size) { + case 1: + *(uint8_t *)(opaque + addr) = (uint8_t)data; + break; + case 2: + *(uint16_t *)(opaque + addr) = (uint16_t)data; + break; + case 4: + *(uint32_t *)(opaque + addr) = (uint32_t)data; + break; + } +} + +const MemoryRegionOps skip_dump_mem_ops = { + .read = skip_dump_mem_read, + .write = skip_dump_mem_write, + .endianness = DEVICE_NATIVE_ENDIAN, +}; + bool memory_region_access_valid(MemoryRegion *mr, hwaddr addr, unsigned size, @@ -1366,6 +1406,10 @@ void memory_region_init_ram_ptr(MemoryRegion *mr, void memory_region_set_skip_dump(MemoryRegion *mr) { mr->skip_dump = true; + if (mr->ram && mr->ops == &unassigned_mem_ops) { + mr->ops = &skip_dump_mem_ops; + mr->opaque = mr->ram_block->host; + } } void memory_region_init_alias(MemoryRegion *mr,