Hi *,

this came to my mind when browsing the sources in the patch's vicinity.
It is just a collection of thoughts, so please don't feel offended by how I phrased certain statements.

Questions:

- Is mr->opaque always unused? I.e. should we assert NULL before assignment?
- mr->ops vs. mr->iommu_ops: can we set mr->opaque if mr->iommu_ops is not NULL? Or should we even assert that mr->iommu_ops is NULL, because a skip_dump mr is not supposed to be addr-translated again?
- There is a _shared_ 'io_mem_unassigned' mr. Are we in danger of modifying it? Would that hurt?
- Are we generally switching mrops "back and forth", or is this a first?
- Can we afford not to implement size 8, or should we rather force 8 -> 2*4 by setting specific mrop flags, if possible? Or just hard-code case 8: handle longword[1]; fall through to case 4? (See the sketch below.)
- When/where is memory_region_set_skip_dump() (supposed to be) called?

Recommendations:

- Add a comment in skip_dump_mem_read/write NOT to support 64-bit accesses, because an error will not be recognised unless specific HW is present (maybe even give examples of specific HW combinations).
- Add comments at more code locations that are subpage-break/mmap-sensitive. For example, should the default vfio slow-path mrops also not support 64-bit?
- Add a trace message for each mrop.

Additional patch suggestion(s):

During former investigations I found it not easy to identify the runtime active/current mrops per mr, so:

- Add .name to mr->ops/iommu_ops to be able to mon-list them together with mr names, OR
- (this calls the flag reuse/overlay into question) the skip_dump flag should rather get a sibling, so that (unnamed) ops can still be identified for listing?

But is this the only mr<->mrop ambiguity?
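A minimal sketch of the 8 -> 2*4 idea referenced above, assuming the memory core's adjusted-size handling applies here (my reading of access_with_adjusted_size() is that accesses wider than .impl.max_access_size get split into several implementation-sized accesses); the .valid/.impl values below are illustrative only and not part of the posted patch:

const MemoryRegionOps skip_dump_mem_ops = {
    .read = skip_dump_mem_read,
    .write = skip_dump_mem_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
    /* Sizes the guest may issue against this region. */
    .valid = {
        .min_access_size = 1,
        .max_access_size = 8,
    },
    /* Sizes the callbacks actually implement; if the core's splitting
     * applies, an 8-byte guest access is carried out as two 4-byte
     * accesses instead of silently returning ~0 or being dropped. */
    .impl = {
        .min_access_size = 1,
        .max_access_size = 4,
    },
};

That would keep the read/write switch statements at 1/2/4 while still giving 8-byte accesses defined behaviour.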
Regards,
Thorsten

On 21.10.2016 at 19:11, Alex Williamson wrote:

With a vfio assigned device we lay down a base MemoryRegion registered as an IO region, giving us read & write accessors. If the region supports mmap, we lay down a higher priority sub-region MemoryRegion on top of the base layer initialized as a RAM pointer to the mmap. Finally, if we have any quirks for the device (i.e. address ranges that need additional virtualization support), we put another IO sub-region on top of the mmap MemoryRegion.

When this is flattened, we now potentially have sub-page mmap MemoryRegions exposed which cannot be directly mapped through KVM. This is as expected, but a subtle detail of this is that we end up with two different access mechanisms through QEMU. If we disable the mmap MemoryRegion, we make use of the IO MemoryRegion and service accesses using pread and pwrite to the vfio device file descriptor. If the mmap MemoryRegion is enabled and we end up in one of these sub-page gaps, QEMU handles the access as RAM, using memcpy to the mmap. Using the mmap through QEMU is a subtle difference, but it's fine; the problem is the memcpy. My assumption is that memcpy makes no guarantees about access width and potentially uses all sorts of optimized memory transfers that are not intended for talking to device MMIO. It turns out that this has been a problem for Realtek NIC assignment, which has such a quirk that creates a sub-page mmap MemoryRegion access.

My proposal to fix this is to leverage the skip_dump flag that we already use for special handling of these device-backed MMIO ranges. When skip_dump is set for a MemoryRegion, we mark memory access as non-direct and automatically insert MemoryRegionOps with basic semantics to handle accesses. Note that we only enable dword accesses because some devices don't particularly like qword accesses (Realtek NICs are such a device). This actually fixes memory inspection via the xp command in the QEMU monitor as well.

Please comment. Is this the best way to solve this problem?
Thanks

Reported-by: Thorsten Kohfeldt <thorsten.kohfe...@gmx.de>
Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---
 include/exec/memory.h |  6 ++++--
 memory.c              | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 10d7eac..a4c3acf 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1464,9 +1464,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
 static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
 {
     if (is_write) {
-        return memory_region_is_ram(mr) && !mr->readonly;
+        return memory_region_is_ram(mr) &&
+               !mr->readonly && !memory_region_is_skip_dump(mr);
     } else {
-        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
+        return (memory_region_is_ram(mr) && !memory_region_is_skip_dump(mr)) ||
+               memory_region_is_romd(mr);
     }
 }
 
diff --git a/memory.c b/memory.c
index 58f9269..7ed7ca9 100644
--- a/memory.c
+++ b/memory.c
@@ -1136,6 +1136,46 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+static uint64_t skip_dump_mem_read(void *opaque, hwaddr addr, unsigned size)
+{
+    uint64_t val = (uint64_t)~0;
+
+    switch (size) {
+    case 1:
+        val = *(uint8_t *)(opaque + addr);
+        break;
+    case 2:
+        val = *(uint16_t *)(opaque + addr);
+        break;
+    case 4:
+        val = *(uint32_t *)(opaque + addr);
+        break;
+    }
+
+    return val;
+}
+
+static void skip_dump_mem_write(void *opaque, hwaddr addr, uint64_t data, unsigned size)
+{
+    switch (size) {
+    case 1:
+        *(uint8_t *)(opaque + addr) = (uint8_t)data;
+        break;
+    case 2:
+        *(uint16_t *)(opaque + addr) = (uint16_t)data;
+        break;
+    case 4:
+        *(uint32_t *)(opaque + addr) = (uint32_t)data;
+        break;
+    }
+}
+
+const MemoryRegionOps skip_dump_mem_ops = {
+    .read = skip_dump_mem_read,
+    .write = skip_dump_mem_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+};
+
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1366,6 +1406,10 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 void memory_region_set_skip_dump(MemoryRegion *mr)
 {
     mr->skip_dump = true;
+    if (mr->ram && mr->ops == &unassigned_mem_ops) {
+        mr->ops = &skip_dump_mem_ops;
+        mr->opaque = mr->ram_block->host;
+    }
 }
 
 void memory_region_init_alias(MemoryRegion *mr,
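Regarding the question above of when/where memory_region_set_skip_dump() is (supposed to be) called: with this patch it is the mmap'ed vfio sub-regions that matter. Below is a rough, self-contained sketch of the layering the cover letter describes and of where the flag would be set; apart from the memory API calls, every name, size and offset is made up for illustration (the real setup lives under hw/vfio/):

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Illustration only: the three layers described in the cover letter. */
static void setup_assigned_bar_sketch(Object *owner, MemoryRegion *bar,
                                      const MemoryRegionOps *slow_ops,
                                      const MemoryRegionOps *quirk_ops,
                                      void *dev, void *mmap_ptr,
                                      uint64_t bar_size)
{
    MemoryRegion *mmap_mr = g_new0(MemoryRegion, 1);
    MemoryRegion *quirk_mr = g_new0(MemoryRegion, 1);

    /* Base layer: slow-path I/O region, serviced by pread/pwrite on the
     * vfio device fd. */
    memory_region_init_io(bar, owner, slow_ops, dev, "bar-slow", bar_size);

    /* Higher-priority RAM-ptr sub-region backed by the mmap of the BAR.
     * This is where skip_dump is set; with the patch applied the flag also
     * installs skip_dump_mem_ops and makes accesses non-direct, so sub-page
     * fragments are no longer handled via memcpy. */
    memory_region_init_ram_ptr(mmap_mr, owner, "bar-mmap", bar_size, mmap_ptr);
    memory_region_set_skip_dump(mmap_mr);
    memory_region_add_subregion_overlap(bar, 0, mmap_mr, 1);

    /* Quirk: a small I/O window on top of the mmap region; this is what
     * leaves sub-page mmap fragments behind after flattening. */
    memory_region_init_io(quirk_mr, owner, quirk_ops, dev, "bar-quirk", 0x100);
    memory_region_add_subregion_overlap(bar, 0x40, quirk_mr, 2);
}

The point relevant to the question is the ordering: the flag is set right after memory_region_init_ram_ptr(), before the sub-region is added to the BAR container.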