RFC[0] -> RFCv2:

* At Igor's suggestion in one of the patches I reworked the series entirely.
  More or less as he was thinking, it is far simpler to relocate the
  ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler
  and less intrusive. (patch 1 & 2)
* Check phys-bits is big enough prior to relocating (new patch 3)
* Remove the machine property; it is now internal only and set by the new
  machine version (Igor, patch 4).
* Clarify whether it's GPA or HPA, to make the meaning clearer (Igor, patch 2)
* Add IOMMU SDM in the commit message (Igor, patch 2)

Note: It still makes me a tiny bit uncomfortable to just remove memory from
the [4G - 1010G] range, but that concern is a little baseless. It's definitely
a lot better to maintain this set given its simplicity. As for the long-term
ideas proposed here, perhaps Igor's pc-dimm based model idea, or Alex's
equivalent suggestion of an option to control reserved address ranges, could
enable adjusting the 1TB hole to be closer to baremetal.

The one downside of this approach is CMOS losing its meaning for the above-4G
RAM blocks, but it was mentioned over the RFC that CMOS is only useful for very
old SeaBIOS. If so, either I leave it as is, or perhaps folks prefer that I
just set the RAM above 4G in CMOS to 0.

[0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.mart...@oracle.com/

---

This series lets QEMU properly spawn i386 guests with >= 1010G with VFIO,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid
and returns -EINVAL when it is not. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
here: FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, on what those addresses mean.

VFIO DMA_MAP calls in this IOVA address range fail this check and hence
return -EINVAL, consequently failing the creation of guests bigger than
1010G.

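To make the reserved-range condition concrete, here is a minimal sketch of an
overlap test against the HyperTransport range quoted above. It is not the
kernel's or QEMU's code: the macro names and the helper
iova_range_hits_ht_hole() are hypothetical, and the real VFIO check validates
against the IOMMU's list of valid IOVA ranges rather than one hard-coded hole.

```c
#include <stdbool.h>
#include <stdint.h>

/* HyperTransport reserved range from the cover letter (inclusive bounds).
 * The macro names are made up for this sketch. */
#define HT_HOLE_START 0xFD00000000ULL
#define HT_HOLE_END   0xFFFFFFFFFFULL

/* Hypothetical helper: returns true if the mapping [iova, iova + size)
 * touches any address inside the reserved HT range. */
static bool iova_range_hits_ht_hole(uint64_t iova, uint64_t size)
{
    uint64_t last;

    if (size == 0) {
        return false;           /* an empty mapping touches nothing */
    }

    /* Last byte of the mapping, clamped so the addition cannot wrap. */
    if (size - 1 > UINT64_MAX - iova) {
        last = UINT64_MAX;
    } else {
        last = iova + size - 1;
    }

    /* Standard overlap test between two inclusive intervals. */
    return iova <= HT_HOLE_END && last >= HT_HOLE_START;
}
```

Under these assumptions, a mapping that starts at 4 GiB (0x100000000) with
size 0xff30000000 reaches past FD_0000_0000h and is flagged, while a mapping
that starts at 1 TiB is not.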
Example of the failure reported by QEMU:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1:
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

Prior to v5.4, we could map to these IOVAs *but* that's still not the right
thing to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST),
or spurious guest VF failures from the resultant IOMMU target abort (see
Errata 1155 [2]), as documented in the links down below.

This small series tries to address that by dealing with this AMD-specific 1TB
hole, but rather than dealing with it like the 4G hole, it instead relocates
RAM above 4G to be above the 1T if the maximum RAM range crosses the HT
reserved range. It is organized as follows:

patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as the
         starting address of the 4G boundary

patch 2: Change @above_4g_mem_start to 1TiB if we are on AMD and the max
         possible address crosses the HT region.

patch 3: Warn the user if phys-bits is too low

patch 4: Ensure valid IOVAs only on new machine types, but not older ones
         (<= v6.2.0)

The 'consequence' of this approach is that we may need more than the default
phys-bits, e.g. a guest with >1010G will have most of its RAM after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40
(TCG_PHYS_BITS). Today there's already a precedent to depend on the user to
pick the right value of phys-bits (regardless of this series), so we warn in
case phys-bits aren't enough. Additionally, the reserved region is added to
E820 if the relocation is done.

Alternative options considered (RFCv1):

a) Dealing with the 1T hole like the 4G hole -- which also represents what
   hardware closely does.

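For reference, the relocation rule of patches 1-2 and the phys-bits
consequence can be sketched as below. This is a simplified illustration, not
the code in the patches: pick_above_4g_mem_start() and phys_bits_needed() are
hypothetical names, and the sketch only considers RAM above 4G, ignoring
anything else that may raise the maximum possible address.

```c
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/* Start of the HyperTransport reserved range (FD_0000_0000h). */
#define AMD_HT_START 0xFD00000000ULL

/* Hypothetical helper: choose the guest physical address where RAM above
 * 4G starts. Defaults to 4 GiB, and moves to 1 TiB when the host is AMD and
 * the last RAM address would reach into the HT reserved range. */
static uint64_t pick_above_4g_mem_start(bool is_amd, uint64_t above_4g_mem_size)
{
    uint64_t start = 4 * GiB;

    if (above_4g_mem_size == 0) {
        return start;           /* no RAM above 4G, nothing to relocate */
    }

    /* Highest address used by RAM if it stays at the default start. */
    uint64_t max_addr = start + above_4g_mem_size - 1;

    if (is_amd && max_addr >= AMD_HT_START) {
        start = 1 * TiB;
    }
    return start;
}

/* Hypothetical helper: number of physical address bits needed to
 * represent max_addr. */
static unsigned phys_bits_needed(uint64_t max_addr)
{
    unsigned bits = 0;

    while (max_addr != 0) {
        bits++;
        max_addr >>= 1;
    }
    return bits;
}
```

With 1020 GiB of RAM above 4G on AMD, the start moves to 1 TiB and the last
address is 1 TiB + 1020 GiB - 1, which needs 41 bits, matching the
consequence described above. With 1008 GiB above 4G, the last address is just
below FD_0000_0000h and nothing is relocated.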
Thanks,
	Joao

[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Joao Martins (4):
  hw/i386: add 4g boundary start to X86MachineState
  i386/pc: relocate 4g start to 1T where applicable
  i386/pc: warn if phys-bits is too low
  i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine type

 hw/i386/acpi-build.c  |  2 +-
 hw/i386/pc.c          | 87 +++++++++++++++++++++++++++++++++++++++++--
 hw/i386/pc_piix.c     |  2 +
 hw/i386/pc_q35.c      |  2 +
 hw/i386/sgx.c         |  2 +-
 hw/i386/x86.c         |  1 +
 include/hw/i386/pc.h  |  1 +
 include/hw/i386/x86.h |  3 ++
 target/i386/cpu.h     |  4 ++
 9 files changed, 98 insertions(+), 6 deletions(-)

-- 
2.17.2