On 5/20/22 11:45, Joao Martins wrote: > v4[5] -> v5: > * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin) > * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and > commit message; > > --- > > This series lets Qemu spawn i386 guests with >= 1010G with VFIO, > particularly when running on AMD systems with an IOMMU. > > Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid > and it > will return -EINVAL on those cases. On x86, Intel hosts aren't particularly > affected by this extra validation. But AMD systems with IOMMU have a hole in > the 1TB boundary which is *reserved* for HyperTransport I/O addresses located > here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically > section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean. > > VFIO DMA_MAP calls in this IOVA address range fall through this check and > hence return > -EINVAL, consequently failing the creation the guests bigger than 1010G. > Example > of the failure: > > qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: > VFIO_MAP_DMA: -22 > qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio > 0000:41:10.1: > failed to setup container for group 258: memory listener initialization > failed: > Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, > 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument) > > Prior to v5.4, we could map to these IOVAs *but* that's still not the right > thing > to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or > spurious guest VF failures from the resultant IOMMU target abort (see Errata > 1155[2]) > as documented on the links down below. > > This small series tries to address that by dealing with this AMD-specific 1Tb > hole, > but rather than dealing like the 4G hole, it instead relocates RAM above 4G > to be above the 1T if the maximum RAM range crosses the HT reserved range. > It is organized as following: > > patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting > address of the 4G boundary > > patches 2-3: Move pci-host qdev creation to be before pc_memory_init(), > to get accessing to pci_hole64_size. The actual pci-host > initialization is kept as is, only the qdev_new. > > patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max > possible address acrosses the HT region. Errors out if the phys-bits is too > low, which is only the case for >=1010G configurations or something that > crosses the HT region. > > patch 5: Ensure valid IOVAs only on new machine types, but not older > ones (<= v7.0.0) > > The 'consequence' of this approach is that we may need more than the default > phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB > address, consequently needing 41 phys-bits as opposed to the default of 40 > (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user > to > pick the right value of phys-bits (regardless of this series), so we warn in > case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above > 4G > ram blocks, but it was mentioned over RFC that CMOS is only useful for very > old seabios. > > Additionally, the reserved region is added to E820 if the relocation is done. > > Alternative options considered (in RFC[0]): > > a) Dealing with the 1T hole like the 4G hole -- which also represents what > hardware closely does. > > Thanks, > Joao >
Ping? > Older Changelog, > > v3[4] -> v4[5]: > (changes in patch 4 and 5 only) > * Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions > * Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin) > * Squash previous patch 5 into patch 4 to tie in the phys-bits check > into the relocate-4g-start logic: We now error out if the phys-bits > aren't enough on configurations that require above-4g ram relocation. > (Michael Tsirkin) > * Make the error message more explicit when phys-bits isn't enough to also > mention: "cannot avoid AMD HT range" > * Add comments inside x86_update_above_4g_mem_start() explaining the > logic behind it. (Michael Tsirkin) > * Tested on old guests old guests with Linux 2.6.32/3.10/4.14.35/4.1 based > kernels > alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael > Tsirkin) > Validated -numa topologies too as well as making sure qtests observe no > regressions; > > Notes from v4: > > * the machine attribute that enables this new logic (see last patch) > is called ::enforce_valid_iova since the RFC. Let me know if folks think it > is poorly named, and whether something a bit more obvious is preferred > (e.g. ::amd_relocate_1t). > > * @mst one of the comments you said was to add "host checks" in vdpa/vfio > devices. > In discussion with Alex and you over the last version of the patches it seems > that we weren't keen on making this device-specific or behind any machine > property flags (besides machine-compat). Just to reiterate there, making sure > we do > the above-4g relocation requiring properly sized phys-bits and AMD as vCPU > vendor (as this series) already ensures thtat this is going to be right for > offending configuration with VDPA/VFIO device that might be > configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices > start poking into machine-specific details when we fail to relocate due to the > lack of phys-bits? Otherwise Qemu, just doesn't have enough information to > tell > what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is > the one > that needs fixing (as VFIO did in v5.4). > > RFCv2[3] -> v3[4]: > > * Add missing brackets in single line statement, in patch 5 (David) > * Change ranges printf to use PRIx64, in patch 5 (David) > * Move the check to after changing above_4g_mem_start, in patch 5 (David) > * Make the check generic and move it to pc_memory_init rather being specific > to AMD, as the check is useful to capture invalid phys-bits > configs (patch 5, Igor). > * Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor) > * Consider pci_hole64_size in patch 4 (Igor) > * To consider pci_hole64_size in max used addr we need to get it from > pci-host, > so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") > or > qdev_new("q35") to be before pc_memory_init(). > * Consider sgx_epc.size in max used address, in patch 4 (Igor) > * Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor) > * Keep warn_report() in patch 5, as erroring out will break a few x86_64 > qtests > due to pci_hole64 accounting surprass phys-bits possible maxphysaddr. > > RFC[0] -> RFCv2[3]: > > * At Igor's suggestion in one of the patches I reworked the series enterily, > and more or less as he was thinking it is far simpler to relocate the > ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler, > and less intrusive. (patch 1 & 2) > * Check phys-bits is big enough prior to relocating (new patch 3) > * Remove the machine property, and it's only internal and set by new machine > version (Igor, patch 4). > * Clarify whether it's GPA or HPA as a more clear meaning (Igor, patch 2) > * Add IOMMU SDM in the commit message (Igor, patch 2) > > [0] > https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.mart...@oracle.com/ > [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf > [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf > [3] > https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.mart...@oracle.com/T/#u > [4] > https://lore.kernel.org/all/20220223184455.9057-1-joao.m.mart...@oracle.com/ > [5] > https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.mart...@oracle.com/ > > Joao Martins (5): > hw/i386: add 4g boundary start to X86MachineState > i386/pc: create pci-host qdev prior to pc_memory_init() > i386/pc: pass pci_hole64_size to pc_memory_init() > i386/pc: relocate 4g start to 1T where applicable > i386/pc: restrict AMD only enforcing of valid IOVAs to new machine > type > > hw/i386/acpi-build.c | 2 +- > hw/i386/pc.c | 126 +++++++++++++++++++++++++++++++++-- > hw/i386/pc_piix.c | 12 +++- > hw/i386/pc_q35.c | 14 +++- > hw/i386/sgx.c | 2 +- > hw/i386/x86.c | 1 + > hw/pci-host/i440fx.c | 10 ++- > include/hw/i386/pc.h | 4 +- > include/hw/i386/x86.h | 3 + > include/hw/pci-host/i440fx.h | 3 +- > 10 files changed, 161 insertions(+), 16 deletions(-) >