On Thu, 23 Jun 2022 00:18:06 +0100 Joao Martins <joao.m.mart...@oracle.com> wrote:
> On 6/22/22 23:37, Alex Williamson wrote:
> > On Fri, 20 May 2022 11:45:27 +0100
> > Joao Martins <joao.m.mart...@oracle.com> wrote:
> >> v4[5] -> v5:
> >> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> >> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> >>   commit message;
> >>
> >> ---
> >>
> >> This series lets QEMU spawn i386 guests with >= 1010G with VFIO,
> >> particularly when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl
> >> is valid and will return -EINVAL in those cases. On x86, Intel hosts
> >> aren't particularly affected by this extra validation. But AMD systems
> >> with an IOMMU have a hole at the 1TB boundary which is *reserved* for
> >> HyperTransport I/O addresses located here: FD_0000_0000h -
> >> FF_FFFF_FFFFh. See the IOMMU manual [1], specifically section
> >> '2.1.2 IOMMU Logical Topology', Table 3, on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check
> >> and hence return -EINVAL, consequently failing the creation of guests
> >> bigger than 1010G. Example of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1:
> >>     VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1:
> >>     vfio 0000:41:10.1: failed to setup container for group 258: memory
> >>     listener initialization failed: Region pc.ram:
> >>     vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000,
> >>     0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map to these IOVAs *but* that's still not the
> >> right thing to do and could trigger certain IOMMU events (e.g.
> >> INVALID_DEVICE_REQUEST), or spurious guest VF failures from the
> >> resultant IOMMU target abort (see Errata 1155 [2]), as documented in
> >> the links below.
> >>
> >> This small series tries to address that by dealing with this
> >> AMD-specific 1TB hole, but rather than handling it like the 4G hole,
> >> it instead relocates RAM above 4G to be above the 1T boundary if the
> >> maximum RAM range crosses the HT reserved range. It is organized as
> >> follows:
> >>
> >> patch 1: Introduce an @above_4g_mem_start which defaults to 4 GiB as
> >> the starting address of the 4G boundary
> >>
> >> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> >>              to get access to pci_hole64_size. The actual pci-host
> >>              initialization is kept as is; only the qdev_new is moved.
> >>
> >> patch 4: Change @above_4g_mem_start to 1TiB if we are on AMD and the
> >> max possible address crosses the HT region. It errors out if phys-bits
> >> is too low, which is only the case for >=1010G configurations or
> >> something that crosses the HT region.
> >>
> >> patch 5: Ensure valid IOVAs only on new machine types, but not older
> >> ones (<= v7.0.0)
> >>
> >> The 'consequence' of this approach is that we may need more than the
> >> default phys-bits, e.g. a guest with >1010G will have most of its RAM
> >> after the 1TB address, consequently needing 41 phys-bits as opposed to
> >> the default of 40 (TCG_PHYS_ADDR_BITS). Today there's already a
> >> precedent to depend on the user to pick the right value of phys-bits
> >> (regardless of this series), so we warn in case phys-bits isn't enough.
> >> Finally, CMOS loses its meaning for the above-4G RAM blocks, but it was
> >> mentioned over RFC that CMOS is only useful for very old SeaBIOS.
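[For readers following the cover letter's description of patch 4, a minimal,
self-contained C sketch of the relocation decision. All names here
(pick_above_4g_start(), max_used_gpa(), the macros) are illustrative
stand-ins rather than the series' actual QEMU code, and the max-address
computation is simplified (the real code also has to account for hole64
alignment).]

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define AMD_HT_START           0xFD00000000ULL  /* FD_0000_0000h */
#define AMD_HT_END             0xFFFFFFFFFFULL  /* FF_FFFF_FFFFh */
#define DEFAULT_ABOVE_4G_START (4ULL << 30)     /* 4 GiB */

/* Simplified "max possible guest address": above-4G RAM plus the 64-bit
 * PCI hole stacked on top of it. */
static uint64_t max_used_gpa(uint64_t above_4g_start, uint64_t above_4g_mem,
                             uint64_t pci_hole64_size)
{
    return above_4g_start + above_4g_mem + pci_hole64_size - 1;
}

/* Pick the above-4G base: 4 GiB normally, 1 TiB when an AMD host layout
 * would otherwise cross the HyperTransport reserved range. */
static uint64_t pick_above_4g_start(bool host_is_amd, uint64_t above_4g_mem,
                                    uint64_t pci_hole64_size,
                                    unsigned phys_bits)
{
    uint64_t start = DEFAULT_ABOVE_4G_START;

    if (host_is_amd &&
        max_used_gpa(start, above_4g_mem, pci_hole64_size) >= AMD_HT_START) {
        start = 1ULL << 40;  /* relocate above-4G RAM above the 1 TiB hole */
    }

    if (max_used_gpa(start, above_4g_mem, pci_hole64_size) >= (1ULL << phys_bits)) {
        fprintf(stderr, "phys-bits %u too low for this memory layout\n",
                phys_bits);
    }
    return start;
}

[E.g. with roughly 1010 GiB of above-4G RAM, the relocated layout ends well
past 2^40, which matches the cover letter's point about needing 41 phys-bits
instead of the default 40.]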
> >>
> >> Additionally, the reserved region is added to E820 if the relocation is
> >> done.
> >
> > I was helping a user on irc yesterday who was assigning a bunch of GPUs
> > on an AMD system and was not specifying an increased PCI hole, and
> > therefore was not triggering the relocation. The result was that the
> > VM doesn't know about this special range and, given their guest RAM
> > size, firmware was mapping GPU BARs overlapping this reserved range
> > anyway. I didn't see any evidence that this user was doing anything
> > like booting with pci=nocrs to blatantly ignore the firmware-provided
> > bus resources.
> >
> > To avoid this sort of thing, shouldn't this hypertransport range always
> > be marked reserved regardless of whether the relocation is done?
>
> Yeap, I think that's the right thing to do. We were alluding to that in
> patch 4.
>
> I can switch said patch to IS_AMD() together with a phys-bits check to
> add the range to e820.
>
> But in practice, right now, this is going to be merely informative and
> doesn't change the outcome, as OVMF ignores reserved ranges if I
> understood that code correctly. :-\
>
> The relocation is most effective at avoiding this reserved-range overlap
> on guests with less than 1010GiB. Do we need to do the relocation by
> default?
>
> > vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> > scenario can be rather subtle. NB, it also did not resolve this user's
> > problem to specify the PCI hole size and activate the relocation, so
> > this was not necessarily the issue they were fighting, but I noted it
> > as an apparent gap in this series. Thanks,
>
> So I take it that even after the user expanded the PCI hole64 size, and
> thus the GPU BARs were placed in a non-reserved range... still saw the
> MMIO mappings fail?

No, the mapping failures are resolved if the hole64 size is set, it's just
that there seem to be remaining issues where a device occasionally gets
into a bad state that isn't resolved by restarting the VM. AFAICT, p2p
mappings are not being used, so the faults were more of a nuisance than
actually contributing to the issues this user is working through. Thanks,

Alex
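[As a rough illustration of the e820 change Joao mentions above, i.e. marking
the HT range reserved on AMD hosts regardless of whether the relocation
triggers: the IS_AMD-style gate and the phys-bits threshold below are
assumptions, not the final patch; only e820_add_entry()/E820_RESERVED are
taken from QEMU's hw/i386 e820_memory_layout helpers.]

#include <stdbool.h>
#include <stdint.h>
#include "e820_memory_layout.h"  /* QEMU's e820_add_entry()/E820_RESERVED */

#define AMD_HT_START 0xFD00000000ULL                    /* FD_0000_0000h   */
#define AMD_HT_SIZE  (0x10000000000ULL - AMD_HT_START)  /* ..FF_FFFF_FFFFh */

/* Reserve the HyperTransport range whenever the guest can address it,
 * not only when the above-4G RAM relocation actually fired. */
static void reserve_amd_ht_range(bool host_is_amd, unsigned phys_bits)
{
    if (host_is_amd && phys_bits >= 40) {  /* 40 bits reach FD_0000_0000h */
        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
    }
}

[Whether to gate on phys-bits at all, or simply reserve the range
unconditionally on AMD, is the open question in the exchange above.]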