On Thu, 23 Jun 2022 00:18:06 +0100 Joao Martins <joao.m.mart...@oracle.com> wrote:
> On 6/22/22 23:37, Alex Williamson wrote:
> > On Fri, 20 May 2022 11:45:27 +0100
> > Joao Martins <joao.m.mart...@oracle.com> wrote:
> >> v4[5] -> v5:
> >> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> >> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> >>   commit message;
> >>
> >> ---
> >>
> >> This series lets QEMU spawn i386 guests with >= 1010G with VFIO,
> >> particularly when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl
> >> is valid and will return -EINVAL in those cases. On x86, Intel hosts
> >> aren't particularly affected by this extra validation. But AMD systems
> >> with an IOMMU have a hole at the 1TB boundary which is *reserved* for
> >> HyperTransport I/O addresses located here: FD_0000_0000h -
> >> FF_FFFF_FFFFh. See the IOMMU manual [1], specifically section
> >> '2.1.2 IOMMU Logical Topology', Table 3, on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check
> >> and hence return -EINVAL, consequently failing the creation of guests
> >> bigger than 1010G. Example of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1:
> >>     VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1:
> >>     vfio 0000:41:10.1: failed to setup container for group 258: memory
> >>     listener initialization failed: Region pc.ram:
> >>     vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000,
> >>     0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map to these IOVAs *but* that's still not the
> >> right thing to do and could trigger certain IOMMU events (e.g.
> >> INVALID_DEVICE_REQUEST), or spurious guest VF failures from the
> >> resultant IOMMU target abort (see Errata 1155 [2]), as documented in
> >> the links below.
> >>
> >> This small series tries to address that by dealing with this
> >> AMD-specific 1TB hole, but rather than handling it like the 4G hole,
> >> it instead relocates RAM above 4G to be above the 1T boundary if the
> >> maximum RAM range crosses the HT reserved range. It is organized as
> >> follows:
> >>
> >> patch 1: Introduce an @above_4g_mem_start which defaults to 4 GiB as
> >> the starting address of the 4G boundary
> >>
> >> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> >>              to get access to pci_hole64_size. The actual pci-host
> >>              initialization is kept as is; only the qdev_new is moved.
> >>
> >> patch 4: Change @above_4g_mem_start to 1TiB if we are on AMD and the
> >> max possible address crosses the HT region. It errors out if phys-bits
> >> is too low, which is only the case for >=1010G configurations or
> >> something that crosses the HT region.
> >>
> >> patch 5: Ensure valid IOVAs only on new machine types, but not older
> >> ones (<= v7.0.0)
> >>
> >> The 'consequence' of this approach is that we may need more than the
> >> default phys-bits, e.g. a guest with >1010G will have most of its RAM
> >> after the 1TB address, consequently needing 41 phys-bits as opposed to
> >> the default of 40 (TCG_PHYS_ADDR_BITS). Today there's already a
> >> precedent to depend on the user to pick the right value of phys-bits
> >> (regardless of this series), so we warn in case phys-bits isn't enough.
> >> Finally, CMOS loses its meaning for the above-4G RAM blocks, but it was
> >> mentioned over RFC that CMOS is only useful for very old SeaBIOS.
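[For readers following the cover letter's description of patch 4, a minimal,
self-contained C sketch of the relocation decision. All names here
(pick_above_4g_start(), max_used_gpa(), the macros) are illustrative
stand-ins rather than the series' actual QEMU code, and the max-address
computation is simplified (the real code also has to account for hole64
alignment).]

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define AMD_HT_START           0xFD00000000ULL  /* FD_0000_0000h */
#define AMD_HT_END             0xFFFFFFFFFFULL  /* FF_FFFF_FFFFh */
#define DEFAULT_ABOVE_4G_START (4ULL << 30)     /* 4 GiB */

/* Simplified "max possible guest address": above-4G RAM plus the 64-bit
 * PCI hole stacked on top of it. */
static uint64_t max_used_gpa(uint64_t above_4g_start, uint64_t above_4g_mem,
                             uint64_t pci_hole64_size)
{
    return above_4g_start + above_4g_mem + pci_hole64_size - 1;
}

/* Pick the above-4G base: 4 GiB normally, 1 TiB when an AMD host layout
 * would otherwise cross the HyperTransport reserved range. */
static uint64_t pick_above_4g_start(bool host_is_amd, uint64_t above_4g_mem,
                                    uint64_t pci_hole64_size,
                                    unsigned phys_bits)
{
    uint64_t start = DEFAULT_ABOVE_4G_START;

    if (host_is_amd &&
        max_used_gpa(start, above_4g_mem, pci_hole64_size) >= AMD_HT_START) {
        start = 1ULL << 40;  /* relocate above-4G RAM above the 1 TiB hole */
    }

    if (max_used_gpa(start, above_4g_mem, pci_hole64_size) >= (1ULL << phys_bits)) {
        fprintf(stderr, "phys-bits %u too low for this memory layout\n",
                phys_bits);
    }
    return start;
}

[E.g. with roughly 1010 GiB of above-4G RAM, the relocated layout ends well
past 2^40, which matches the cover letter's point about needing 41 phys-bits
instead of the default 40.]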
> >>
> >> Additionally, the reserved region is added to E820 if the relocation is
> >> done.
> >
> > I was helping a user on irc yesterday who was assigning a bunch of GPUs
> > on an AMD system and was not specifying an increased PCI hole, and
> > therefore was not triggering the relocation. The result was that the
> > VM doesn't know about this special range and, given their guest RAM
> > size, firmware was mapping GPU BARs overlapping this reserved range
> > anyway. I didn't see any evidence that this user was doing anything
> > like booting with pci=nocrs to blatantly ignore the firmware-provided
> > bus resources.
> >
> > To avoid this sort of thing, shouldn't this hypertransport range always
> > be marked reserved regardless of whether the relocation is done?
>
> Yeap, I think that's the right thing to do. We were alluding to that in
> patch 4.
>
> I can switch said patch to IS_AMD() together with a phys-bits check to
> add the range to e820.
>
> But in practice, right now, this is going to be merely informative and
> doesn't change the outcome, as OVMF ignores reserved ranges if I
> understood that code correctly. :-\
>
> The relocation is most effective at avoiding this reserved-range overlap
> on guests with less than 1010GiB. Do we need to do the relocation by
> default?
>
> > vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> > scenario can be rather subtle. NB, it also did not resolve this user's
> > problem to specify the PCI hole size and activate the relocation, so
> > this was not necessarily the issue they were fighting, but I noted it
> > as an apparent gap in this series. Thanks,
>
> So I take it that even after the user expanded the PCI hole64 size, and
> thus the GPU BARs were placed in a non-reserved range... still saw the
> MMIO mappings fail?

No, the mapping failures are resolved if the hole64 size is set, it's just
that there seem to be remaining issues where a device occasionally gets
into a bad state that isn't resolved by restarting the VM. AFAICT, p2p
mappings are not being used, so the faults were more of a nuisance than
actually contributing to the issues this user is working through. Thanks,

Alex
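[As a rough illustration of the e820 change Joao mentions above, i.e. marking
the HT range reserved on AMD hosts regardless of whether the relocation
triggers: the IS_AMD-style gate and the phys-bits threshold below are
assumptions, not the final patch; only e820_add_entry()/E820_RESERVED are
taken from QEMU's hw/i386 e820_memory_layout helpers.]

#include <stdbool.h>
#include <stdint.h>
#include "e820_memory_layout.h"  /* QEMU's e820_add_entry()/E820_RESERVED */

#define AMD_HT_START 0xFD00000000ULL                    /* FD_0000_0000h   */
#define AMD_HT_SIZE  (0x10000000000ULL - AMD_HT_START)  /* ..FF_FFFF_FFFFh */

/* Reserve the HyperTransport range whenever the guest can address it,
 * not only when the above-4G RAM relocation actually fired. */
static void reserve_amd_ht_range(bool host_is_amd, unsigned phys_bits)
{
    if (host_is_amd && phys_bits >= 40) {  /* 40 bits reach FD_0000_0000h */
        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
    }
}

[Whether to gate on phys-bits at all, or simply reserve the range
unconditionally on AMD, is the open question in the exchange above.]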