On Wed, 23 Jun 2021 10:30:29 +0100 Joao Martins <joao.m.mart...@oracle.com> wrote:
> On 6/22/21 10:16 PM, Alex Williamson wrote:
> > On Tue, 22 Jun 2021 16:48:59 +0100
> > Joao Martins <joao.m.mart...@oracle.com> wrote:
> >
> >> Hey,
> >>
> >> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO,
> >> particularly when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in a DMA_MAP ioctl is
> >> valid and will return -EINVAL in those cases. On x86, Intel hosts aren't
> >> particularly affected by this extra validation. But AMD systems with an
> >> IOMMU have a hole at the 1TB boundary which is *reserved* for
> >> HyperTransport I/O addresses located at FD_0000_0000h - FF_FFFF_FFFFh.
> >> See the IOMMU manual [1], specifically section '2.1.2 IOMMU Logical
> >> Topology', Table 3, for what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check
> >> and hence return -EINVAL, consequently failing the creation of guests
> >> bigger than 1010G. Example of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1:
> >>     VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1:
> >>     vfio 0000:41:10.1: failed to setup container for group 258:
> >>     memory listener initialization failed:
> >>     Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000,
> >>     0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map using these IOVAs *but* that's still not the
> >> right thing to do and could trigger certain IOMMU events (e.g.
> >> INVALID_DEVICE_REQUEST), or spurious guest VF failures from the
> >> resultant IOMMU target abort (see Errata 1155 [2]), as documented in
> >> the links below.
> >>
> >> This series tries to address that by dealing with this AMD-specific 1Tb
> >> hole, similarly to how we deal with the 4G hole today in x86 in general.
> >> It is split as follows:
> >>
> >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> >>            which gets used too in other parts of pc/acpi besides MR
> >>            creation. The allowed IOVA *only* changes if it's an AMD
> >>            host, so no change for Intel. We walk the allowed ranges for
> >>            memory above 4G, and add an E820_RESERVED type every time we
> >>            find a hole (which is at the 1TB boundary).
> >>
> >>            NOTE: For purposes of this RFC, I rely on cpuid in
> >>            hw/i386/pc.c but I understand that it doesn't cover the
> >>            non-x86 host case running TCG.
> >>
> >>            Additionally, as an alternative to the hardcoded ranges we
> >>            use today, VFIO could advertise the platform's valid IOVA
> >>            ranges without necessarily requiring a PCI device to be
> >>            added to the vfio container. That would mean fetching the
> >>            valid IOVA ranges from VFIO, rather than the hardcoded IOVA
> >>            ranges we use today. But sadly, that wouldn't work for older
> >>            hypervisors.
> >
> >
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >
> Yeap, I am aware.
>
> The VFIO advertising extension was just because we already advertise the
> above info, although behind a non-empty vfio container, e.g. we seem to
> use that for example in collect_usable_iova_ranges().

VFIO can't guess what groups you'll use to mark reserved ranges in an
empty container.  Each group might have unique ranges.  A container
enforcing ranges unrelated to the groups/devices in use doesn't make
sense.
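
For reference, that per-group data is already consumable from userspace
without any VFIO container at all; a minimal standalone sketch (error
handling trimmed, parsing simplified) that dumps the same information as
the grep above:

/*
 * Sketch only: walk /sys/kernel/iommu_groups/<group>/reserved_regions
 * and print each group's reserved ranges.  Each line in that file has
 * the form "<start> <end> <type>", e.g.
 * "0x000000fd00000000 0x000000ffffffffff reserved".
 */
#include <dirent.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    const char *base = "/sys/kernel/iommu_groups";
    DIR *dir = opendir(base);
    struct dirent *de;

    if (!dir) {
        perror(base);
        return 1;
    }

    while ((de = readdir(dir)) != NULL) {
        char path[512];
        char type[64];
        uint64_t start, end;
        FILE *f;

        if (de->d_name[0] == '.') {
            continue;            /* skip "." and ".." */
        }
        snprintf(path, sizeof(path), "%s/%s/reserved_regions",
                 base, de->d_name);
        f = fopen(path, "r");
        if (!f) {
            continue;
        }
        while (fscanf(f, "%" SCNx64 " %" SCNx64 " %63s",
                      &start, &end, type) == 3) {
            printf("group %s: 0x%016" PRIx64 "-0x%016" PRIx64 " %s\n",
                   de->d_name, start, end, type);
        }
        fclose(f);
    }
    closedir(dir);
    return 0;
}

Which groups actually matter for a given guest is still something only
the management layer can know, per the point above.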
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> >
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it
> > runs, an upper level tool needs to be able to pass through information
> > about compatibility to all possible migration targets.
>
> I agree with your generic sentiment (and idea) but are we sure this is
> really something as dynamic/needing a common denominator like CPU
> features?  The memory map looks to be deeply embedded in the devices
> (ARM) or machine model (x86) that we pass in and doesn't change very
> often.  pc/q35 is one very good example, because this hasn't changed
> since its inception [a decade?] (and this limitation is there only for
> any multi-socket AMD machine with an IOMMU and more than 1Tb).
> Additionally, there might be architectural impositions like on x86,
> e.g. CMOS seems to tie in with memory above certain boundaries.  Unless
> by migration targets you mean to also cover migration between Intel and
> AMD hosts (which may need to keep the reserved range in the common
> denominator nonetheless).

I like the flexibility that being able to specify reserved ranges would
provide, but I agree that the machine memory map is usually deeply
embedded into the arch code and would probably be difficult to
generalize.  Cross-vendor migration should be a consideration and only
an inter-system management policy could specify the importance of that.

Perhaps as David mentioned, this is really a machine type issue, where
the address width downsides you've noted might be sufficient reason to
introduce a new machine type that includes this memory hole.  That would
likely be the more traditional solution to this issue.  Thanks,

Alex
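
A small standalone sketch of the address arithmetic such a machine type
(or patch 1) implies, splitting the above-4G RAM around the
FD_0000_0000h - FF_FFFF_FFFFh window; the numbers come from the failing
vfio_dma_map() quoted earlier, and the split logic is an illustration,
not the series' implementation:

/*
 * Illustration only: carve the HyperTransport window out of the guest
 * memory map, the same way the sub-4G PCI hole pushes RAM up to 4 GiB
 * today.  The RAM size is taken from the vfio_dma_map() failure above.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    const uint64_t above_4g_start = 0x100000000ULL;   /* 4 GiB */
    const uint64_t amd_ht_start   = 0xFD00000000ULL;  /* FD_0000_0000h */
    const uint64_t amd_ht_end     = 0xFFFFFFFFFFULL;  /* FF_FFFF_FFFFh */
    const uint64_t ram_above_4g   = 0xff30000000ULL;  /* size from the -22 map */
    uint64_t end = above_4g_start + ram_above_4g;
    uint64_t spill;

    if (end <= amd_ht_start) {
        /* Everything fits below the HT window, nothing to do. */
        printf("ram above 4G:  0x%" PRIx64 "-0x%" PRIx64 "\n",
               above_4g_start, end - 1);
        return 0;
    }

    /* First chunk stops at the HT window; the remainder moves above 1 TiB. */
    spill = end - amd_ht_start;

    printf("ram above 4G:  0x%" PRIx64 "-0x%" PRIx64 "\n",
           above_4g_start, amd_ht_start - 1);
    printf("e820 reserved: 0x%" PRIx64 "-0x%" PRIx64 "\n",
           amd_ht_start, amd_ht_end);
    printf("ram above 1T:  0x%" PRIx64 "-0x%" PRIx64 "\n",
           amd_ht_end + 1, amd_ht_end + spill);
    return 0;
}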