* Alex Williamson (alex.william...@redhat.com) wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.mart...@oracle.com> wrote:
> 
> > On 6/22/21 10:16 PM, Alex Williamson wrote:
> > > On Tue, 22 Jun 2021 16:48:59 +0100
> > > Joao Martins <joao.m.mart...@oracle.com> wrote:
> > > 
> > >> Hey,
> > >>
> > >> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO,
> > >> particularly when running on AMD systems with an IOMMU.
> > >>
> > >> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is
> > >> valid and will return -EINVAL in those cases. On x86, Intel hosts aren't
> > >> particularly affected by this extra validation. But AMD systems with an
> > >> IOMMU have a hole at the 1TB boundary which is *reserved* for
> > >> HyperTransport I/O addresses located at FD_0000_0000h - FF_FFFF_FFFFh.
> > >> See the IOMMU manual [1], specifically section '2.1.2 IOMMU Logical
> > >> Topology', Table 3, for what those addresses mean.
> > >>
> > >> VFIO DMA_MAP calls in this IOVA address range fall through this check and
> > >> hence return -EINVAL, consequently failing the creation of guests bigger
> > >> than 1010G. Example of the failure:
> > >>
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1:
> > >>	failed to setup container for group 258: memory listener initialization failed:
> > >>	Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> > >>
> > >> Prior to v5.4, we could map using these IOVAs *but* that's still not the
> > >> right thing to do and could trigger certain IOMMU events (e.g.
> > >> INVALID_DEVICE_REQUEST), or spurious guest VF failures from the resultant
> > >> IOMMU target abort (see Errata 1155[2]), as documented in the links down
> > >> below.
> > >>
> > >> This series tries to address that by dealing with this AMD-specific 1Tb
> > >> hole, similarly to how we deal with the 4G hole today in x86 in general.
> > >> It is split as follows:
> > >>
> > >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> > >>            which also gets used in other parts of pc/acpi besides MR
> > >>            creation. The allowed IOVA *only* changes if it's an AMD host,
> > >>            so no change for Intel. We walk the allowed ranges for memory
> > >>            above 4G and add an E820_RESERVED type every time we find a
> > >>            hole (which is at the 1TB boundary).
> > >>
> > >>            NOTE: For purposes of this RFC, I rely on cpuid in
> > >>            hw/i386/pc.c, but I understand that it doesn't cover the
> > >>            non-x86 host case running TCG.
> > >>
> > >>            Additionally, as an alternative to the hardcoded ranges we use
> > >>            today, VFIO could advertise the platform's valid IOVA ranges
> > >>            without necessarily requiring a PCI device to be added to the
> > >>            vfio container. That would mean fetching the valid IOVA ranges
> > >>            from VFIO rather than hardcoding them as we do today. But
> > >>            sadly, that wouldn't work for older hypervisors.
> > > 
> > > 
> > > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > > 0x00000000fee00000 0x00000000feefffff msi
> > > 0x000000fd00000000 0x000000ffffffffff reserved
> > > 
> > Yeap, I am aware.
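As an aside, the walk patch 1 describes is simple enough to show standalone.
A minimal sketch follows; the constants and helper below are purely
illustrative (not taken from the actual patch) and just assume the HT window
quoted in the cover letter:

#include <inttypes.h>
#include <stdio.h>

/* The AMD HyperTransport reserved window at the 1Tb boundary, per the
 * cover letter: FD_0000_0000h - FF_FFFF_FFFFh (inclusive). */
#define AMD_HT_HOLE_START 0xFD00000000ULL
#define AMD_HT_HOLE_END   0xFFFFFFFFFFULL

struct iova_range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/* Split [start, end] around the HT hole; the hole itself would then be
 * reported to the guest as E820_RESERVED.  Returns how many usable
 * ranges were written to out[] (0, 1 or 2). */
static int split_around_ht_hole(uint64_t start, uint64_t end,
                                struct iova_range out[2])
{
    int n = 0;

    if (end < AMD_HT_HOLE_START || start > AMD_HT_HOLE_END) {
        out[n++] = (struct iova_range){ start, end };   /* no overlap */
        return n;
    }
    if (start < AMD_HT_HOLE_START) {
        out[n++] = (struct iova_range){ start, AMD_HT_HOLE_START - 1 };
    }
    if (end > AMD_HT_HOLE_END) {
        out[n++] = (struct iova_range){ AMD_HT_HOLE_END + 1, end };
    }
    return n;
}

int main(void)
{
    /* RAM from 4G up to 2Tb straddles the hole and ends up split in two */
    struct iova_range out[2];
    int n = split_around_ht_hole(0x100000000ULL, 0x1FFFFFFFFFFULL, out);

    for (int i = 0; i < n; i++) {
        printf("usable: 0x%" PRIx64 "-0x%" PRIx64 "\n",
               out[i].start, out[i].end);
    }
    return 0;
}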
> > 
> > The VFIO advertising extension was just because we already advertise the
> > above info, although behind a non-empty vfio container, e.g. we seem to
> > use that for example in collect_usable_iova_ranges().
> 
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container. Each group might have unique ranges. A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
> 
> > > Ideally we might take that into account on all hosts, but of course
> > > then we run into massive compatibility issues when we consider
> > > migration. We run into similar problems when people try to assign
> > > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > > memory hole overlapping the msi range.
> > > 
> > > The issue here is similar to trying to find a set of supported CPU
> > > flags across hosts: QEMU only has visibility to the host where it runs;
> > > an upper level tool needs to be able to pass through information about
> > > compatibility to all possible migration targets.
> > 
> > I agree with your generic sentiment (and idea), but are we sure this is
> > really something as dynamic, or as much in need of a common denominator,
> > as CPU features? The memory map looks to be deeply embedded in the devices
> > (ARM) or machine model (x86) that we pass in and doesn't change very
> > often. pc/q35 is one very good example, because it hasn't changed since
> > its inception [a decade?] (and this limitation is there only for
> > multi-socket AMD machines with an IOMMU and more than 1Tb). Additionally,
> > there might be architectural impositions, like on x86 where e.g. CMOS
> > seems to tie in with memory above certain boundaries. Unless by migration
> > targets you also mean to cover migrating between Intel and AMD hosts
> > (which may need to keep the reserved range in the common denominator
> > nonetheless).
> 
> I like the flexibility that being able to specify reserved ranges would
> provide, but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize. Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.
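For what it's worth, the per-host information an upper level tool would need
is already exposed read-only in sysfs; a throwaway sketch of gathering it
host-wide (illustrative only, it just prints the same data as the grep
further up):

#include <glob.h>
#include <inttypes.h>
#include <stdio.h>

/* Walk every IOMMU group's reserved_regions file; each line in the file
 * is "start end type", with the addresses in hex. */
int main(void)
{
    glob_t g;
    size_t i;

    if (glob("/sys/kernel/iommu_groups/*/reserved_regions", 0, NULL, &g)) {
        return 1;
    }

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        uint64_t start, end;
        char type[32];

        if (!f) {
            continue;
        }
        while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s",
                      &start, &end, type) == 3) {
            printf("%s: 0x%016" PRIx64 " 0x%016" PRIx64 " %s\n",
                   g.gl_pathv[i], start, end, type);
        }
        fclose(f);
    }
    globfree(&g);
    return 0;
}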
On x86 at least, the cross vendor part doesn't seem to be an issue; I
wouldn't expect an Intel->AMD migration to work reliably anyway.

> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole. That
> would likely be the more traditional solution to this issue. Thanks,

To me this seems a combination of machine type + CPU model; perhaps what
we're looking at here is having a list of holes, which can be contributed
to by any of:

  a) The machine type
  b) The CPU model
  c) An extra command line option like you list

Dave

> Alex
> 

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK