On 6/23/21 8:27 PM, Alex Williamson wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.mart...@oracle.com> wrote:
>
>> On 6/22/21 10:16 PM, Alex Williamson wrote:
>>> On Tue, 22 Jun 2021 16:48:59 +0100
>>> Joao Martins <joao.m.mart...@oracle.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO,
>>>> particularly when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is
>>>> valid and will return -EINVAL in those cases. On x86, Intel hosts aren't
>>>> particularly affected by this extra validation. But AMD systems with an
>>>> IOMMU have a hole at the 1TB boundary which is *reserved* for
>>>> HyperTransport I/O addresses located at FD_0000_0000h - FF_FFFF_FFFFh.
>>>> See the IOMMU manual [1], specifically section '2.1.2 IOMMU Logical
>>>> Topology', Table 3, for what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and
>>>> hence return -EINVAL, consequently failing the creation of guests bigger
>>>> than 1010G. Example of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1:
>>>>    failed to setup container for group 258: memory listener initialization failed:
>>>>    Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map using these IOVAs *but* that's still not the
>>>> right thing to do and could trigger certain IOMMU events (e.g.
>>>> INVALID_DEVICE_REQUEST), or spurious guest VF failures from the resultant
>>>> IOMMU target abort (see Errata 1155 [2]), as documented in the links
>>>> below.
>>>>
>>>> This series tries to address that by dealing with this AMD-specific 1Tb
>>>> hole, similarly to how we deal with the 4G hole today in x86 in general.
>>>> It is split as follows:
>>>>
>>>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>>>            which also gets used in other parts of pc/acpi besides MR
>>>>            creation. The allowed IOVA *only* changes if it's an AMD host,
>>>>            so no change for Intel. We walk the allowed ranges for memory
>>>>            above 4G, and add an E820_RESERVED entry every time we find a
>>>>            hole (which is at the 1TB boundary).
>>>>
>>>>    NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>>>>    understand that it doesn't cover the non-x86 host case running TCG.
>>>>
>>>>    Additionally, as an alternative to the hardcoded ranges we use today,
>>>>    VFIO could advertise the platform's valid IOVA ranges without
>>>>    necessarily requiring a PCI device to be added to the vfio container.
>>>>    That would mean fetching the valid IOVA ranges from VFIO, rather than
>>>>    using hardcoded IOVA ranges as we do today. But sadly, that wouldn't
>>>>    work for older hypervisors.
>>>
>>>
>>> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
>>> 0x00000000fee00000 0x00000000feefffff msi
>>> 0x000000fd00000000 0x000000ffffffffff reserved
>>>
>> Yeap, I am aware.
>>
>> The VFIO advertising extension was just because we already advertise the
>> above info, although behind a non-empty vfio container, e.g. we seem to use
>> that for example in collect_usable_iova_ranges().
>
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container.  Each group might have unique ranges.  A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>

Hmm OK, I see. The suggestion/point was because the AMD IOMMU seems to mark
these as reserved, for both the MSI range and the HyperTransport hole,
regardless of device/group specifics. See amd_iommu_get_resv_regions(). So I
thought that something else for advertising platform-wide reserved ranges
would be more appropriate than replicating the same info in each group's
reserved_regions sysfs entry.
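Just to illustrate the point: a tool could already derive the platform-wide
picture purely from those per-group sysfs entries, without opening any vfio
container. A rough, untested userspace sketch (the helper name is made up,
it's not an existing interface), which just unions what each group's
reserved_regions file reports ("<start> <end> <type>" per line):

#include <glob.h>
#include <stdio.h>
#include <inttypes.h>
#include <stdint.h>

/* Hypothetical helper: walk every IOMMU group's reserved_regions file
 * and print the reserved IOVA ranges the platform advertises. */
static void dump_reserved_regions(void)
{
    glob_t g;
    size_t i;

    if (glob("/sys/kernel/iommu_groups/*/reserved_regions", 0, NULL, &g)) {
        return;
    }

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        uint64_t start, end;
        char type[32];

        if (!f) {
            continue;
        }
        while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s",
                      &start, &end, type) == 3) {
            printf("%s: 0x%016" PRIx64 " - 0x%016" PRIx64 " (%s)\n",
                   g.gl_pathv[i], start, end, type);
        }
        fclose(f);
    }
    globfree(&g);
}

int main(void)
{
    dump_reserved_regions();
    return 0;
}

Since amd_iommu_get_resv_regions() adds the same msi + HT entries for every
device, every group here ends up reporting the same two ranges, which is
what made me think a single platform-wide view could be advertised instead.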
>>> Ideally we might take that into account on all hosts, but of course
>>> then we run into massive compatibility issues when we consider
>>> migration.  We run into similar problems when people try to assign
>>> devices to non-x86 TCG hosts, where the arch doesn't have a natural
>>> memory hole overlapping the msi range.
>>>
>>> The issue here is similar to trying to find a set of supported CPU
>>> flags across hosts, QEMU only has visibility to the host where it runs,
>>> an upper level tool needs to be able to pass through information about
>>> compatibility to all possible migration targets.
>>
>> I agree with your generic sentiment (and idea) but are we sure this is
>> really something as dynamic, and as in need of a common denominator, as CPU
>> features? The memory map looks to be deeply embedded in the devices (ARM)
>> or machine model (x86) that we pass in and doesn't change very often.
>> pc/q35 is one very good example, because this hasn't changed since its
>> inception [a decade?] (and this limitation is there only for multi-socket
>> AMD machines with an IOMMU and more than 1Tb). Additionally, there might be
>> architectural impositions, like on x86, e.g. CMOS seems to tie in with
>> memory above certain boundaries. Unless by migration targets you mean to
>> also cover migration between Intel and AMD hosts (which may need to keep
>> the reserved range nonetheless as the common denominator).
>
> I like the flexibility that being able to specify reserved ranges would
> provide,

/me nods

> but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize.  Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.
>
> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole.  That
> would likely be the more traditional solution to this issue.

Maybe there could be a generic facility that stores/manages the reserved
ranges, and then the different machines can provide a default set depending
on the running target heuristics (AMD x86, generic x86, ARM, etc). To some
extent it means tracking reserved ranges, rather than tracking usable IOVA as
I do here (and moving that somewhere else that's not x86 specific).
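To make that a bit more concrete, here's a strawman sketch of what such a
facility could look like (none of these names exist in QEMU today, purely
illustrative): a small table of reserved IOVA ranges that each machine seeds
with its defaults, and that the memory-map/e820 setup and the VFIO listener
could then both query:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct ReservedIovaRange {
    uint64_t start;
    uint64_t end;       /* inclusive */
} ReservedIovaRange;

#define MAX_RESERVED_RANGES 16

static ReservedIovaRange reserved_ranges[MAX_RESERVED_RANGES];
static size_t nr_reserved_ranges;

static void reserved_iova_add(uint64_t start, uint64_t end)
{
    if (nr_reserved_ranges < MAX_RESERVED_RANGES) {
        reserved_ranges[nr_reserved_ranges++] =
            (ReservedIovaRange) { .start = start, .end = end };
    }
}

/* Default set an AMD x86 machine might register: the HyperTransport
 * hole below 1T (FD_0000_0000h - FF_FFFF_FFFFh). */
static void amd_x86_reserved_iova_defaults(void)
{
    reserved_iova_add(0xfd00000000ULL, 0xffffffffffULL);
}

/* Consumers (memory layout, e820, VFIO listener) then only need one
 * question answered: does [start, start+size) overlap anything reserved? */
static bool iova_range_is_reserved(uint64_t start, uint64_t size)
{
    uint64_t last = start + size - 1;

    for (size_t i = 0; i < nr_reserved_ranges; i++) {
        if (start <= reserved_ranges[i].end &&
            last >= reserved_ranges[i].start) {
            return true;
        }
    }
    return false;
}

int main(void)
{
    amd_x86_reserved_iova_defaults();
    /* Probe an address inside the HT hole. */
    printf("0xfd00000000 reserved? %d\n",
           iova_range_is_reserved(0xfd00000000ULL, 0x1000));
    return 0;
}

The AMD x86 default above is just the 1Tb HyperTransport hole this series is
concerned with; other machines/hosts would seed nothing, or their own ranges.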