= 1010G guests on AMD systems with IOMMU

Joao Martins Wed, 08 Jun 2022 03:42:04 -0700

On 5/20/22 11:45, Joao Martins wrote:
> v4[5] -> v5:
> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> commit message;
> 
> ---
> 
> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> particularly when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid 
> and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fall through this check and 
> hence return
>  -EINVAL, consequently failing the creation the guests bigger than 1010G. 
> Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: 
> VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 
> 0000:41:10.1: 
>       failed to setup container for group 258: memory listener initialization 
> failed:
>               Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 
> 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map to these IOVAs *but* that's still not the right 
> thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 
> 1155[2])
> as documented on the links down below.
> 
> This small series tries to address that by dealing with this AMD-specific 1Tb 
> hole,
> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> It is organized as following:
> 
> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>          address of the 4G boundary
> 
> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
>            to get accessing to pci_hole64_size. The actual pci-host
>            initialization is kept as is, only the qdev_new.
> 
> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> possible address acrosses the HT region. Errors out if the phys-bits is too
> low, which is only the case for >=1010G configurations or something that
> crosses the HT region.
> 
> patch 5: Ensure valid IOVAs only on new machine types, but not older
> ones (<= v7.0.0)
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40
> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user 
> to
> pick the right value of phys-bits (regardless of this series), so we warn in
> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 
> 4G
> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> old seabios. 
> 
> Additionally, the reserved region is added to E820 if the relocation is done.
> 
> Alternative options considered (in RFC[0]):
> 
> a) Dealing with the 1T hole like the 4G hole -- which also represents what
> hardware closely does.
> 
> Thanks,
>       Joao
>


Ping?

> Older Changelog,
> 
> v3[4] -> v4[5]:
> (changes in patch 4 and 5 only)
> * Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
> * Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
> * Squash previous patch 5 into patch 4 to tie in the phys-bits check
>   into the relocate-4g-start logic: We now error out if the phys-bits
>   aren't enough on configurations that require above-4g ram relocation. 
> (Michael Tsirkin)
> * Make the error message more explicit when phys-bits isn't enough to also
>   mention: "cannot avoid AMD HT range"
> * Add comments inside x86_update_above_4g_mem_start() explaining the
>   logic behind it. (Michael Tsirkin)
> * Tested on old guests old guests with Linux 2.6.32/3.10/4.14.35/4.1 based 
> kernels
>   alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael 
> Tsirkin)
>   Validated -numa topologies too as well as making sure qtests observe no 
> regressions;
> 
>  Notes from v4:
> 
> * the machine attribute that enables this new logic (see last patch)
> is called ::enforce_valid_iova since the RFC. Let me know if folks think it
> is poorly named, and whether something a bit more obvious is preferred
> (e.g. ::amd_relocate_1t).
> 
> * @mst one of the comments you said was to add "host checks" in vdpa/vfio 
> devices.
> In discussion with Alex and you over the last version of the patches it seems
> that we weren't keen on making this device-specific or behind any machine
> property flags (besides machine-compat). Just to reiterate there, making sure 
> we do
> the above-4g relocation requiring properly sized phys-bits and AMD as vCPU
> vendor (as this series) already ensures thtat this is going to be right for
> offending configuration with VDPA/VFIO device that might be
> configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
> start poking into machine-specific details when we fail to relocate due to the
> lack of phys-bits? Otherwise Qemu, just doesn't have enough information to 
> tell
> what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is 
> the one
> that needs fixing (as VFIO did in v5.4).
> 
> RFCv2[3] -> v3[4]:
> 
> * Add missing brackets in single line statement, in patch 5 (David)
> * Change ranges printf to use PRIx64, in patch 5 (David)
> * Move the check to after changing above_4g_mem_start, in patch 5 (David)
> * Make the check generic and move it to pc_memory_init rather being specific
> to AMD, as the check is useful to capture invalid phys-bits
> configs (patch 5, Igor).
> * Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
> * Consider pci_hole64_size in patch 4 (Igor)
> * To consider pci_hole64_size in max used addr we need to get it from 
> pci-host,
> so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") 
> or
> qdev_new("q35") to be before pc_memory_init().
> * Consider sgx_epc.size in max used address, in patch 4 (Igor)
> * Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
> * Keep warn_report() in patch 5, as erroring out will break a few x86_64 
> qtests
> due to pci_hole64 accounting surprass phys-bits possible maxphysaddr.
> 
> RFC[0] -> RFCv2[3]:
> 
> * At Igor's suggestion in one of the patches I reworked the series enterily,
> and more or less as he was thinking it is far simpler to relocate the
> ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
> and less intrusive. (patch 1 & 2)
> * Check phys-bits is big enough prior to relocating (new patch 3)
> * Remove the machine property, and it's only internal and set by new machine
> version (Igor, patch 4).
> * Clarify whether it's GPA or HPA as a more clear meaning (Igor, patch 2)
> * Add IOMMU SDM in the commit message (Igor, patch 2)
> 
> [0] 
> https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.mart...@oracle.com/
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> [3] 
> https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.mart...@oracle.com/T/#u
> [4] 
> https://lore.kernel.org/all/20220223184455.9057-1-joao.m.mart...@oracle.com/
> [5] 
> https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.mart...@oracle.com/
> 
> Joao Martins (5):
>   hw/i386: add 4g boundary start to X86MachineState
>   i386/pc: create pci-host qdev prior to pc_memory_init()
>   i386/pc: pass pci_hole64_size to pc_memory_init()
>   i386/pc: relocate 4g start to 1T where applicable
>   i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
>     type
> 
>  hw/i386/acpi-build.c         |   2 +-
>  hw/i386/pc.c                 | 126 +++++++++++++++++++++++++++++++++--
>  hw/i386/pc_piix.c            |  12 +++-
>  hw/i386/pc_q35.c             |  14 +++-
>  hw/i386/sgx.c                |   2 +-
>  hw/i386/x86.c                |   1 +
>  hw/pci-host/i440fx.c         |  10 ++-
>  include/hw/i386/pc.h         |   4 +-
>  include/hw/i386/x86.h        |   3 +
>  include/hw/pci-host/i440fx.h |   3 +-
>  10 files changed, 161 insertions(+), 16 deletions(-)
>

Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU

Reply via email to