On 23/04/2025 11:56, Sairaj Kodilkar wrote:
> On 4/23/2025 4:15 PM, Sairaj Kodilkar wrote:
>> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>>> This series adds support for guests using the AMD vIOMMU to enable DMA
>>> remapping for VFIO devices. In addition to the currently supported
>>> passthrough (PT) mode, guest kernels are now able to provide DMA
>>> address translation and access permission checking to VFs attached to
>>> paging domains, using the AMD v1 I/O page table format.
>>>
>>> These changes provide the essential emulation required to boot and
>>> support regular operation for a Linux guest enabling DMA remapping,
>>> e.g. via kernel parameters "iommu=nopt" or "iommu.passthrough=0".
>>>
>>> A new amd-iommu device property "dma-remap" (default: off) is
>>> introduced to control whether the feature is available. See below for
>>> a full example of QEMU cmdline parameters used in testing.
>>>
>>> The patchset has been tested on an AMD EPYC Genoa host, with Linux
>>> 6.14 host and guest kernels, launching guests with up to 256 vCPUs,
>>> 512G memory, and 16 CX6 VFs. Testing with IOMMU x2apic support
>>> enabled (i.e. xtsup=on) requires fix:
>>> https://lore.kernel.org/all/20250410064447.29583-3-sarun...@amd.com/
>>>
>>> Although there is more work to do, I am sending this series as a
>>> patch and not an RFC since it provides a working implementation of
>>> the feature. With this basic infrastructure in place it becomes
>>> easier to add/verify enhancements and new functionality.
>>> Here are some items I am working to address in follow-up patches:
>>>
>>> - Page fault and error reporting
>>> - Add QEMU tracing and tests
>>> - Provide control over VA size advertised to guests
>>> - Support hotplug/unplug of devices and other advanced features
>>>   (suggestions welcomed)
>>>
>>> Thank you,
>>> Alejandro
>>>
>>> ---
>>> Example QEMU command line:
>>>
>>> $QEMU \
>>>   -nodefaults \
>>>   -snapshot \
>>>   -no-user-config \
>>>   -display none \
>>>   -serial mon:stdio -nographic \
>>>   -machine q35,accel=kvm,kernel_irqchip=split \
>>>   -cpu host,+topoext,+x2apic,-svm,-vmx,-kvm-msi-ext-dest-id \
>>>   -smp 32 \
>>>   -m 128G \
>>>   -kernel $KERNEL \
>>>   -initrd $INITRD \
>>>   -append "console=tty0 console=ttyS0 root=/dev/mapper/ol-root ro
>>> rd.lvm.lv=ol/root rd.lvm.lv=ol/swap iommu.passthrough=0" \
>>>   -device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
>>>   -blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=./OracleLinux-uefi-x86_64.qcow2 \
>>>   -device virtio-blk-pci,drive=drive0,id=virtio-disk0 \
>>>   -drive if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.fd,readonly=on \
>>>   -drive if=pflash,format=raw,unit=1,file=./OVMF_VARS.fd \
>>>   -device vfio-pci,host=0000:a1:00.1,id=net0
>>> ---
>>>
>>> Alejandro Jimenez (18):
>>>   memory: Adjust event ranges to fit within notifier boundaries
>>>   amd_iommu: Add helper function to extract the DTE
>>>   amd_iommu: Add support for IOMMU notifier
>>>   amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
>>>   amd_iommu: Toggle memory regions based on address translation mode
>>>   amd_iommu: Set all address spaces to default translation mode on reset
>>>   amd_iommu: Return an error when unable to read PTE from guest memory
>>>   amd_iommu: Helper to decode size of page invalidation command
>>>   amd_iommu: Add helpers to walk AMD v1 Page Table format
>>>   amd_iommu: Add a page walker to sync shadow page tables on
>>>     invalidation
>>>   amd_iommu: Sync shadow page tables on
>>>     page invalidation
>>>   amd_iommu: Add replay callback
>>>   amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
>>>   amd_iommu: Toggle address translation on device table entry
>>>     invalidation
>>>   amd_iommu: Use iova_tree records to determine large page size on UNMAP
>>>   amd_iommu: Do not assume passthrough translation when DTE[TV]=0
>>>   amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>>>   amd_iommu: Do not emit I/O page fault events during replay()
>>>
>>>  hw/i386/amd_iommu.c | 856 ++++++++++++++++++++++++++++++++++++++++----
>>>  hw/i386/amd_iommu.h |  52 +++
>>>  system/memory.c     |  10 +-
>>>  3 files changed, 843 insertions(+), 75 deletions(-)
>>>
>>>
>>> base-commit: 56c6e249b6988c1b6edc2dd34ebb0f1e570a1365
>>
>> Hi Alejandro,
>> I tested the patches with FIO and VFIO (using the guest's /dev/vfio/vfio)
>> tests inside the guest. Everything looks good to me.
>>
>> I also compared the fio performance with the following parameters on a
>> passthrough NVMe inside the guest with 16 vCPUs.
>>
>> [FIO PARAMETERS]
>> NVMEs     = 1
>> JOBS/NVME = 16
>> MODE      = RANDREAD
>> IOENGINE  = LIBAIO
>> IODEPTH   = 32
>> BLOCKSIZE = 4K
>> SIZE      = 100%
>>
>> RESULTS
>> =====================
>> Guest
>> IOMMU      IOPS
>> mode       (kilo)
>> =====================
>> nopt         13.7
>> pt         1191.0
>> ---------------------
>>
>> I see that nopt (emulate IOMMU) has a huge performance.
> This is supposed to be "huge performance penalty", sorry about the typo.
>> I wonder if the DMA remapping is really useful with such a performance
>> penalty.
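[Editor's note: the [FIO PARAMETERS] block above corresponds roughly to a job file like the one below. This is a sketch, not the job file used in the thread; the device path /dev/nvme0n1 is a placeholder for the passthrough NVMe in the guest, and direct=1 / group_reporting are assumptions that were not stated.]

```ini
; Sketch of the reported workload as an fio job file (assumptions noted above).
[global]
rw=randread          ; MODE = RANDREAD
ioengine=libaio      ; IOENGINE = LIBAIO
iodepth=32           ; IODEPTH = 32
numjobs=16           ; JOBS/NVME = 16
bs=4k                ; BLOCKSIZE = 4K
size=100%            ; SIZE = 100%
direct=1             ; assumption: bypass the guest page cache
group_reporting

[nvme-randread]
filename=/dev/nvme0n1   ; placeholder for the passthrough NVMe device
```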
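[Editor's note: for scale, the gap between the two rows in the results table works out to roughly an 87x slowdown for the emulated-IOMMU (nopt) case on this workload. A back-of-the-envelope check:]

```python
# IOPS figures reported in the results table above, in kIOPS.
nopt_kiops = 13.7    # iommu=nopt: guest uses emulated DMA remapping
pt_kiops = 1191.0    # pt: passthrough mode, no emulated translation

# Ratio of the two reported throughput numbers.
slowdown = pt_kiops / nopt_kiops
print(f"emulated IOMMU is ~{slowdown:.1f}x slower")  # ~86.9x
```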
This is not so much about performance, but more about guest compatibility (or
just not breaking guests) once you expose an amd-iommu device -- old or new
guests, since you can't control what your guests are running. It also brings
parity for Windows OSes, where this so far only works on Intel. There are also
more niche features (Windows Credential Guard, or Windows firmware DMA
protection, which requires a vIOMMU), and finally general testing/development
of the (v)IOMMU with real VFs.