On 4/23/2025 4:15 PM, Sairaj Kodilkar wrote:


On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
This series adds support for guests using the AMD vIOMMU to enable DMA
remapping for VFIO devices. In addition to the currently supported
passthrough (PT) mode, guest kernels are now able to provide DMA
address translation and access permission checking to VFs attached to
paging domains, using the AMD v1 I/O page table format.

These changes provide the essential emulation required to boot and
support regular operation for a Linux guest that enables DMA remapping,
e.g. via the kernel parameters "iommu=nopt" or "iommu.passthrough=0".

A new amd-iommu device property "dma-remap" (default: off) is introduced
to control whether the feature is available. See below for a full
example of QEMU cmdline parameters used in testing.
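
For quick reference, the two relevant knobs are the amd-iommu device
property on the QEMU command line and the remapping mode on the guest
kernel command line. A minimal sketch (the "..." stands for the rest of
your usual configuration; the full working invocation is below):

-device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
-append "... iommu.passthrough=0"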

The patchset has been tested on an AMD EPYC Genoa host, with Linux 6.14
host and guest kernels, launching guests with up to 256 vCPUs, 512G
memory, and 16 CX6 VFs. Testing with IOMMU x2apic support enabled (i.e.
xtsup=on) requires fix:
https://lore.kernel.org/all/20250410064447.29583-3-sarun...@amd.com/

Although there is more work to do, I am sending this series as a patch
and not an RFC, since it provides a working implementation of the
feature. With this basic infrastructure in place, it becomes easier to
add/verify enhancements and new functionality. Here are some items I am
working to address in follow up patches:

- Page Fault and error reporting
- Add QEMU tracing and tests
- Provide control over VA Size advertised to guests
- Support hotplug/unplug of devices and other advanced features
   (suggestions welcomed)

Thank you,
Alejandro

---
Example QEMU command line:

$QEMU \
-nodefaults \
-snapshot \
-no-user-config \
-display none \
-serial mon:stdio -nographic \
-machine q35,accel=kvm,kernel_irqchip=split \
-cpu host,+topoext,+x2apic,-svm,-vmx,-kvm-msi-ext-dest-id \
-smp 32 \
-m 128G \
-kernel $KERNEL \
-initrd $INITRD \
-append "console=tty0 console=ttyS0 root=/dev/mapper/ol-root ro rd.lvm.lv=ol/root rd.lvm.lv=ol/swap iommu.passthrough=0" \
-device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
-blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=./OracleLinux-uefi-x86_64.qcow2 \
-device virtio-blk-pci,drive=drive0,id=virtio-disk0 \
-drive if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.fd,readonly=on \
-drive if=pflash,format=raw,unit=1,file=./OVMF_VARS.fd \
-device vfio-pci,host=0000:a1:00.1,id=net0
---

Alejandro Jimenez (18):
   memory: Adjust event ranges to fit within notifier boundaries
   amd_iommu: Add helper function to extract the DTE
   amd_iommu: Add support for IOMMU notifier
   amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
   amd_iommu: Toggle memory regions based on address translation mode
   amd_iommu: Set all address spaces to default translation mode on reset
   amd_iommu: Return an error when unable to read PTE from guest memory
   amd_iommu: Helper to decode size of page invalidation command
   amd_iommu: Add helpers to walk AMD v1 Page Table format
   amd_iommu: Add a page walker to sync shadow page tables on
     invalidation
   amd_iommu: Sync shadow page tables on page invalidation
   amd_iommu: Add replay callback
   amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
   amd_iommu: Toggle address translation on device table entry
     invalidation
   amd_iommu: Use iova_tree records to determine large page size on UNMAP
   amd_iommu: Do not assume passthrough translation when DTE[TV]=0
   amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
   amd_iommu: Do not emit I/O page fault events during replay()

  hw/i386/amd_iommu.c | 856 ++++++++++++++++++++++++++++++++++++++++----
  hw/i386/amd_iommu.h |  52 +++
  system/memory.c     |  10 +-
  3 files changed, 843 insertions(+), 75 deletions(-)


base-commit: 56c6e249b6988c1b6edc2dd34ebb0f1e570a1365

Hi Alejandro,
I tested the patches with FIO and with VFIO tests (using the guest's
/dev/vfio/vfio) inside the guest. Everything looks good to me.

I also compared fio performance with the following parameters on a
passthrough NVMe inside a guest with 16 vCPUs; a sketch of the
equivalent fio invocation follows the parameter list.

[FIO PARAMETERS]
NVMEs     = 1
JOBS/NVME = 16
MODE      = RANDREAD
IOENGINE  = LIBAIO
IODEPTH   = 32
BLOCKSIZE = 4K
SIZE      = 100%
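
For reference, a fio invocation matching the parameters above might
look like the following (an illustrative sketch; the job name and
device path are placeholders, not the exact command used):

fio --name=randread-test --filename=/dev/nvme0n1 \
    --ioengine=libaio --iodepth=32 --rw=randread \
    --bs=4k --numjobs=16 --size=100% --direct=1 \
    --group_reporting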

        RESULTS
=====================
Guest IOMMU     IOPS
mode           (kilo)
=====================
nopt             13.7
pt             1191.0
---------------------

I see that nopt (emulated IOMMU) has a huge performance penalty.
I wonder if DMA remapping is really useful with such a performance
penalty.

Regards
Sairaj Kodilkar


