On 23/04/2025 11:56, Sairaj Kodilkar wrote:
> On 4/23/2025 4:15 PM, Sairaj Kodilkar wrote:
>> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>>> This series adds support for guests using the AMD vIOMMU to enable DMA
>>> remapping for VFIO devices. In addition to the currently supported
>>> passthrough (PT) mode, guest kernels are now able to provide DMA
>>> address translation and access permission checking to VFs attached to
>>> paging domains, using the AMD v1 I/O page table format.
>>>
>>> These changes provide the essential emulation required to boot and
>>> support regular operation for a Linux guest enabling DMA remapping e.g.
>>> via kernel parameters "iommu=nopt" or "iommu.passthrough=0".
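[Editor's note: as a quick sanity check (not part of the original cover letter), a Linux guest booted with these parameters should report a translated default domain. The grep pattern and sysfs path below are standard kernel interfaces; the device BDF is only an example.]

```shell
# Inside the guest, confirm DMA remapping (not passthrough) is in effect.
# With iommu.passthrough=0 or iommu=nopt, the kernel logs the default
# domain type at boot:
dmesg | grep -i "Default domain type"
# Expected to show: "iommu: Default domain type: Translated"

# The per-group default domain type is also exposed via sysfs
# (example BDF; substitute the VFIO device in question):
cat /sys/bus/pci/devices/0000:a1:00.1/iommu_group/type
```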
>>>
>>> A new amd-iommu device property "dma-remap" (default: off) is introduced
>>> to control whether the feature is available. See below for a full
>>> example of QEMU cmdline parameters used in testing.
>>>
>>> The patchset has been tested on an AMD EPYC Genoa host, with Linux 6.14
>>> host and guest kernels, launching guests with up to 256 vCPUs, 512G
>>> memory, and 16 CX6 VFs. Testing with IOMMU x2apic support enabled (i.e.
>>> xtsup=on) requires fix:
>>> https://lore.kernel.org/all/20250410064447.29583-3-sarun...@amd.com/
>>>
>>> Although there is more work to do, I am sending this series as a patch
>>> and not an RFC since it provides a working implementation of the
>>> feature. With this basic infrastructure in place it becomes easier to
>>> add/verify enhancements and new functionality. Here are some items I am
>>> working to address in follow up patches:
>>>
>>> - Page Fault and error reporting
>>> - Add QEMU tracing and tests
>>> - Provide control over VA Size advertised to guests
>>> - Support hotplug/unplug of devices and other advanced features
>>>    (suggestions welcomed)
>>>
>>> Thank you,
>>> Alejandro
>>>
>>> ---
>>> Example QEMU command line:
>>>
>>> $QEMU \
>>> -nodefaults \
>>> -snapshot \
>>> -no-user-config \
>>> -display none \
>>> -serial mon:stdio -nographic \
>>> -machine q35,accel=kvm,kernel_irqchip=split \
>>> -cpu host,+topoext,+x2apic,-svm,-vmx,-kvm-msi-ext-dest-id \
>>> -smp 32 \
>>> -m 128G \
>>> -kernel $KERNEL \
>>> -initrd $INITRD \
>>> -append "console=tty0 console=ttyS0 root=/dev/mapper/ol-root ro
>>> rd.lvm.lv=ol/root rd.lvm.lv=ol/swap iommu.passthrough=0" \
>>> -device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
>>> -blockdev node-name=drive0,driver=qcow2,file.driver=file,\
>>> file.filename=./OracleLinux-uefi-x86_64.qcow2 \
>>> -device virtio-blk-pci,drive=drive0,id=virtio-disk0 \
>>> -drive if=pflash,format=raw,unit=0,\
>>> file=/usr/share/edk2/ovmf/OVMF_CODE.fd,readonly=on \
>>> -drive if=pflash,format=raw,unit=1,file=./OVMF_VARS.fd \
>>> -device vfio-pci,host=0000:a1:00.1,id=net0
>>> ---
>>>
>>> Alejandro Jimenez (18):
>>>    memory: Adjust event ranges to fit within notifier boundaries
>>>    amd_iommu: Add helper function to extract the DTE
>>>    amd_iommu: Add support for IOMMU notifier
>>>    amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
>>>    amd_iommu: Toggle memory regions based on address translation mode
>>>    amd_iommu: Set all address spaces to default translation mode on reset
>>>    amd_iommu: Return an error when unable to read PTE from guest memory
>>>    amd_iommu: Helper to decode size of page invalidation command
>>>    amd_iommu: Add helpers to walk AMD v1 Page Table format
>>>    amd_iommu: Add a page walker to sync shadow page tables on
>>>      invalidation
>>>    amd_iommu: Sync shadow page tables on page invalidation
>>>    amd_iommu: Add replay callback
>>>    amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
>>>    amd_iommu: Toggle address translation on device table entry
>>>      invalidation
>>>    amd_iommu: Use iova_tree records to determine large page size on UNMAP
>>>    amd_iommu: Do not assume passthrough translation when DTE[TV]=0
>>>    amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>>>    amd_iommu: Do not emit I/O page fault events during replay()
>>>
>>>   hw/i386/amd_iommu.c | 856 ++++++++++++++++++++++++++++++++++++++++----
>>>   hw/i386/amd_iommu.h |  52 +++
>>>   system/memory.c     |  10 +-
>>>   3 files changed, 843 insertions(+), 75 deletions(-)
>>>
>>>
>>> base-commit: 56c6e249b6988c1b6edc2dd34ebb0f1e570a1365
>>
>> Hi Alejandro,
>> I tested the patches with FIO and VFIO (using guest's /dev/vfio/vfio)
>> tests inside the guest. Everything looks good to me.
>>
>> I also compared the fio performance with following parameters on a
>> passthrough nvme inside the guest with 16 vcpus.
>>
>> [FIO PARAMETERS]
>> NVMEs     = 1
>> JOBS/NVME = 16
>> MODE      = RANDREAD
>> IOENGINE  = LIBAIO
>> IODEPTH   = 32
>> BLOCKSIZE = 4K
>> SIZE      = 100%
>>
>>         RESULTS
>> =====================
>> Guest
>> IOMMU          IOPS
>> mode          (kilo)
>> =====================
>> nopt           13.7
>> pt           1191.0
>> --------------------
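[Editor's note: for reference, the parameters listed above correspond to a fio job file roughly like the following. The job name and device path are illustrative, and direct=1 is an assumption typical for raw-device benchmarking, not stated in the original message.]

```ini
; Hypothetical fio job file matching the parameters above.
[global]
ioengine=libaio
direct=1          ; assumed: bypass guest page cache for raw-device I/O
rw=randread
bs=4k
iodepth=32
numjobs=16        ; 16 jobs per NVMe, per the parameters above
size=100%

[randread-nvme]
filename=/dev/nvme0n1   ; example path for the passthrough NVMe in the guest
```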
>>
>> I see that nopt (emulated IOMMU) has a huge performance penalty.
>> I wonder if the DMA remapping is really useful with such a performance
>> penalty.

This is not so much about performance as about guest compatibility (or simply
not breaking guests) once you expose the amd-iommu device: you cannot control
what your guests are running, old or new. It also brings parity with Intel for
Windows guests, which so far only work with the Intel vIOMMU. There are more
niche features as well (Windows Credential Guard and Windows firmware DMA
protection require a vIOMMU), and finally it enables general testing and
development of (v)IOMMU code against real VFs.
