Hi Zhenzhong,

On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Hi,
>
> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
> "Enable stage-1 translation for emulated device" series and
> "Enable stage-1 translation for passthrough device" series.
>
> This series is 2nd part focusing on passthrough device. We don't do
> shadowing of guest page table for passthrough device but pass stage-1
> page table to host side to construct a nested domain. There was some
> effort to enable this feature in old days, see [2] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in host IOMMU.
> As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform
s/be/is
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
>
>        .-------------.  .---------------------------.
>        |   vIOMMU    |  | Guest I/O page table      |
>        |             |  '---------------------------'
>        .----------------/
>        | PASID Entry |--- PASID cache flush --+
>        '-------------'                        |
>        |             |                        V
>        |             |           I/O page table pointer in GPA
>        '-------------'
>    Guest
>    ------| Shadow |---------------------------|--------
>          v        v                           v
>    Host
>        .-------------.  .------------------------.
>        |   pIOMMU    |  | FS for GIOVA->GPA      |
>        |             |  '------------------------'
>        .----------------/  |
>        | PASID Entry |     V (Nested xlate)
>        '----------------\.----------------------------------.
>        |             |   | SS for GPA->HPA, unmanaged domain|
>        |             |   '----------------------------------'
>        '-------------'
>    Where:
>     - FS = First stage page tables
>     - SS = Second stage page tables
>    <Intel VT-d Nested translation>
>
> There are some interactions between VFIO and vIOMMU
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>   instance to vIOMMU at vfio device realize stage.
> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>   to bind/unbind device to IOMMUFD backed domains, either nested
>   domain or not.
>
> See below diagram:
>
>        VFIO Device                                  Intel IOMMU
>    .-----------------.                         .-------------------.
>    |                 |                         |                   |
>    |       .---------|PCIIOMMUOps              |.-------------.    |
>    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>    |       | Device  |------------------------>|| Device list |    |
>    |       .---------|(unset_iommu_device)     |.-------------.    |
>    |                 |                         |        |          |
>    |                 |                         |        V          |
>    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>    |       | link    |<------------------------|  |   Device    |  |
>    |       .---------|            (detach_hwpt)|  .-------------.  |
>    |                 |                         |         |         |
>    |                 |                         |        ...        |
>    .-----------------.                         .-------------------.
>
> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
> whenever possible and create new one on demand, also supports multiple
> iommufd objects and ERRATA_772415.
>
> E.g., Stage-2 page table could be shared by different devices if there
> is no conflict and devices link to same iommufd object, i.e. devices
> under same host IOMMU can share same stage-2 page table. If there is
> conflict, i.e. there is one device under non cache coherency mode
> which is different from others, it requires a separate stage-2 page
> table in non-CC mode.
>
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in stage-2 page table. This series supports creating VTDIOASContainer
> with no readonly mappings. If there is a rare case that some IOMMUs
> on a multiple IOMMU host have ERRATA_772415 and others not, this
> design can still survive.
>
> See below example diagram for a full view:
>
>      IntelIOMMUState
>             |
>             V
>    .------------------.    .------------------.    .-------------------.
>    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>    .------------------.    .------------------.    .-------------------.
>             |                       |                                |
>             |                       .-->...                          |
>             V                                                        V
>      .-------------------.    .-------------------.          .---------------.
>      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>      .-------------------.    .-------------------.          .---------------.
>          |              |               |                            |
>          |              |               |                            |
>    .-----------.  .-----------.  .------------.                .------------.
>    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |                | IOMMUFD    |
>    | Device(CC)|  | Device(CC)|  | Device     |                | Device(CC) |
>    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |                | (errata)   |
>    |           |  |           |  | (iommufd0) |                | (iommufd0) |
>    .-----------.  .-----------.  .------------.                .------------.
>
> This series is also a prerequisite work for vSVA, i.e. Sharing
> guest application address space with passthrough devices.
>
> To enable stage-1 translation, only need to add
> "x-scalable-mode=on,x-flts=on".
> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>
> Passthrough device should use iommufd backend to work with stage-1
> translation.
> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If host doesn't support nested translation, qemu will fail with an unsupported
> report.

you're not mentioning lack of error reporting from HW S1 faults to
guests. Are there other deps missing?

Eric
>
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
>
> PATCH1-8: Add HWPT-based nesting infrastructure support
> PATCH9-10: Some cleanup work
> PATCH11: cap/ecap related compatibility check between vIOMMU and Host IOMMU
> PATCH12-19:Implement stage-1 page table for passthrough device
> PATCH20: Enable stage-1 translation for passthrough device
>
> Qemu code can be found at [3]
>
> TODO:
> - RAM discard
> - dirty tracking on stage-2 page table
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>
> Thanks
> Zhenzhong
>
> Changelog:
> rfcv2:
> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
> - add two cleanup patches(patch9-10)
> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>   iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
> Yi Liu (3):
>   intel_iommu: Replay pasid binds after context cache invalidation
>   intel_iommu: Propagate PASID-based iotlb invalidation to host
>   intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>
> Zhenzhong Duan (17):
>   backends/iommufd: Add helpers for invalidating user-managed HWPT
>   vfio/iommufd: Add properties and handlers to
>     TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>   HostIOMMUDevice: Introduce realize_late callback
>   vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>   vfio/iommufd: Implement [at|de]tach_hwpt handlers
>   host_iommu_device: Define two new capabilities
>     HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>   iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>   iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>   intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>     vtd_ce_get_pasid_entry
>   intel_iommu: Optimize context entry cache utilization
>   intel_iommu: Check for compatibility with IOMMUFD backed device when
>     x-flts=on
>   intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>   intel_iommu: Add PASID cache management infrastructure
>   intel_iommu: Bind/unbind guest page table to host
>   intel_iommu: ERRATA_772415 workaround
>   intel_iommu: Bypass replay in stage-1 page table mode
>   intel_iommu: Enable host device when x-flts=on in scalable mode
>
>  hw/i386/intel_iommu_internal.h     |   56 +
>  include/hw/i386/intel_iommu.h      |   33 +-
>  include/system/host_iommu_device.h |   40 +
>  include/system/iommufd.h           |   53 +
>  backends/iommufd.c                 |   58 +
>  hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>  hw/vfio/common.c                   |   17 +-
>  hw/vfio/iommufd.c                  |   48 +
>  backends/trace-events              |    1 +
>  hw/i386/trace-events               |   13 +
>  10 files changed, 1776 insertions(+), 203 deletions(-)
>
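
A minimal sketch of the ioas/hwpt sharing rule described in the cover
letter above, only to make the lookup order concrete. Everything in it is
hypothetical: the struct names (IOASContainer, S2Hwpt, Device), the helper
attach_device() and the exact-match policy on the errata flag are
assumptions for illustration and are not taken from the series. It models
"share whenever possible, create on demand": a device reuses a container
with the same iommufd object and the same read-only policy, and within
that container reuses a stage-2 table with the same cache-coherency mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model types for illustration; not the series' real structs. */
typedef struct S2Hwpt {
    bool cc;                    /* stage-2 table used by cache-coherent devices */
    int users;                  /* number of devices attached to this table */
    struct S2Hwpt *next;
} S2Hwpt;

typedef struct IOASContainer {
    int iommufd_id;             /* which iommufd object backs this container */
    bool rw_only;               /* true: no read-only mappings (ERRATA_772415) */
    S2Hwpt *hwpts;
    struct IOASContainer *next;
} IOASContainer;

typedef struct Device {
    int iommufd_id;
    bool cc;                    /* device runs in cache-coherent mode */
    bool errata;                /* its host IOMMU has ERRATA_772415 */
} Device;

static void *xcalloc(size_t size)
{
    void *p = calloc(1, size);
    if (!p) {
        abort();
    }
    return p;
}

/* Find a compatible container and stage-2 table, creating either on demand. */
static S2Hwpt *attach_device(IOASContainer **head, const Device *dev)
{
    IOASContainer *c;
    S2Hwpt *h;

    /* Assumption: containers are shared only on an exact policy match. */
    for (c = *head; c; c = c->next) {
        if (c->iommufd_id == dev->iommufd_id && c->rw_only == dev->errata) {
            break;
        }
    }
    if (!c) {
        c = xcalloc(sizeof(*c));
        c->iommufd_id = dev->iommufd_id;
        c->rw_only = dev->errata;
        c->next = *head;
        *head = c;
    }

    /* Within a container, one stage-2 table per cache-coherency mode. */
    for (h = c->hwpts; h; h = h->next) {
        if (h->cc == dev->cc) {
            break;
        }
    }
    if (!h) {
        h = xcalloc(sizeof(*h));
        h->cc = dev->cc;
        h->next = c->hwpts;
        c->hwpts = h;
    }

    assert(h->cc == dev->cc);
    h->users++;
    return h;
}

/* Count containers on the list (helper for checking the sharing result). */
static int count_containers(const IOASContainer *head)
{
    int n = 0;

    for (; head; head = head->next) {
        n++;
    }
    return n;
}
```

With the devices from the example diagram, two CC devices on iommufd0
end up on the same table, while the non-CC device, the errata device and
a device on another iommufd object each get their own. The sketch never
detaches or frees anything, which a real implementation would have to do.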