Hi,

Per Jason Wang's suggestion, the iommufd nesting series [1] is split into
the "Enable stage-1 translation for emulated device" series and the
"Enable stage-1 translation for passthrough device" series.

This series is the 2nd part, focusing on passthrough devices. We don't
shadow the guest page table for a passthrough device; instead we pass the
stage-1 page table to the host side to construct a nested domain. There
was an earlier effort to enable this feature, see [2] for details.

The key design is to utilize the dual-stage IOMMU translation (also known
as IOMMU nested translation) capability in the host IOMMU. As the diagram
below shows, the guest I/O page table pointer in GPA (guest physical
address) is passed to the host and is used to perform the stage-1 address
translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  |  FS for GIOVA->GPA     |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>

There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance to vIOMMU at the vfio device realize stage.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt to
  bind/unbind a device to IOMMUFD backed domains, either nested domain
  or not.

See below diagram:

        VFIO Device                                  Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.

Based on Yi's suggestion, this design is optimized to share ioas/hwpt
whenever possible and to create new ones on demand. It also supports
multiple iommufd objects and ERRATA_772415.

E.g., a stage-2 page table can be shared by different devices if there is
no conflict and the devices link to the same iommufd object, i.e. devices
under the same host IOMMU can share the same stage-2 page table. If there
is a conflict, i.e. one device is in non cache coherency mode, which is
different from the others, it requires a separate stage-2 page table in
non-CC mode.

The SPR platform has ERRATA_772415, which requires no readonly mappings
in the stage-2 page table. This series supports creating a
VTDIOASContainer with no readonly mappings. In the rare case that some
IOMMUs on a multiple IOMMU host have ERRATA_772415 and others do not,
this design still works.

See below example diagram for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |              |              |                             |
          |              |              |                             |
    .-----------.  .-----------.  .------------.                .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |                | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |                | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |                | (errata)   |
    |           |  |           |  | (iommufd0) |                | (iommufd0) |
    .-----------.  .-----------.  .------------.                .------------.

This series is also prerequisite work for vSVA, i.e. sharing guest
application address space with passthrough devices.

To enable stage-1 translation, you only need to add
"x-scalable-mode=on,x-flts=on", i.e.:
-device intel-iommu,x-scalable-mode=on,x-flts=on...

A passthrough device should use the iommufd backend to work with stage-1
translation, i.e.:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU will fail with an
unsupported report.

Test done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

PATCH1-8:   Add HWPT-based nesting infrastructure support
PATCH9-10:  Some cleanup work
PATCH11:    cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH12-19: Implement stage-1 page table for passthrough device
PATCH20:    Enable stage-1 translation for passthrough device

QEMU code can be found at [3]

TODO:
- RAM discard
- dirty tracking on stage-2 page table

[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2

Thanks
Zhenzhong

Changelog:
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead (patch1-8) so arm nesting could easily rebase
- Add two cleanup patches (patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- Add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace

Yi Liu (3):
  intel_iommu: Replay pasid binds after context cache invalidation
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed

Zhenzhong Duan (17):
  backends/iommufd: Add helpers for invalidating user-managed HWPT
  vfio/iommufd: Add properties and handlers to
    TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  HostIOMMUDevice: Introduce realize_late callback
  vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
  vfio/iommufd: Implement [at|de]tach_hwpt handlers
  host_iommu_device: Define two new capabilities
    HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Optimize context entry cache utilization
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  intel_iommu: Add PASID cache management infrastructure
  intel_iommu: Bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround
  intel_iommu: Bypass replay in stage-1 page table mode
  intel_iommu: Enable host device when x-flts=on in scalable mode

 hw/i386/intel_iommu_internal.h     |   56 +
 include/hw/i386/intel_iommu.h      |   33 +-
 include/system/host_iommu_device.h |   40 +
 include/system/iommufd.h           |   53 +
 backends/iommufd.c                 |   58 +
 hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
 hw/vfio/common.c                   |   17 +-
 hw/vfio/iommufd.c                  |   48 +
 backends/trace-events              |    1 +
 hw/i386/trace-events               |   13 +
 10 files changed, 1776 insertions(+), 203 deletions(-)

--
2.34.1