Hi Zhenzhong

On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Hi,
>
> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
> "Enable stage-1 translation for emulated device" series and
> "Enable stage-1 translation for passthrough device" series.
>
> This series is the 2nd part, focusing on passthrough devices. We don't
> shadow the guest page table for a passthrough device; instead we pass the
> stage-1 page table to the host side to construct a nested domain. There was
> an earlier effort to enable this feature, see [2] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in host IOMMU.
> As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform
s/be/is
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
>
>         .-------------.  .---------------------------.
>         |   vIOMMU    |  | Guest I/O page table      |
>         |             |  '---------------------------'
>         .----------------/
>         | PASID Entry |--- PASID cache flush --+
>         '-------------'                        |
>         |             |                        V
>         |             |           I/O page table pointer in GPA
>         '-------------'
>     Guest
>     ------| Shadow |---------------------------|--------
>           v        v                           v
>     Host
>         .-------------.  .------------------------.
>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>         |             |  '------------------------'
>         .----------------/  |
>         | PASID Entry |     V (Nested xlate)
>         '----------------\.----------------------------------.
>         |             |   | SS for GPA->HPA, unmanaged domain|
>         |             |   '----------------------------------'
>         '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
> There are some interactions between VFIO and vIOMMU
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>   instance to vIOMMU at vfio device realize stage.
> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>   to bind/unbind device to IOMMUFD backed domains, either nested
>   domain or not.
>
> See below diagram:
>
>         VFIO Device                                 Intel IOMMU
>     .-----------------.                         .-------------------.
>     |                 |                         |                   |
>     |       .---------|PCIIOMMUOps              |.-------------.    |
>     |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>     |       | Device  |------------------------>|| Device list |    |
>     |       .---------|(unset_iommu_device)     |.-------------.    |
>     |                 |                         |       |           |
>     |                 |                         |       V           |
>     |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>     |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>     |       | link    |<------------------------|  |   Device    |  |
>     |       .---------|            (detach_hwpt)|  .-------------.  |
>     |                 |                         |       |           |
>     |                 |                         |       ...         |
>     .-----------------.                         .-------------------.
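
A minimal sketch of the two interactions in the diagram above, assuming a
single bus and heavily simplified types. HostDev, VIommu and every function
name here are hypothetical stand-ins, not the QEMU structures or the
signatures used in the series:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVFN 256

/* Hypothetical stand-in for a host IOMMU device handed over by VFIO. */
typedef struct HostDev {
    int devid;     /* device id inside the iommufd object */
    int hwpt_id;   /* currently attached page table, 0 when detached */
} HostDev;

/* Hypothetical vIOMMU state: one registration slot per devfn. */
typedef struct VIommu {
    HostDev *devs[MAX_DEVFN];
} VIommu;

/* VFIO -> vIOMMU at realize time: register the host device. */
int set_iommu_device(VIommu *s, int devfn, HostDev *hd)
{
    if (devfn < 0 || devfn >= MAX_DEVFN || s->devs[devfn] != NULL) {
        return -1;               /* bad devfn or slot already taken */
    }
    s->devs[devfn] = hd;
    return 0;
}

/* VFIO -> vIOMMU at unrealize time: drop the registration. */
void unset_iommu_device(VIommu *s, int devfn)
{
    if (devfn >= 0 && devfn < MAX_DEVFN) {
        s->devs[devfn] = NULL;
    }
}

/* vIOMMU -> host device: bind to a (nested or plain) page table. */
int attach_hwpt(HostDev *hd, int hwpt_id)
{
    if (hwpt_id <= 0) {
        return -1;               /* 0 is reserved for "detached" */
    }
    hd->hwpt_id = hwpt_id;
    return 0;
}

/* vIOMMU -> host device: unbind from the current page table. */
void detach_hwpt(HostDev *hd)
{
    hd->hwpt_id = 0;
}

/* vIOMMU helper: attach whichever device is registered at devfn. */
int viommu_bind(VIommu *s, int devfn, int hwpt_id)
{
    if (devfn < 0 || devfn >= MAX_DEVFN || s->devs[devfn] == NULL) {
        return -1;               /* nothing registered at this devfn */
    }
    return attach_hwpt(s->devs[devfn], hwpt_id);
}
```

The point of the sketch is only the direction of the calls: registration
flows from VFIO to the vIOMMU, while attach/detach flows from the vIOMMU
back to the registered host device.
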
>
> Based on Yi's suggestion, this design shares ioas/hwpt whenever possible
> and creates new ones on demand. It also supports multiple iommufd objects
> and ERRATA_772415.
>
> E.g., a stage-2 page table can be shared by different devices if there
> is no conflict and the devices link to the same iommufd object, i.e.
> devices under the same host IOMMU can share the same stage-2 page table.
> If there is a conflict, e.g. one device is in non cache coherency mode
> while the others are not, that device requires a separate stage-2 page
> table in non-CC mode.
>
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in stage-2 page table. This series supports creating VTDIOASContainer
> with no readonly mappings. If there is a rare case that some IOMMUs
> on a multiple IOMMU host have ERRATA_772415 and others not, this
> design can still survive.
>
> See below example diagram for a full view:
>
>       IntelIOMMUState
>              |
>              V
>     .------------------.    .------------------.    .-------------------.
>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>     .------------------.    .------------------.    .-------------------.
>              |                       |                              |
>              |                       .-->...                        |
>              V                                                      V
>       .-------------------.    .-------------------.          .---------------.
>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>       .-------------------.    .-------------------.          .---------------.
>           |            |               |                            |
>           |            |               |                            |
>     .-----------.  .-----------.  .------------.              .------------.
>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>     .-----------.  .-----------.  .------------.              .------------.
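
A minimal sketch of the find-or-create rule the diagram above describes,
assuming a container is keyed by (iommufd object, whether readonly mappings
are dropped) and a stage-2 page table inside it is keyed by cache coherency.
State, Container, S2Hwpt and get_s2_hwpt() are hypothetical names modelled
on the diagram, not code from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONTAINERS 8
#define MAX_HWPTS      8

/* Hypothetical model of a stage-2 page table (VTDS2Hwpt in the diagram). */
typedef struct S2Hwpt {
    bool cc;      /* cache coherent or not */
    int users;    /* number of devices sharing this page table */
} S2Hwpt;

/* Hypothetical model of VTDIOASContainer. */
typedef struct Container {
    int iommufd;                /* which iommufd object backs it */
    bool rw_only;               /* true: no readonly mappings (errata) */
    S2Hwpt hwpts[MAX_HWPTS];
    int nr_hwpts;
} Container;

/* Hypothetical model of the per-vIOMMU container list. */
typedef struct State {
    Container containers[MAX_CONTAINERS];
    int nr_containers;
} State;

/* Reuse a container with the same key, otherwise create one. */
Container *get_container(State *s, int iommufd, bool rw_only)
{
    for (int i = 0; i < s->nr_containers; i++) {
        Container *c = &s->containers[i];
        if (c->iommufd == iommufd && c->rw_only == rw_only) {
            return c;
        }
    }
    if (s->nr_containers == MAX_CONTAINERS) {
        return NULL;
    }
    Container *c = &s->containers[s->nr_containers++];
    c->iommufd = iommufd;
    c->rw_only = rw_only;
    c->nr_hwpts = 0;
    return c;
}

/* Reuse a stage-2 page table with matching CC mode, otherwise create one. */
S2Hwpt *get_s2_hwpt(State *s, int iommufd, bool errata, bool cc)
{
    Container *c = get_container(s, iommufd, errata);
    if (c == NULL) {
        return NULL;
    }
    for (int i = 0; i < c->nr_hwpts; i++) {
        if (c->hwpts[i].cc == cc) {
            c->hwpts[i].users++;
            return &c->hwpts[i];
        }
    }
    if (c->nr_hwpts == MAX_HWPTS) {
        return NULL;
    }
    S2Hwpt *h = &c->hwpts[c->nr_hwpts++];
    h->cc = cc;
    h->users = 1;
    return h;
}
```

With this model, two CC devices on iommufd0 end up on one page table, a
non-CC device on iommufd0 gets a second page table in the same container,
and a device behind an IOMMU with the errata gets its own RW-only container.
Whether non-errata devices may also join an RW-only container is left out
of the sketch.
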
>
> This series is also prerequisite work for vSVA, i.e. sharing a
> guest application address space with passthrough devices.
>
> To enable stage-1 translation, you only need to add
> "x-scalable-mode=on,x-flts=on",
> e.g. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>
> A passthrough device should use the iommufd backend to work with stage-1
> translation,
> e.g. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If the host doesn't support nested translation, QEMU fails and reports
> that it is unsupported.

You're not mentioning the lack of error reporting from HW S1 faults to
guests. Are there other deps missing?

Eric
>
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
>
> PATCH1-8:  Add HWPT-based nesting infrastructure support
> PATCH9-10: Some cleanup work
> PATCH11:   cap/ecap related compatibility check between vIOMMU and Host IOMMU
> PATCH12-19:Implement stage-1 page table for passthrough device
> PATCH20:   Enable stage-1 translation for passthrough device
>
> Qemu code can be found at [3]
>
> TODO:
> - RAM discard
> - dirty tracking on stage-2 page table
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>
> Thanks
> Zhenzhong
>
> Changelog:
> rfcv2:
> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
> - add two cleanup patches(patch9-10)
> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>   iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
> Yi Liu (3):
>   intel_iommu: Replay pasid binds after context cache invalidation
>   intel_iommu: Propagate PASID-based iotlb invalidation to host
>   intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>
> Zhenzhong Duan (17):
>   backends/iommufd: Add helpers for invalidating user-managed HWPT
>   vfio/iommufd: Add properties and handlers to
>     TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>   HostIOMMUDevice: Introduce realize_late callback
>   vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>   vfio/iommufd: Implement [at|de]tach_hwpt handlers
>   host_iommu_device: Define two new capabilities
>     HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>   iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>   iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>   intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>     vtd_ce_get_pasid_entry
>   intel_iommu: Optimize context entry cache utilization
>   intel_iommu: Check for compatibility with IOMMUFD backed device when
>     x-flts=on
>   intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>   intel_iommu: Add PASID cache management infrastructure
>   intel_iommu: Bind/unbind guest page table to host
>   intel_iommu: ERRATA_772415 workaround
>   intel_iommu: Bypass replay in stage-1 page table mode
>   intel_iommu: Enable host device when x-flts=on in scalable mode
>
>  hw/i386/intel_iommu_internal.h     |   56 +
>  include/hw/i386/intel_iommu.h      |   33 +-
>  include/system/host_iommu_device.h |   40 +
>  include/system/iommufd.h           |   53 +
>  backends/iommufd.c                 |   58 +
>  hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>  hw/vfio/common.c                   |   17 +-
>  hw/vfio/iommufd.c                  |   48 +
>  backends/trace-events              |    1 +
>  hw/i386/trace-events               |   13 +
>  10 files changed, 1776 insertions(+), 203 deletions(-)
>

