On Thu, May 22, 2025 at 06:50:42AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Nicolin Chen <nicol...@nvidia.com>
> >Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
> >host
> >
> >On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
> >> +static const MemoryListener iommufd_s2domain_memory_listener = {
> >> +    .name = "iommufd_s2domain",
> >> +    .priority = 1000,
> >> +    .region_add = iommufd_listener_region_add_s2domain,
> >> +    .region_del = iommufd_listener_region_del_s2domain,
> >> +};
> >
> >Would you mind elaborating when and how VT-d does all the S2 mappings?
> 
> When the guest triggers a PASID cache invalidation, the vIOMMU attaches
> the device to a stage-2 page table if the guest's PGTT=PT, or to a nested
> page table if PGTT=Stage1. All of these page tables are created dynamically
> during attach. We don't use VFIO's shadow page table. The S2 mappings are
> also created during attach.

OK. That I can understand.

Next question: what does VTD actually map onto the S2 page table?
The entire guest RAM? Or just a part of that?

On ARM, the VFIO listener would capture the entire guest RAM and map it
onto the S2 page table. I wonder if VTD would do the same.
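
For context, here is a rough, hypothetical sketch of what I imagine such
an s2domain region_add could look like on the QEMU side. The container
struct, its fields, and the (missing) page alignment/clipping are my own
assumptions, not the actual patch; the point is that only RAM sections
get mapped, so in effect the whole of guest RAM ends up in the S2 IOAS:

/* Rough sketch only: "VTDS2Container" and its fields are invented names,
 * assuming the usual QEMU memory API and iommufd backend helpers.
 */
typedef struct VTDS2Container {
    MemoryListener listener;      /* registered on &address_space_memory */
    IOMMUFDBackend *iommufd;
    uint32_t ioas_id;             /* IOAS backing the S2 HWPT */
} VTDS2Container;

static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
                                                 MemoryRegionSection *section)
{
    VTDS2Container *c = container_of(listener, VTDS2Container, listener);
    hwaddr iova = section->offset_within_address_space;
    uint64_t size = int128_get64(section->size);
    void *vaddr;

    /* Only plain guest RAM gets mapped gPA -> HVA into the S2 IOAS */
    if (!memory_region_is_ram(section->mr) ||
        memory_region_is_ram_device(section->mr)) {
        return;
    }

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;

    if (iommufd_backend_map_dma(c->iommufd, c->ioas_id, iova, size, vaddr,
                                memory_region_is_rom(section->mr))) {
        error_report("s2domain: map failed at iova 0x%" HWADDR_PRIx, iova);
    }
}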

> >On ARM, the default vfio_memory_listener could capture the entire
> >guest RAM and add to the address space. So what we do is basically
> >reusing the vfio_memory_listener:
> >https://lore.kernel.org/qemu-devel/20250311141045.66620-13-
> >shameerali.kolothum.th...@huawei.com/
> >
> >The thing is that when a VFIO device is attached to the container
> >upon a nesting configuration, the ->get_address_space op should
> >return the system address space as S1 nested HWPT isn't allocated
> >yet. Then all the iommu AS routines in vfio_listener_region_add()
> >would be skipped, ending up with the guest RAM mapped in the S2 HWPT
> >correctly. Not until the S1 nested HWPT is allocated by the guest
> >OS (after guest boots), can the ->get_address_space op return the
> >iommu address space.
> 
> When the S1 HWPT is allocated by the guest, who will notify VFIO to call
> the ->get_address_space() op again to get the iommu address space?

Hmm, would you please elaborate why VFIO needs to call that again?

I can see that VFIO creates the MAP/UNMAP notifiers for an iommu address
space. However, a device operating in the nested translation mode
should go through the IOMMU HW for these two:
 - S1 page table (MAP) will be created by the guest OS
 - S1 invalidation (UNMAP) will be issued by the guest OS, and then
   trapped by QEMU and forwarded via the HWPT uAPI to the host kernel
   (see the sketch below).
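
For reference, at the raw kernel uAPI level, "forward to the HWPT uAPI"
boils down to an IOMMU_HWPT_INVALIDATE ioctl, roughly as in the minimal
sketch below. It assumes <linux/iommufd.h>, an open /dev/iommu fd and a
valid nested hwpt_id; the actual wrapper in this series may differ:

#include <stdint.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Push one guest stage-1 IOTLB invalidation down to the host kernel */
static int forward_vtd_s1_invalidation(int iommufd, uint32_t hwpt_id,
                                       uint64_t addr, uint64_t npages,
                                       bool leaf)
{
    struct iommu_hwpt_vtd_s1_invalidate inv = {
        .addr = addr,
        .npages = npages,
        .flags = leaf ? IOMMU_VTD_INV_FLAGS_LEAF : 0,
    };
    struct iommu_hwpt_invalidate cmd = {
        .size = sizeof(cmd),
        .hwpt_id = hwpt_id,                /* the nested S1 HWPT */
        .data_uptr = (uintptr_t)&inv,
        .data_type = IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
        .entry_len = sizeof(inv),
        .entry_num = 1,
    };

    /* On return, cmd.entry_num reports how many entries were consumed */
    return ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);
}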

As you mentioned, there is no need for a shadow page table in this
mode. What else does VT-d need from an iommu address space?

On ARM, the only reason we shift the address space is for KVM to
inject MSIs, as KVM only has the gIOVA and needs the iommu address
space to translate it to a gPA. Refer to kvm_arch_fixup_msi_route()
in target/arm/kvm.c, where it calls pci_device_iommu_address_space().
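
From memory, and heavily simplified (the real code takes the proper
locks and handles more corner cases), the gist of that fixup is the
sketch below; the helper name here is made up:

/* Sketch of the gIOVA -> gPA MSI doorbell fixup idea; not the verbatim
 * target/arm/kvm.c code.
 */
static int fixup_msi_doorbell(struct kvm_irq_routing_entry *route,
                              uint64_t giova, PCIDevice *dev)
{
    AddressSpace *as = pci_device_iommu_address_space(dev);
    hwaddr xlat, len, doorbell_gpa;
    MemoryRegionSection mrs;
    MemoryRegion *mr;

    if (as == &address_space_memory) {
        return 0;            /* no vIOMMU in the MSI path, nothing to fix */
    }

    /* Walk the iommu AS to see where the guest's doorbell gIOVA lands */
    mr = address_space_translate(as, giova, &xlat, &len, true,
                                 MEMTXATTRS_UNSPECIFIED);
    if (!mr) {
        return -EINVAL;
    }
    mrs = memory_region_find(mr, xlat, 1);
    if (!mrs.mr) {
        return -EINVAL;
    }
    doorbell_gpa = mrs.offset_within_address_space;
    memory_region_unref(mrs.mr);

    route->u.msi.address_lo = doorbell_gpa;
    route->u.msi.address_hi = doorbell_gpa >> 32;
    return 0;
}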

> >With this address space shift, S2 mappings can be simply captured
> >and done by vfio_memory_listener. Then, such an s2domain listener
> >would be largely redundant.
> 
> I didn't get how ARM SMMU supports switching the address space. Will VFIO
> call ->get_address_space() twice, once to get the system address space and
> again to get the iommu address space?

The set_iommu_device() op attaches the device to a stage-2 page table
by default, indicating that the device works in the S1 passthrough
mode (for VTD, that's PGTT=PT) at VM creation. And this is where
the system address space should be returned by get_address_space().

If the guest kernel sets an S1 Translate mode for the device (for
VTD, that's PGTT=Stage1), QEMU would trap that and allocate an S1
HWPT for the device to attach to. Starting from here, get_address_space()
can return the iommu address space -- on ARM, we only need it for
KVM to translate MSIs.

If the guest kernel sets an S1 Bypass mode for the device (for VTD,
that's PGTT=PT), the device would continue to stay in the system AS,
i.e. get_address_space() wouldn't need to switch.
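
In pseudo-code terms, the switch boils down to something like the sketch
below. The helper and the pgtt field are invented for illustration (the
PGTT mode macros are the ones I recall from QEMU's intel_iommu code):

/* Illustration only: get_address_space() keys off the PGTT mode that the
 * guest programmed for the device/PASID.
 */
static AddressSpace *vtd_get_address_space_sketch(VTDAddressSpace *vtd_as)
{
    switch (vtd_as->pgtt) {              /* hypothetical cached PGTT mode */
    case VTD_SM_PASID_ENTRY_PT:
        /* S1 bypass (passthrough): the device DMAs with gPA; stay in the
         * system AS so the VFIO listener keeps maintaining S2 mappings. */
        return &address_space_memory;
    case VTD_SM_PASID_ENTRY_FLT:
        /* S1 translate: a nested HWPT is attached and gIOVA is visible;
         * return the iommu AS (e.g. for KVM to translate the MSI doorbell). */
        return &vtd_as->as;
    default:
        return &address_space_memory;
    }
}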

> >> +static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> >> +                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
> >> +                              VTDPASIDEntry *pe, Error **errp)
> >> +{
> >> +    struct iommu_hwpt_vtd_s1 vtd;
> >> +    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
> >> +
> >> +    vtd_init_s1_hwpt_data(&vtd, pe);
> >> +
> >> +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> >> +                                    s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> >> +                                    sizeof(vtd), &vtd, &hwpt_id, errp)) {
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    hwpt->hwpt_id = hwpt_id;
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> >> +                                VTDHwpt *hwpt)
> >> +{
> >> +    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
> >> +}
> >
> >I think you did some substantial work to isolate the get_hw_info
> >part inside the iommufd backend code, which looks nice and clean
> >as the vIOMMU code simply does iodc->get_cap().
> >
> >However, that then makes these direct raw backend function calls
> >very awkward :-/
> >
> >In my view, the way to make sense is either:
> >* We don't do any isolation, but just call raw backend functions
> >  in vIOMMU code
> >* We do every isolation, and never call raw backend functions in
> >  vIOMMU code
> 
> The iommufd backend functions are general-purpose for all modules to use,
> including the vIOMMU. I think we are not blocking the vIOMMU from calling
> raw backend functions.

Well, I am not against doing that. Calling the raw iommufd APIs is
easier :)

> We just provide a general interface for querying capabilities; not all
> capabilities come from the iommufd get_hw_info result, e.g., aw_bits.

The question is why we need a general interface for get_hw_info(), yet
not for the other iommufd APIs.

IIRC, Cedric suggested a general interface for get_hw_info so that the
vIOMMU code can be unaware of the underlying source (it doesn't matter
whether it comes from iommufd or from VFIO).
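
e.g. the vIOMMU side then only needs something like the snippet below,
regardless of whether the device came in via iommufd or legacy VFIO
(exact cap names may differ from what this series ends up with):

/* Query through the abstract HostIOMMUDevice class: the vIOMMU doesn't
 * care which backend provided the data.
 */
HostIOMMUDeviceClass *iodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
int host_aw_bits = iodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_AW_BITS, errp);

whereas a raw call like iommufd_backend_get_device_info() ties the
caller to the iommufd backend specifically.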

I see aw_bits coming from the QEMU command line. Mind elaborating?

Thanks
Nicolin
