On 2025/5/23 03:29, Nicolin Chen wrote:
On Thu, May 22, 2025 at 06:50:42AM +0000, Duan, Zhenzhong wrote:


-----Original Message-----
From: Nicolin Chen <nicol...@nvidia.com>
Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
host

On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
+static const MemoryListener iommufd_s2domain_memory_listener = {
+    .name = "iommufd_s2domain",
+    .priority = 1000,
+    .region_add = iommufd_listener_region_add_s2domain,
+    .region_del = iommufd_listener_region_del_s2domain,
+};
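
For illustration, here is a minimal sketch of what an S2-domain
region_add callback typically does; this is not the actual body of the
patch's iommufd_listener_region_add_s2domain(). The S2DomainSketch type
and function name are hypothetical, while iommufd_backend_map_dma() is
QEMU's existing iommufd backend helper (signature assumed from
"sysemu/iommufd.h").

/* Sketch only: map plain guest RAM sections (GPA -> HVA) into the IOAS
 * backing the stage-2 HWPT. A matching region_del would mirror this
 * with iommufd_backend_unmap_dma(). */
typedef struct S2DomainSketch {
    MemoryListener listener;
    IOMMUFDBackend *iommufd;
    uint32_t ioas_id;        /* IOAS the S2 HWPT was allocated from */
} S2DomainSketch;

static void s2domain_region_add_sketch(MemoryListener *listener,
                                       MemoryRegionSection *section)
{
    S2DomainSketch *s2 = container_of(listener, S2DomainSketch, listener);
    hwaddr iova = section->offset_within_address_space;
    void *vaddr;

    if (!memory_region_is_ram(section->mr) ||
        memory_region_is_ram_device(section->mr)) {
        return;              /* only plain guest RAM goes into S2 */
    }

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;
    iommufd_backend_map_dma(s2->iommufd, s2->ioas_id, iova,
                            int128_get64(section->size), vaddr,
                            section->readonly);
}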

Would you mind elaborating when and how VT-d does all the S2 mappings?

When the guest triggers a PASID cache invalidation, the vIOMMU attaches the
device to a stage-2 page table if the guest's PGTT=PT, or to a nested page
table if PGTT=Stage1. All these page tables are created dynamically during
attach; we don't use VFIO's shadow page table. The S2 mappings are also
created during attach.
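
For illustration, a rough sketch of that attach flow; the helper names
here are hypothetical (not the patch's actual functions), and
VTD_SM_PASID_ENTRY_PT is assumed to be the scalable-mode PGTT encoding
for passthrough:

/* Sketch: page tables are created lazily at attach time, triggered by
 * the guest's PASID cache invalidation. */
static int vtd_bind_guest_pgtt_sketch(VTDAddressSpace *vtd_as, uint32_t pgtt,
                                      hwaddr s1_pgtbl, Error **errp)
{
    uint32_t hwpt_id;

    if (pgtt == VTD_SM_PASID_ENTRY_PT) {
        /* PGTT=PT: the device uses GPA directly, attach to the S2 HWPT */
        hwpt_id = vtd_get_or_create_s2_hwpt(vtd_as, errp);          /* hypothetical */
    } else {
        /* PGTT=Stage1: allocate a nested S1 HWPT on top of the S2 HWPT,
         * handing the guest's stage-1 page table pointer to the host */
        hwpt_id = vtd_alloc_nested_s1_hwpt(vtd_as, s1_pgtbl, errp); /* hypothetical */
    }
    if (!hwpt_id) {
        return -EINVAL;
    }

    /* Creating/attaching the S2 HWPT for the first time registers the
     * iommufd_s2domain_memory_listener, which populates the GPA->HPA
     * mappings for guest RAM. */
    return vtd_device_attach_hwpt(vtd_as, hwpt_id, errp);           /* hypothetical */
}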

OK. That I can understand.

Next question: what does VT-d actually map onto the S2 page table?
The entire guest RAM, or just a part of it?

On ARM, the VFIO listener would capture the entire RAM and map it
onto the S2 page table. I wonder if VT-d would do the same.

On ARM, the default vfio_memory_listener can capture the entire
guest RAM and add it to the address space. So what we do is basically
reuse the vfio_memory_listener:
https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.th...@huawei.com/

The thing is that when a VFIO device is attached to the container
in a nesting configuration, the ->get_address_space op should
return the system address space, since the S1 nested HWPT isn't
allocated yet. Then all the iommu address space routines in
vfio_listener_region_add() are skipped, ending up with the guest RAM
being mapped into the S2 HWPT correctly. Not until the S1 nested HWPT
is allocated by the guest OS (after the guest boots) can the
->get_address_space op return the iommu address space.
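
For reference, a heavily simplified paraphrase of the upstream VFIO
listener, just to show where the two paths diverge (error handling and
the many special cases are dropped, and helper signatures may differ
between QEMU versions):

/* Paraphrased sketch of hw/vfio's region_add logic. */
static void vfio_listener_region_add_sketch(MemoryListener *listener,
                                            MemoryRegionSection *section)
{
    VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
                                                 listener);
    hwaddr iova = section->offset_within_address_space;
    void *vaddr;

    if (memory_region_is_iommu(section->mr)) {
        /* iommu address space: register MAP/UNMAP notifiers to shadow
         * the guest IOVA mappings. This whole branch is skipped when
         * ->get_address_space() returned the system address space. */
        return;
    }

    if (!memory_region_is_ram(section->mr)) {
        return;
    }

    /* system address space: guest RAM sections land here and are mapped
     * GPA -> HPA into the container, i.e. the S2 HWPT when nesting. */
    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;
    vfio_container_dma_map(bcontainer, iova, int128_get64(section->size),
                           vaddr, section->readonly);
}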

When the S1 HWPT is allocated by the guest, who will notify VFIO to call
the ->get_address_space() op again to get the iommu address space?

Hmm, would you please elaborate why VFIO needs to call that again?

I can see that VFIO creates the MAP/UNMAP notifiers for an iommu address
space. However, a device operating in the nested translation mode
should go through the IOMMU HW for these two:
  - the S1 page table (MAP) will be created by the guest OS
  - an S1 invalidation (UNMAP) will be issued by the guest OS, then
    trapped by QEMU and forwarded through the HWPT uAPI to the host
    kernel (roughly as sketched below).
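
A rough sketch of that forwarding in terms of the raw iommufd uAPI,
assuming the current <linux/iommufd.h> definitions; QEMU would go
through its iommufd backend wrapper rather than calling ioctl()
directly:

/* Sketch: forward a guest stage-1 IOTLB invalidation to the host via
 * the IOMMU_HWPT_INVALIDATE uAPI on the S1 nested HWPT. */
#include <stdint.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int forward_s1_invalidation(int iommufd, uint32_t s1_hwpt_id,
                                   uint64_t addr, uint64_t npages, bool leaf)
{
    struct iommu_hwpt_vtd_s1_invalidate inv = {
        .addr = addr,
        .npages = npages,
        .flags = leaf ? IOMMU_VTD_INV_FLAGS_LEAF : 0,
    };
    struct iommu_hwpt_invalidate cmd = {
        .size = sizeof(cmd),
        .hwpt_id = s1_hwpt_id,
        .data_uptr = (uintptr_t)&inv,
        .data_type = IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
        .entry_len = sizeof(inv),
        .entry_num = 1,
    };

    return ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);
}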

As you mentioned, there is no need for a shadow page table in this
mode. What else does VT-d need from an iommu address space?

On ARM, the only reason we shift the address space is for KVM to
inject MSIs, as it only has the gIOVA and needs the iommu address
space to translate that to a gPA. Refer to kvm_arch_fixup_msi_route()
in target/arm/kvm.c, where it calls pci_device_iommu_address_space().
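
Roughly, the doorbell translation step there looks like the following
(a from-memory paraphrase, so check target/arm/kvm.c for the exact
code):

/* Paraphrased sketch: resolve the MSI doorbell gIOVA to a gPA through
 * the device's iommu address space before programming the KVM route. */
static int fixup_msi_doorbell_sketch(PCIDevice *dev, uint64_t address,
                                     uint64_t *doorbell_gpa)
{
    AddressSpace *as = pci_device_iommu_address_space(dev);
    MemoryRegionSection mrs;
    MemoryRegion *mr;
    hwaddr xlat, len = 1;

    if (as == &address_space_memory) {
        *doorbell_gpa = address;     /* no vIOMMU in the MSI path */
        return 0;
    }

    RCU_READ_LOCK_GUARD();
    mr = address_space_translate(as, address, &xlat, &len, true,
                                 MEMTXATTRS_UNSPECIFIED);
    if (!mr) {
        return -EINVAL;
    }
    mrs = memory_region_find(mr, xlat, 1);
    if (!mrs.mr) {
        return -EINVAL;
    }
    *doorbell_gpa = mrs.offset_within_address_space;
    memory_region_unref(mrs.mr);
    return 0;
}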

With this address space shift, the S2 mappings can simply be captured
and handled by vfio_memory_listener. Then such an s2domain listener
would be largely redundant.

I didn't get how ARM SMMU supports switching the address space. Will VFIO
call ->get_address_space() twice, once to get the system address space and
a second time to get the iommu address space?

The set_iommu_device() attaches the device to a stage-2 page table

Hmm, I'm not sure this is accurate. I think set_iommu_device()
just sets a handle for this particular device on the vIOMMU
side. It has no idea about the address space or the page table.

by default, indicating that the device works in the S1 passthrough
mode (for VT-d, that's PGTT=PT) at VM creation. And this is where
the system address space should be returned by get_address_space().

If the guest kernel sets an S1 Translate mode for the device (for
VT-d, that's PGTT=Stage1), QEMU would trap that and allocate an S1
HWPT for the device to attach to. Starting from here, get_address_space()
can return the iommu address space -- on ARM, we only need it for
KVM to translate MSIs.
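
A minimal sketch of that switching, assuming a hypothetical per-device
flag tracked by the vIOMMU; none of these type or helper names come
from the actual patches:

/* Sketch: PCIIOMMUOps.get_address_space switching between the system
 * address space (S1 passthrough, PGTT=PT) and the iommu address space
 * (guest owns an S1 table, PGTT=Stage1). */
static AddressSpace *viommu_get_address_space_sketch(PCIBus *bus,
                                                     void *opaque, int devfn)
{
    VIOMMUState *s = opaque;                                /* hypothetical */
    VIOMMUDevice *vdev = viommu_find_device(s, bus, devfn); /* hypothetical */

    if (!vdev || !vdev->s1_hwpt_allocated) {
        /* No guest S1 table yet: let VFIO see the system address space
         * so its listener maps guest RAM straight into the S2 HWPT. */
        return &address_space_memory;
    }

    /* The guest programmed an S1 table: expose the iommu address space,
     * e.g. so KVM can translate MSI doorbell gIOVAs. */
    return &vdev->as;
}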


Refer to the last reply. This seems to be different between the ARM and
VT-d emulation.

--
Regards,
Yi Liu
