Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host

Yi Liu Thu, 12 Jun 2025 05:49:28 -0700

On 2025/5/28 15:12, Duan, Zhenzhong wrote:

-----Original Message-----
From: Nicolin Chen <nicol...@nvidia.com>
Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
host

OK. Let me clarify this at the top as I see the gap here now:

First, the vSMMU model is based on Zhenzhong's older series that
keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, whereas
this RFCv3 series keeps only an hwpt_id there. This ioas_id
is allocated when a passthrough cdev attaches to a VFIO container.

Second, the vSMMU model reuses the default IOAS via that ioas_id.
Since the VFIO container doesn't allocate a nesting parent S2 HWPT
(maybe it could?), the vSMMU allocates another S2 HWPT in the
vIOMMU code.

Third, the vSMMU model, for invalidation efficiency and HW Queue
support, isolates all emulated devices out of the nesting-enabled
vSMMU instance, as suggested by Jason. So, only passthrough devices
would use the nesting-enabled vSMMU instance, meaning there is no
need for IOMMU_NOTIFIER_IOTLB_EVENTS:


I see, then you need to check whether there is an emulated device under the
nesting-enabled vSMMU and fail if there is.

- MAP is not needed as there is no shadow page table. QEMU only
   traps the page table pointer and forwards it to host kernel.
- UNMAP is not needed as QEMU only traps invalidation requests
   and forwards them to host kernel.

(let's forget about the "address space switch" for MSI for now.)

So, in the vSMMU model, there is actually no need for the iommu
AS. And there is only one IOAS in the VM instance allocated by the
VFIO container. And this IOAS manages the GPA->PA mappings. So,
get_address_space() returns the system AS for passthrough devices.
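
To make the vSMMU-side policy described above concrete, here is a minimal
sketch in C. It is only a toy model under stated assumptions: ToyAddressSpace,
ToyDevice, toy_system_as and toy_vsmmu_get_address_space() are hypothetical
names, not QEMU's real types or callbacks, and returning NULL for an emulated
device behind a nesting-enabled instance is just one way to express the
"check and fail" suggestion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not QEMU's real types. */
typedef struct ToyAddressSpace {
    const char *name;
} ToyAddressSpace;

typedef struct ToyDevice {
    bool is_passthrough;      /* backed by a VFIO/iommufd cdev */
    ToyAddressSpace iommu_as; /* per-device IOMMU address space */
} ToyDevice;

/* Single system AS; its GPA->PA mappings live in the container's IOAS. */
static ToyAddressSpace toy_system_as = { "system" };

/*
 * Toy get_address_space policy for a vSMMU-like instance:
 *  - nesting enabled + passthrough device: system AS (no IOTLB notifier)
 *  - nesting enabled + emulated device:    rejected (NULL)
 *  - nesting disabled:                     per-device IOMMU AS
 */
static ToyAddressSpace *toy_vsmmu_get_address_space(bool nesting_enabled,
                                                    ToyDevice *dev)
{
    assert(dev != NULL);

    if (!nesting_enabled) {
        return &dev->iommu_as;
    }
    return dev->is_passthrough ? &toy_system_as : NULL;
}
```

With this shape, every passthrough device behind the nesting-enabled instance
resolves to the same system AS, which is what lets a single IOAS hold the
GPA->PA mappings.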

On the other hand, the VT-d model is a bit different. It's a giant
vIOMMU for all devices (either passthrough or emulated). For all
emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
iommu address space returned via get_address_space().

That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
for passthrough devices, right?


No, even if x-flts=on is configured on the QEMU cmdline, that only means the
virtual VT-d supports stage-1 translation; the guest can still choose to run in
legacy mode (stage-2), e.g., with kernel cmdline intel_iommu=on,sm_off

So before the guest runs, we don't know which kind of page table, stage-1 or
stage-2, the guest will use for this VFIO device. So we have to use the iommu AS
to catch stage-2's MAP event if the guest chooses stage-2.


@Zhenzhong, if guest decides to use legacy mode then vIOMMU should switch
the MRs of the device's AS, hence the IOAS created by VFIO container would
be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
switched to IOMMU MR. So it should be able to support shadowing the guest
IO page table. Hence, this should not be a problem.
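
The MR switch described in this paragraph could look roughly like the toy
model below. It is only a sketch: ToyGuestPtType, ToyDeviceAS and
toy_switch_address_space() are made-up names, and treating every non-stage-2
case as "keep the plain system-memory alias enabled" is an assumption for
illustration, not something this thread confirms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guest translation choices for one device. */
typedef enum {
    TOY_GUEST_PT_NONE,   /* translation off or pass-through */
    TOY_GUEST_PT_STAGE2, /* legacy mode, e.g. intel_iommu=on,sm_off */
    TOY_GUEST_PT_STAGE1, /* stage-1 page table bound to the host (nested) */
} ToyGuestPtType;

/* The two subregions under a device's AS; exactly one is enabled. */
typedef struct ToyDeviceAS {
    bool iommu_mr_enabled;  /* MAP/UNMAP notifications reach VFIO */
    bool nodmar_mr_enabled; /* plain alias of system memory */
} ToyDeviceAS;

/*
 * The AS handed to VFIO never changes; only the enabled subregion does.
 * Stage-2 (legacy) needs the IOMMU MR so that guest MAP/UNMAP events can
 * be shadowed; the other cases keep the system-memory alias (assumption).
 */
static void toy_switch_address_space(ToyDeviceAS *as, ToyGuestPtType type)
{
    assert(as != NULL);

    bool use_iommu_mr = (type == TOY_GUEST_PT_STAGE2);

    as->iommu_mr_enabled = use_iommu_mr;
    as->nodmar_mr_enabled = !use_iommu_mr;
}
```

The point of the sketch is that the object returned by get_address_space()
stays the same while the guest changes modes; only the enabled subregion
flips.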

@Nicolin, I think your major point is making the VFIO container IOAS as a
GPA IOAS (always return system AS in get_address_space op) and reusing it
when setting nested translation. Is it? I think it should work if:
1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
   request. But I don't want the get_address_space op to always return the
   system AS, for the reason mentioned by Zhenzhong above.
2) we can disallow emulated/passthru devices behind the same pcie-pci
   bridge[1]. For emulated devices, the AS should switch to the iommu MR, while
   for passthru devices, the AS needs to stick with the system MR so that the
   VFIO container IOAS can be kept as a GPA IOAS. To support this, letting the
   AS switch to the iommu MR and having a separate GPA IOAS is needed. This
   separate GPA IOAS can be shared by all the passthru devices.

[1]https://lore.kernel.org/all/sj0pr11mb6744e2ba00bbe677b2b49be992...@sj0pr11mb6744.namprd11.prod.outlook.com/#t
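
As a rough illustration of condition 1), a listener-side filter could be as
small as the predicate below. This is a hedged sketch only: ToySection and
toy_listener_should_map() are hypothetical names, not the real vfio listener
code, and how the vIOMMU would pass its "skip RO" request down to the listener
is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one section seen by the listener. */
typedef struct ToySection {
    const char *mr_name;
    uint64_t iova; /* guest physical address */
    uint64_t size;
    bool readonly;
} ToySection;

/*
 * Decide whether a section should be mapped into the GPA IOAS.
 * skip_readonly stands for the vIOMMU's request to leave RO ranges
 * out of the nesting parent (stage-2) page table.
 */
static bool toy_listener_should_map(const ToySection *sec, bool skip_readonly)
{
    assert(sec != NULL);

    if (sec->size == 0) {
        return false; /* nothing to map */
    }
    return !(skip_readonly && sec->readonly);
}
```

For example, an RO section such as pc.bios (iova fffc0000, size 40000) would
be skipped when the vIOMMU asks for it, while an RW RAM section would still be
mapped.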

So basically, we are ok with your idea. But we should decide if it is
necessary to support the topology in 2). I think this is a general
question. TBH, I don't have much information to judge if it is valuable.
Perhaps, let's hear from more people.


IIUC, in the VT-d model, a passthrough device also gets attached
to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
But it returns the iommu address space, treating them like those
emulated devices, although the underlying MR of the returned IOMMU
AS is backed by a nodmar MR (that is essentially a system AS).

This seems to completely ignore the default IOAS owned by the VFIO
container, because it needs to bypass those RO mappings(?)

Then for passthrough devices, the VT-d model allocates an internal
IOAS that further requires an internal S2 listener, which seems a
large duplication of what the VFIO container already does..

So, here are things that I want us to conclude:
1) Since the VFIO container already has an IOAS for a passthrough
    device, and IOMMU_NOTIFIER_IOTLB_EVENTS isn't seemingly needed,
    why not set up this default IOAS to manage gPA=>PA mappings by
    returning the system AS via get_address_space() for passthrough
    devices?

    I got that the VT-d model might have some concern against this,
    as the default listener would map those RO regions. Yet, maybe
    the right approach is to figure out a way to bypass RO regions
    in the core vs. duplicating another ioas_alloc()/map() and S2
    listener?

2) If (1) makes sense, I think we can further simplify the routine
    by allocating a nesting parent HWPT in iommufd_cdev_attach(),
    as long as the attaching device is identified as "passthrough"
    and there is "iommufd" in its "-device" string?

    After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
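
The decision proposed in 2) can be sketched as a small helper. This is a toy
model, not the real iommufd_cdev_attach() path: ToyAttachRequest,
TOY_HWPT_ALLOC_NEST_PARENT and toy_attach_hwpt_flags() are hypothetical names,
and the toy flag value only stands in for the kernel's
IOMMU_HWPT_ALLOC_NEST_PARENT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the kernel's IOMMU_HWPT_ALLOC_NEST_PARENT flag. */
#define TOY_HWPT_ALLOC_NEST_PARENT (1u << 0)

/* Hypothetical summary of what is known about the device at attach time. */
typedef struct ToyAttachRequest {
    bool is_passthrough; /* device identified as "passthrough" */
    bool uses_iommufd;   /* "iommufd" present in its "-device" string */
} ToyAttachRequest;

/*
 * Pick the HWPT allocation flags at attach time: request a nesting
 * parent only for passthrough devices that are backed by iommufd.
 */
static uint32_t toy_attach_hwpt_flags(const ToyAttachRequest *req)
{
    assert(req != NULL);

    if (req->is_passthrough && req->uses_iommufd) {
        return TOY_HWPT_ALLOC_NEST_PARENT;
    }
    return 0;
}
```

Whether the container can reliably tell at attach time that a device will sit
behind a nesting-capable vIOMMU is exactly the open question in this thread;
the sketch simply assumes that it can.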

On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:

vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size: 40000, vaddr: 7fb314200000, RO
vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size: 20000, vaddr: 7fb206c00000, RO

..

vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size: 1a000, vaddr: 7fb207ece000, RO


OK. They look like memory carveouts for FWs. "iova" is gPA right?

And they can be in the range of a guest RAM..

Mind elaborating why they shouldn't be mapped onto nesting parent
S2?


@Nicolin, It's due to ERRATA_772415.

IMHO, at least for vfio devices, I can see only one get_address_space()
call. So even if there are two ASes, how would vfio be notified when the
AS changes? Since the vIOMMU is the source of map/unmap requests, it looks fine
to always return the iommu AS and handle the AS switch by switching the enabled
subregions according to the guest vIOMMU translation types.


No, VFIO doesn't get notified when the AS changes.

The vSMMU model wants VFIO to stay in the system AS since the VFIO
container manages the S2 mappings for guest PA.

The "switch" in vSMMU model is only needed by KVM for MSI doorbell
translation. Thinking about it carefully, maybe it shouldn't switch AS
because VFIO might be confused if it somehow does get_address_space
again in the future..


@Nicolin, I don't quite get the detailed logic for the MSI stuff on SMMU. But I
agree with the last sentence. get_address_space should return a consistent
AS.

--
Regards,
Yi Liu
