On Wed, Jan 22, 2025 at 12:43 AM Joao Martins <joao.m.mart...@oracle.com> wrote:
>
> On 07/01/2025 06:55, Zhangfei Gao wrote:
> > Hi, Joao
> >
> > On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.mart...@oracle.com> wrote:
> >>
> >> Hey,
> >>
> >> This series introduces support for vIOMMU with VFIO device migration,
> >> particularly related to how we do dirty page tracking.
> >>
> >> Today vIOMMUs serve two purposes: 1) enabling interrupt remapping and 2)
> >> providing DMA translation services for guests, i.e. some form of
> >> guest-kernel-managed DMA, e.g. for nested-virt based usage; (1) is especially
> >> required for big VMs with VFs and more than 255 vCPUs. We tackle both
> >> and remove the migration blocker when a vIOMMU is present, provided the
> >> conditions are met. I have both use-cases here in one series, but I am
> >> happy to tackle them in separate series.
> >>
> >> As I found out, we don't necessarily need to expose the whole vIOMMU
> >> functionality in order to just support interrupt remapping. x86 IOMMUs
> >> on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
> >> Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
> >> can instantiate an IOMMU just for interrupt remapping, without needing to
> >> advertise/support DMA translation. The AMD IOMMU can in theory provide
> >> the same, but Linux doesn't quite support the IR-only part there yet;
> >> only intel-iommu does.
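> >>
> >> For reference, with a qemu that has the 'dma-translation' knob, an
> >> IR-only intel-iommu can be instantiated with something along these
> >> lines (the VF address below is just a placeholder):
> >>
> >>   qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> >>       -device intel-iommu,intremap=on,dma-translation=off \
> >>       -device vfio-pci,host=0000:3b:00.1 \
> >>       ...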
> >>
> >> The series is organized as follows:
> >>
> >> Patches 1-5: Today we can't gather vIOMMU details before the guest
> >> establishes its first DMA mapping via the vIOMMU. So these first
> >> patches add a way for vIOMMUs to be asked about their properties at start
> >> of day. I chose the way with the least churn for now (as opposed to a
> >> treewide conversion) and allow easy conversion a posteriori. As
> >> suggested by Peter Xu[7], I have resurrected Yi's patches[5][6], which
> >> allow us to fetch the attributes of the vIOMMU backing a PCI device,
> >> without necessarily tying the caller (VFIO or anyone else) to an IOMMU MR
> >> like I was doing in v3.
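> >>
> >> To illustrate the idea only (the helper and attribute names below are
> >> placeholders, not the exact ones from Yi's patches), a caller such as
> >> VFIO could end up asking the backing vIOMMU through the PCI layer
> >> roughly like this:
> >>
> >>   /* Illustrative sketch with placeholder names, not the series' code. */
> >>   static bool vfio_viommu_translates_dma(PCIDevice *pdev)
> >>   {
> >>       bool dma_translation = true;
> >>
> >>       /* Query the vIOMMU backing this device directly via the PCI
> >>        * layer, instead of first looking up an IOMMU MR. If nothing
> >>        * answers, conservatively assume DMA translation is enabled. */
> >>       if (pci_device_get_iommu_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
> >>                                     &dma_translation)) {
> >>           return dma_translation;
> >>       }
> >>       return true;
> >>   }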
> >>
> >> Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
> >> DMA translation allowed. Today the 'dma-translation' attribute is
> >> x86-iommu only, but the way this series is structured nothing stops
> >> other vIOMMUs from supporting it too, as long as they use
> >> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
> >> are handled. The blocker is thus relaxed when vIOMMUs are able to
> >> toggle/report the DMA_TRANSLATION attribute. With the patches up to this
> >> set, we've then tackled item (1) of the second paragraph.
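> >>
> >> A rough sketch of what the relaxation could look like on the VFIO side,
> >> reusing the placeholder helper above (again, not the actual patch
> >> contents; 'vdev' stands for the VFIOPCIDevice at hand):
> >>
> >>   /* Only keep blocking migration when the vIOMMU can actually translate
> >>    * the device's DMA; an IR-only vIOMMU doesn't need the blocker. */
> >>   if (vfio_viommu_translates_dma(&vdev->pdev)) {
> >>       /* keep/add the vIOMMU migration blocker as before */
> >>   } else {
> >>       /* IR-only vIOMMU: no migration blocker needed for this device */
> >>   }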
> >
> > I'm not understanding how the device page table is handled.
> >
> > Does this mean that, after live migration, the page table built by the vIOMMU
> > will be rebuilt in the target guest via pci_setup_iommu_ops?
>
> AFAIU it is supposed to be done after loading the vIOMMU vmstate, when enabling
> the vIOMMU-related MRs. When walking the different 'emulated' address spaces
> it will replay all mappings (and skip non-present parts of the address space).
>
> The trick in making this work largely depends on the individual vIOMMU
> implementation (and this emulated vIOMMU stuff shouldn't be confused with
> IOMMU nesting, btw!). In the Intel case (and AMD will be similar), the root
> table pointer that's part of the vmstate references all the device pagetables,
> which are just guest memory that gets migrated over and is enough to resolve
> VT-d/IVRS page walks.
>
> The somewhat hard-to-follow part is that when it replays, it walks the whole
> DMAR memory region and only notifies IOMMU MR listeners where there's a
> present PTE, skipping it otherwise. So at the end of enabling the MRs, the
> IOTLB gets reconstructed. Though you would have to look at the flow with the
> vIOMMU you are using.
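>
> To make that concrete, here is a minimal sketch (using the generic memory
> API; this is not the actual vfio_iommu_map_notify code, and 'iommu_mr' is
> assumed to be the vIOMMU's IOMMU memory region) of the kind of MAP notifier
> that the replay ends up invoking once per present PTE:
>
>   /* Sketch only: a MAP notifier as seen during replay. The replay walks
>    * the migrated root/page tables and only notifies present entries. */
>   static void replay_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>   {
>       if (iotlb->perm == IOMMU_NONE) {
>           return; /* replay does not generate unmaps */
>       }
>       /* vfio would DMA-map [iova, iova + addr_mask] -> translated_addr */
>   }
>
>   IOMMUNotifier n;
>
>   iommu_notifier_init(&n, replay_map_notify, IOMMU_NOTIFIER_MAP,
>                       0, HWADDR_MAX, 0);
>   memory_region_iommu_replay(iommu_mr, &n);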
>
> The replay in intel-iommu is triggered by more or less this call stack for a
> present PTE:
>
> vfio_iommu_map_notify
> memory_region_notify_iommu_one
> vtd_replay_hook
> vtd_page_walk_one
> vtd_page_walk_level
> vtd_page_walk_level
> vtd_page_walk_level
> vtd_page_walk
> vtd_iommu_replay
> memory_region_iommu_replay
> vfio_listener_region_add
> address_space_update_topology_pass
> address_space_set_flatview
> memory_region_transaction_commit
> vtd_switch_address_space
> vtd_switch_address_space_all
> vtd_post_load
> vmstate_load_state
> vmstate_load
> qemu_loadvm_section_start_full
> qemu_loadvm_state_main
> qemu_loadvm_state
> process_incoming_migration_co

Thanks Joao for the info

Sorry, some more questions,

When the src boots up, the guest kernel sends commands to qemu; qemu
consumes these commands and triggers:

smmuv3_cmdq_consume
smmu_realloc_veventq
smmuv3_cmdq_consume
smmuv3_cmdq_consume SMMU_CMD_CFGI_STE
smmuv3_install_nested_ste
iommufd_backend_alloc_hwpt
host_iommu_device_iommufd_attach_hwpt

After live migration, the dst does not get these commands, so it does
not call smmuv3_install_nested_ste etc. As a result the DMA page table
is not set up and the kernel reports errors.

I'm not sure whether we need to replay these commands on the dst, or
directly copy the existing page table from the src to the dst.

Thanks
