[sorry for the delay, I was out last week]

On 10/06/2024 17:53, Cédric Le Goater wrote:
> On 6/7/24 5:10 PM, Joao Martins wrote:
>> On 06/06/2024 16:43, Cédric Le Goater wrote:
>>> Hello Joao,
>>>
>>> On 6/22/23 23:48, Joao Martins wrote:
>>>> Hey,
>>>>
>>>> This series introduces support for vIOMMU with VFIO device
>>>> migration, particularly related to how we do the dirty page
>>>> tracking.
>>>>
>>>> Today vIOMMUs serve two purposes: 1) enable interrupt remapping,
>>>> 2) provide DMA translation services for guests, to allow some
>>>> form of guest-kernel-managed DMA, e.g. for nested-virt-based
>>>> usage; (1) is especially required for big VMs with VFs with more
>>>> than 255 vCPUs. We tackle both and remove the migration blocker
>>>> when a vIOMMU is present, provided the conditions are met. I have
>>>> both use cases here in one series, but I am happy to tackle them
>>>> in separate series.
>>>>
>>>> As I found out, we don't necessarily need to expose the whole
>>>> vIOMMU functionality in order to just support interrupt
>>>> remapping. x86 IOMMUs on Windows Server 2018[2] and Linux >= 5.10,
>>>> with QEMU 7.1+ (or really Linux guests with commit c40aaaac10,
>>>> and since QEMU commit 8646d9c773d8) can instantiate an IOMMU just
>>>> for interrupt remapping, without needing to advertise/support DMA
>>>> translation. The AMD IOMMU can in theory provide the same, but
>>>> Linux doesn't quite support the IR-only part there yet; only
>>>> intel-iommu does.
>>>>
>>>> The series is organized as follows:
>>>>
>>>> Patches 1-5: Today we can't gather vIOMMU details before the
>>>> guest establishes its first DMA mapping via the vIOMMU. So these
>>>> first patches add a way for vIOMMUs to be asked about their
>>>> properties at start of day. I chose the least-churn way for now
>>>> (as opposed to a tree-wide conversion), allowing easy conversion
>>>> a posteriori. As suggested by Peter Xu[7], I have resurrected
>>>> Yi's patches[5][6], which allow us to fetch the PCI-backing
>>>> vIOMMU attributes without necessarily tying the caller (VFIO or
>>>> anyone else) to an IOMMU MR like I was doing in v3.
>>>>
>>>> Patches 6-8: Handle configs with vIOMMU interrupt remapping but
>>>> without DMA translation allowed. Today the 'dma-translation'
>>>> attribute is x86-iommu only, but the way this series is
>>>> structured nothing stops other vIOMMUs from supporting it too, as
>>>> long as they use pci_setup_iommu_ops() and handle the necessary
>>>> IOMMU MR get_attr attributes. The blocker is thus relaxed when
>>>> vIOMMUs are able to toggle/report the DMA_TRANSLATION attribute.
>>>> With the patches up to this point, we've tackled item (1) of the
>>>> second paragraph.
>>>>
>>>> Patches 9-15: Simplified a lot from v2 (patch 9) to only track
>>>> the complete IOVA address space, leveraging the logic we use to
>>>> compose the dirty ranges. The blocker is once again relaxed for
>>>> vIOMMUs that advertise their IOVA addressing limits. This tackles
>>>> item (2). So far I mainly use it with intel-iommu, although I
>>>> have a small set of patches for virtio-iommu per Alex's
>>>> suggestion in v2.
>>>>
>>>> Comments, suggestions welcome. Thanks for the review!
>>>
>>> I spent some time refreshing your series on upstream QEMU (see
>>> [1]) and gave migration a try with a CX-7 VF. LGTM. It doesn't
>>> seem we are far from acceptance in QEMU 9.1. Are we?
>>>
>> Yeah.
>>
>> There was a comment from Zhenzhong on the vfio_viommu_preset()
>> here[0].
>> But I was looking at that to remind myself what it was that we had
>> to change, and even after re-reading the thread I can't spot any
>> flaw that needs a change.
>>
>> [0] https://lore.kernel.org/qemu-devel/de2b72d2-f56b-9350-ce0f-70edfb58e...@intel.com/#r
>
> I introduced a vfio_devices_all_viommu_preset() routine to check all
> devices in a container and a simplified version of
> vfio_viommu_get_max_iova() returning the space max_iova.
>
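
/me nods. For the archive, roughly what I picture from that
description (a sketch only, written from your words rather than from
the vfio-9.1 branch, so names and types may well differ):

    /*
     * True if every device in the container sits behind a vIOMMU,
     * i.e. its DMA address space is not the plain system memory one.
     */
    static bool vfio_devices_all_viommu_preset(VFIOContainerBase *bcontainer)
    {
        VFIODevice *vbasedev;

        QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
            if (vbasedev->bcontainer->space->as == &address_space_memory) {
                return false;
            }
        }

        return true;
    }

And I assume the simplified vfio_viommu_get_max_iova() boils down to
asking the vIOMMU MR for the IOMMU_ATTR_MAX_IOVA attribute the series
introduces (via memory_region_iommu_get_attr()) and handing that back
as the space max_iova.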

>>> First, I will resend these with the changes I made:
>>>
>>>   vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>>>   vfio/common: Move dirty tracking ranges update to helper()
>>>
>>> I guess the PCIIOMMUOps::get_iommu_attr needs a close review. Is
>>> IOMMU_ATTR_DMA_TRANSLATION a must have?
>>>
>> It's sort of the 'correct way' of relaxing vIOMMU checks, because
>> you are 100% guaranteed that the guest won't do DMA. The other
>> outstanding thing related to that, for older kernels, is to use the
>> direct map for dirty page tracking, where the moment a mapping is
>> attempted the migration doesn't start, or gets aborted if already
>> in progress[*]:
>>
>> https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.mart...@oracle.com/
>>
>> The above link and DMA_TRANSLATION are mostly for the use case that
>> only cares about the vIOMMU for interrupt remapping, with no DMA
>> translation services. But we can't just disable dma-translation in
>> QEMU because it may crash older kernels, so this way it supports
>> both old and new guests.
>>
>> [*] Recently I noticed you improved error reporting, so
>> vfio_set_migration_error(-EOPNOTSUPP) probably has a better way of
>> getting there.
>
> Yes. So, I did a little more change to improve
> vfio_dirty_tracking_init().
>

/me nods

>>> The rest is mostly VFIO internals for dirty tracking.
>>>
>> Right.
>>
>> I derailed with other work, and also with stuff required for IOMMU
>> dirty tracking, so I forgot about these patches, sorry.
>
> That's fine.
>
> I am trying to sort out which patches could be merged in advance for
> QEMU 9.1, and your series has shrunk a lot since it was last sent. I
> might resend the whole thing to cherry-pick the simple changes and
> get some R-b tags.
>

OK, sounds good.

> For the record, here is my watch list:
>
> QEMU series under review:
>
> * [v7] Add a host IOMMU device abstraction to check with vIOMMU
>   https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.d...@intel.com
>
>   Needs feedback on the PCI IOMMU ops. vIOMMU "iommufd" property?

Part of the reason I suggested splitting it was to allow other things
to progress, as the IOMMU ops part is related to the nesting work.

>   Pushed on vfio-9.1 branch.
>
> * [RFC v2] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>   https://lore.kernel.org/all/20240607143905.765133-1-eric.au...@redhat.com
>
>   Pushed on vfio-9.1 branch.
>
> Need a resend:
>
> * [v4] vfio: VFIO migration support with vIOMMU
>   https://lore.kernel.org/qemu-devel/20230622214845.3980-1-joao.m.mart...@oracle.com/
>
>   Refreshed the patchset on upstream and pushed on vfio-9.1 branch.

/me nods

This probably deserves an item on the list too, related to this
subject of vIOMMU and migration, once the vIOMMU series is done:

* https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.mart...@oracle.com/

> * [RFCv2] vfio/iommufd: IOMMUFD Dirty Tracking
>   https://lore.kernel.org/qemu-devel/20240212135643.5858-1-joao.m.mart...@oracle.com/
>

I plan on submitting a follow-up targeting 9.1, likely next week, with
Avihai's comments addressed, on top of the vfio-9.1 branch, after I
send some dirty tracking fixes on the kernel side. Though it is mostly
to progress review, as I think I am still dependent on Zhenzhong's
prep series for merging because of this patch:

https://lore.kernel.org/all/20240605083043.317831-8-zhenzhong.d...@intel.com/
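
BTW, going back to your question above on whether
IOMMU_ATTR_DMA_TRANSLATION is a must have: the consumer side is tiny.
Something along these lines (hypothetical shape; the wrapper name and
the attribute come from the series, none of this is merged code):

    bool dma_translation = true; /* conservative default: assume on */

    /*
     * Reach PCIIOMMUOps::get_iommu_attr through a (hypothetical)
     * pci_device_get_iommu_attr() wrapper. Only when the vIOMMU
     * positively reports translation as off (the IR-only setups) do
     * we know the guest cannot create DMA mappings, so the migration
     * blocker can go away without any vIOMMU dirty tracking.
     */
    if (!pci_device_get_iommu_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
                                   &dma_translation) &&
        !dma_translation) {
        return 0; /* no migration blocker needed */
    }

If the vIOMMU doesn't implement the attribute we just keep the
blocker, so it degrades safely; that's why I'd call it the 'correct
way' rather than a hard requirement.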
> * [v1] vfio: container: Fix missing allocation of VFIOSpaprContainer
>   https://lore.kernel.org/qemu-devel/171528203026.8420.10620440513237875837.st...@ltcd48-lp2.aus.stglabs.ibm.com/
>
> Other interesting series (IOMMU related):
>
> * [rfcv2] intel_iommu: Enable stage-1 translation for emulated device
>   https://lore.kernel.org/all/20240522062313.453317-1-zhenzhong.d...@intel.com/
>
> * [PATCH ats_vtd v2 00/25] ATS support for VT-d
>   https://lore.kernel.org/all/20240515071057.33990-1-clement.mathieu--d...@eviden.com/
>
> * [RFC v3] SMMUv3 nested translation support
>   https://lore.kernel.org/qemu-devel/20240429032403.74910-1-smost...@google.com/
>
> * [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)
>   https://lore.kernel.org/all/cover.1712978212.git.nicol...@nvidia.com/
>
>   Yet to be sent:
>   https://github.com/nicolinc/qemu/commits/wip/iommufd_vcmdq/
>
> * [RFC] Multifd device state transfer support with VFIO consumer
>   https://lore.kernel.org/all/cover.1713269378.git.maciej.szmigi...@oracle.com/
>

I think Maciej (CC'ed) plans on submitting a (simplified, I think) v2
of this one shortly, still targeting 9.1. But it is largely migration
core changes, with only the last two patches touching vfio.

Joao
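
P.S. for anyone wanting to try the IR-only configuration mentioned at
the top of the thread, the QEMU side is just the following (the VF
address is a placeholder, and intremap needs the split irqchip):

    qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,intremap=on,dma-translation=off \
        -device vfio-pci,host=0000:3b:00.1 \
        ...

i.e. the guest keeps interrupt remapping (needed for >255 vCPUs) while
being told there is no DMA translation, which is what lets the blocker
be lifted without tracking any vIOMMU mappings.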