This is the v6 series of the shared device assignment support. Compared with the last version [1], this series retains the basic support and removes the additional complex error handling, which can be added back when necessary. Meanwhile, the patchset has been re-organized to be clearer.

Overview of this series:

- Patches 1-3: Preparation patches. These include function exposure and some function prototype changes.
- Patch 4: Introduce a new object to implement the RamDiscardManager interface and a helper to notify the shared/private state change.
- Patch 5: Enable coordinated discarding of RAM with guest_memfd through the RamDiscardManager interface.

More small changes or details can be found in the individual patches.

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory. Shared memory is accessible from the host/VMM while private memory is not. Confidential VMs can decide which memory is shared/private and convert memory between shared and private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest private memory. In the current implementation, shared memory is allocated with normal methods (e.g. mmap or fallocate) while private memory is allocated from guest_memfd. When a VM performs memory conversions, QEMU frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from one side, and allocates new pages from the other side. This causes the stale IOMMU mapping issue mentioned in [2] when we try to enable shared device assignment in confidential VMs.

Solution
========
The key to enabling shared device assignment is to update the IOMMU mappings on page conversion. RamDiscardManager, an existing interface currently utilized by virtio-mem, offers a means to modify IOMMU mappings in accordance with VM page assignment. Page conversion is similar to hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardManager for confidential VMs and utilizes its infrastructure to notify VFIO of page conversions.

Limitation and future extension
===============================
This series only supports the basic shared device assignment functionality. There are still some limitations and areas that can be extended and optimized in the future.

Relationship with in-place conversion
-------------------------------------
In-place page conversion is the ongoing work to allow mmap() of guest_memfd to userspace so that both private and shared memory can use the same physical memory as the backend. This new design eliminates the need to discard pages during shared/private conversions. When it is ready, shared device assignment needs to be adjusted to follow an unmap-before-conversion-to-private and map-after-conversion-to-shared sequence to be compatible with the change.

Partial unmap limitation
------------------------
VFIO expects the DMA mapping for a specific IOVA to be mapped and unmapped with the same granularity. The guest may perform a partial conversion, such as converting a small region within a larger one. To prevent such invalid cases, the current operations are performed with 4K granularity. This could be optimized after the DMA mapping cut operation [3] is introduced in the future. We can then always perform a split-before-unmap if a partial conversion happens. If the split succeeds, the unmap will succeed and be atomic. If the split fails, the unmap process fails.

More attributes management
--------------------------
The current RamDiscardManager can only manage a pair of opposite states, like populated/discarded or shared/private. If more states need to be considered, for example to support virtio-mem in confidential VMs, three states would be possible (shared populated/private populated/discarded). The current framework cannot handle such a scenario, and we need to think of a new framework at that time [4].

Memory overhead optimization
----------------------------
A comment from Baolu [5] suggests considering a Maple Tree or a generic interval tree to manage the private/shared state instead of a bitmap, which can reduce memory consumption. This optimization can also be considered in other bitmap use cases, like dirty bitmaps for guest RAM.
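
To give a rough feel for the bitmap versus interval-tree trade-off, here is a minimal sketch. It is not code from this series: the helper names bitmap_bytes() and count_runs() are hypothetical, and the run count is only a proxy for the number of nodes an interval-based structure would need, not a measurement of any real Maple Tree.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes needed for a bitmap that tracks one
 * shared/private bit per block of block_size bytes.
 */
static uint64_t bitmap_bytes(uint64_t ram_size, uint64_t block_size)
{
    /* Assumption of this sketch: block size is a non-zero power of two. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    uint64_t nblocks = (ram_size + block_size - 1) / block_size;
    return (nblocks + 7) / 8;
}

/*
 * Hypothetical helper: count maximal runs of identical bits in the
 * first nbits bits of the bitmap. An interval-based structure would
 * need roughly one entry per run instead of one bit per block.
 */
static uint64_t count_runs(const uint8_t *bitmap, uint64_t nbits)
{
    uint64_t runs = 0;
    int prev = -1; /* no previous bit seen yet */

    for (uint64_t i = 0; i < nbits; i++) {
        int bit = (bitmap[i / 8] >> (i % 8)) & 1;
        if (bit != prev) {
            runs++;
            prev = bit;
        }
    }
    return runs;
}
```

The bitmap cost is fixed by the RAM size and granularity (for example, 16 GiB tracked at 4K granularity is 4194304 bits), while the interval cost depends on how fragmented the shared/private layout is. A guest that converts only a few large ranges produces few runs, which is the case where an interval structure would pay off.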

Testing
=======
This patch series is tested on the mainline kernel, since TDX base support has been merged. The QEMU repo is available at
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2

To facilitate shared device assignment with the NIC, employ the legacy type1 VFIO with the QEMU command:

qemu-system-x86_64 [...]
    -device vfio-pci,host=XX:XX.X

The dma_entry_limit parameter needs to be adjusted. For example, a 16GB guest needs vfio_iommu_type1.dma_entry_limit=4194304 (16GB / 4K, i.e. one entry per 4K page).

To use the iommufd-backed VFIO instead, use the QEMU command:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

Because new features such as the cut_mapping operation will only be supported in iommufd, the iommufd-backed VFIO is recommended.

Related link
============
[1] https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qi...@intel.com/
[2] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonz...@redhat.com/
[3] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_...@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090...@redhat.com/
[5] https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23d...@linux.intel.com/

Chenyi Qiang (5):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  memory: Change memory_region_set_ram_discard_manager() to return the
    result
  memory: Unify the definiton of ReplayRamPopulate() and
    ReplayRamDiscard()
  ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
    with guest_memfd
  physmem: Support coordinated discarding of RAM with guest_memfd

 MAINTAINERS                   |   1 +
 accel/kvm/kvm-all.c           |   9 +
 hw/virtio/virtio-mem.c        |  83 +++---
 include/system/memory.h       | 100 +++++--
 include/system/ramblock.h     |  22 ++
 migration/ram.c               |   5 +-
 system/memory.c               |  22 +-
 system/meson.build            |   1 +
 system/physmem.c              |  18 +-
 system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
 system/trace-events           |   3 +
 11 files changed, 660 insertions(+), 84 deletions(-)
 create mode 100644 system/ram-block-attributes.c

--
2.43.5