This is the v6 series of the shared device assignment support.

Compared with the last version [1], this series retains the basic support
and removes the additional complex error handling, which can be added
back when necessary. Meanwhile, the patchset has been re-organized to
be clearer.

Overview of this series:

- Patch 1-3: Preparation patches. These include function exposure and
  some function prototype changes.
- Patch 4: Introduce a new object to implement RamDiscardManager
  interface and a helper to notify the shared/private state change.
- Patch 5: Enable coordinated discarding of RAM with guest_memfd through
  the RamDiscardManager interface.

More small changes or details can be found in the individual patches.

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared and private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs memory conversions, QEMU
frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
one side, and allocates new pages from the other side. This will cause a
stale IOMMU mapping issue mentioned in [2] when we try to enable shared
device assignment in confidential VMs.

Solution
========
The key to enable shared device assignment is to update the IOMMU mappings
on page conversion. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversions is similar to
hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardmanager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.

Limitation and future extension
===============================
This series only supports the basic shared device assignment functionality.
There are still some limitations and areas that can be extended and
optimized in the future.

Relationship with in-place conversion
-------------------------------------
In-place page conversion is the ongoing work to allow mmap() of
guest_memfd to userspace so that both private and shared memory can use
the same physical memory as the backend. This new design eliminates the
need to discard pages during shared/private conversions. When it is
ready, shared device assignment needs be adjusted to achieve an
unmap-before-conversion-to-private and map-after-conversion-to-shared
sequence to be compatible with the change.

Partial unmap limitation
------------------------
VFIO expects the DMA mapping for a specific IOVA to be mapped and
unmapped with the same granularity. The guest may perform partial
conversion, such as converting a small region within a larger one. To
prevent such invalid cases, current operations are performed with 4K
granularity. This could be optimized after DMA mapping cut operation
[3] is introduced in the future. We can always perform a
split-before-unmap if partial conversions happens. If the split
succeeds, the unmap will succeed and be atomic. If the split fails, the
unmap process fails.

More attributes management
--------------------------
Current RamDiscardManager can only manage a pair of opposite states like
populated/discared or shared/private. If more states need to be
considered, for example, support virtio-mem in confidential VMs, three
states would be possible (shared populated/private populated/discard).
Current framework cannot handle such scenario and we need to think of
some new framework at that time [4].

Memory overhead optimization
----------------------------
A comment from Baolu [5] suggests considering using Maple Tree or a generic
interval tree to manage private/shared state instead of a bitmap, which
can reduce memory consumption. This optmization can also be considered
in other bitmap use cases like dirty bitmaps for guest RAM.

Testing
=======
This patch series is tested based on mainline kernel since TDX base
support has been merged. The QEMU repo is available at
QEMU: 
https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2

To facilitate shared device assignment with the NIC, employ the legacy
type1 VFIO with the QEMU command:

qemu-system-x86_64 [...]
    -device vfio-pci,host=XX:XX.X

The parameter of dma_entry_limit needs to be adjusted. For example, a
16GB guest needs to adjust the parameter like
vfio_iommu_type1.dma_entry_limit=4194304.

If use the iommufd-backed VFIO with the qemu command:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0


Because the new features like cut_mapping operation will only be support in 
iommufd.
It is more recommended to use the iommufd-backed VFIO.

Related link
============
[1] 
https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qi...@intel.com/
[2] 
https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonz...@redhat.com/
[3] 
https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_...@nvidia.com/
[4] 
https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090...@redhat.com/
[5] 
https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23d...@linux.intel.com/

Chenyi Qiang (5):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  memory: Change memory_region_set_ram_discard_manager() to return the
    result
  memory: Unify the definiton of ReplayRamPopulate() and
    ReplayRamDiscard()
  ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
    with guest_memfd
  physmem: Support coordinated discarding of RAM with guest_memfd

 MAINTAINERS                   |   1 +
 accel/kvm/kvm-all.c           |   9 +
 hw/virtio/virtio-mem.c        |  83 +++---
 include/system/memory.h       | 100 +++++--
 include/system/ramblock.h     |  22 ++
 migration/ram.c               |   5 +-
 system/memory.c               |  22 +-
 system/meson.build            |   1 +
 system/physmem.c              |  18 +-
 system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
 system/trace-events           |   3 +
 11 files changed, 660 insertions(+), 84 deletions(-)
 create mode 100644 system/ram-block-attributes.c

-- 
2.43.5


Reply via email to