On 30.05.25 10:32, Chenyi Qiang wrote:
Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") highlighted that subsystems like VFIO may disable RAM block
discard. However, guest_memfd relies on discard operations for page
conversion between private and shared memory, potentially leading to
the stale IOMMU mapping issue when assigning hardware devices to
confidential VMs via shared memory. To address this and allow shared
device assignement, it is crucial to ensure the VFIO system refreshes
its IOMMU mappings.
RamDiscardManager is an existing interface (used by virtio-mem) to
adjust VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other. Therefore, similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.
Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. Implementing it in HostMemoryBackend is
not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
have a memory backend while others do not. Notably, virtual BIOS
RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
backend.
To manage RAMBlocks with guest_memfd, define a new object named
RamBlockAttributes to implement the RamDiscardManager interface. This
object can store the guest_memfd information such as bitmap for shared
memory and the registered listeners for event notification. In the
context of RamDiscardManager, shared state is analogous to populated, and
private state is signified as discarded. To notify the conversion events,
a new state_change() helper is exported for the users to notify the
listeners like VFIO, so that VFIO can dynamically DMA map/unmap the
shared mapping.
Note that the memory state is tracked at the host page size granularity,
as the minimum conversion size can be one page per request and VFIO
expects the DMA mapping for a specific iova to be mapped and unmapped
with the same granularity. Confidential VMs may perform partial
conversions, such as conversions on small regions within larger ones.
To prevent such invalid cases and until DMA mapping cut operation
support is available, all operations are performed with 4K granularity.
In addition, memory conversion failures cause QEMU to quit instead of
resuming the guest or retrying the operation at present. It would be
future work to add more error handling or rollback mechanisms once
conversion failures are allowed. For example, in-place conversion of
guest_memfd could retry the unmap operation during the conversion from
shared to private. For now, keep the complex error handling out of the
picture as it is not required.
Signed-off-by: Chenyi Qiang <chenyi.qi...@intel.com>
---
Changes in v6:
- Change the object type name from RamBlockAttribute to
RamBlockAttributes. (David)
- Save the associated RAMBlock instead MemoryRegion in
RamBlockAttributes. (David)
- Squash the state_change() helper introduction in this commit as
well as the mixture conversion case handling. (David)
- Change the block_size type from int to size_t and some cleanup in
validation check. (Alexey)
- Add a tracepoint to track the state changes. (Alexey)
Changes in v5:
- Revert to use RamDiscardManager interface instead of introducing
new hierarchy of class to manage private/shared state, and keep
using the new name of RamBlockAttribute compared with the
MemoryAttributeManager in v3.
- Use *simple* version of object_define and object_declare since the
state_change() function is changed as an exported function instead
of a virtual function in later patch.
- Move the introduction of RamBlockAttribute field to this patch and
rename it to ram_shared. (Alexey)
- call the exit() when register/unregister failed. (Zhao)
- Add the ram-block-attribute.c to Memory API related part in
MAINTAINERS.
Changes in v4:
- Change the name from memory-attribute-manager to
ram-block-attribute.
- Implement the newly-introduced PrivateSharedManager instead of
RamDiscardManager and change related commit message.
- Define the new object in ramblock.h instead of adding a new file.
---
MAINTAINERS | 1 +
include/system/ramblock.h | 21 ++
system/meson.build | 1 +
system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
system/trace-events | 3 +
5 files changed, 506 insertions(+)
create mode 100644 system/ram-block-attributes.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 6dacd6d004..8ec39aa7f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3149,6 +3149,7 @@ F: system/memory.c
F: system/memory_mapping.c
F: system/physmem.c
F: system/memory-internal.h
+F: system/ram-block-attributes.c
F: scripts/coccinelle/memory-region-housekeeping.cocci
Memory devices
diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index d8a116ba99..1bab9e2dac 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -22,6 +22,10 @@
#include "exec/cpu-common.h"
#include "qemu/rcu.h"
#include "exec/ramlist.h"
+#include "system/hostmem.h"
+
+#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
+OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
struct RAMBlock {
struct rcu_head rcu;
@@ -91,4 +95,21 @@ struct RAMBlock {
ram_addr_t postcopy_length;
};
+struct RamBlockAttributes {
+ Object parent;
+
+ RAMBlock *ram_block;
+
+ /* 1-setting of the bitmap represents ram is populated (shared) */
+ unsigned bitmap_size;
+ unsigned long *bitmap;
So, initially, all memory starts out as private, correct?
I guess this mimics what kvm_set_phys_mem() ends up doing, when it does
the kvm_set_memory_attributes_private() call.
So there is a short period of inconsistency, between creating the
RAMBlock and mapping it into the PA space.
It might be wroth spelling that out / documenting it somewhere.
--
Cheers,
David / dhildenb