On 6/5/2025 8:35 AM, Alexey Kardashevskiy wrote:
>
>
> On 4/6/25 21:04, Alexey Kardashevskiy wrote:
>>
>>
>> On 30/5/25 18:32, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>> discard. However, guest_memfd relies on discard operations for page
>>> conversion between private and shared memory, potentially leading to
>>> the stale IOMMU mapping issue when assigning hardware devices to
>>> confidential VMs via shared memory. To address this and allow shared
>>> device assignment, it is crucial to ensure the VFIO system refreshes
>>> its IOMMU mappings.
>>>
>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other. Therefore, similar actions are required for page
>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>> facilitate this process.
>>>
>>> Since guest_memfd is not an object, it cannot directly implement the
>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>>> have a memory backend while others do not. Notably, virtual BIOS
>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>> backend.
>>>
>>> To manage RAMBlocks with guest_memfd, define a new object named
>>> RamBlockAttributes to implement the RamDiscardManager interface. This
>>> object can store the guest_memfd information such as bitmap for shared
>>> memory and the registered listeners for event notification. In the
>>> context of RamDiscardManager, the shared state is analogous to
>>> populated, and the private state is treated as discarded. To notify
>>> conversion events, a new state_change() helper is exported so that
>>> users can notify registered listeners such as VFIO, allowing VFIO to
>>> dynamically DMA map/unmap the shared mappings.
>>>
>>> Note that the memory state is tracked at the host page size granularity,
>>> as the minimum conversion size can be one page per request and VFIO
>>> expects the DMA mapping for a specific iova to be mapped and unmapped
>>> with the same granularity. Confidential VMs may perform partial
>>> conversions, such as conversions on small regions within larger ones.
>>> To prevent such invalid cases and until DMA mapping cut operation
>>> support is available, all operations are performed with 4K granularity.
>>>
>>> In addition, memory conversion failures cause QEMU to quit instead of
>>> resuming the guest or retrying the operation at present. It would be
>>> future work to add more error handling or rollback mechanisms once
>>> conversion failures are allowed. For example, in-place conversion of
>>> guest_memfd could retry the unmap operation during the conversion from
>>> shared to private. For now, keep the complex error handling out of the
>>> picture as it is not required.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qi...@intel.com>
>>> ---
>>> Changes in v6:
>>> - Change the object type name from RamBlockAttribute to
>>> RamBlockAttributes. (David)
>>> - Save the associated RAMBlock instead of the MemoryRegion in
>>> RamBlockAttributes. (David)
>>> - Squash the state_change() helper introduction in this commit as
>>> well as the mixture conversion case handling. (David)
>>> - Change the block_size type from int to size_t and some cleanup in
>>> validation check. (Alexey)
>>> - Add a tracepoint to track the state changes. (Alexey)
>>>
>>> Changes in v5:
>>> - Revert to using the RamDiscardManager interface instead of
>>> introducing a new class hierarchy to manage private/shared state,
>>> and keep the new name RamBlockAttribute (compared with the
>>> MemoryAttributeManager in v3).
>>> - Use the *simple* version of object_define and object_declare,
>>> since state_change() is changed to an exported function instead
>>> of a virtual function in a later patch.
>>> - Move the introduction of the RamBlockAttribute field to this
>>> patch and rename it to ram_shared. (Alexey)
>>> - Call exit() when listener register/unregister fails. (Zhao)
>>> - Add ram-block-attributes.c to the Memory API related part in
>>> MAINTAINERS.
>>>
>>> Changes in v4:
>>> - Change the name from memory-attribute-manager to
>>> ram-block-attribute.
>>> - Implement the newly-introduced PrivateSharedManager instead of
>>> RamDiscardManager and change related commit message.
>>> - Define the new object in ramblock.h instead of adding a new file.
>>> ---
>>> MAINTAINERS | 1 +
>>> include/system/ramblock.h | 21 ++
>>> system/meson.build | 1 +
>>> system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>>> system/trace-events | 3 +
>>> 5 files changed, 506 insertions(+)
>>> create mode 100644 system/ram-block-attributes.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 6dacd6d004..8ec39aa7f8 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>> F: system/memory_mapping.c
>>> F: system/physmem.c
>>> F: system/memory-internal.h
>>> +F: system/ram-block-attributes.c
>>> F: scripts/coccinelle/memory-region-housekeeping.cocci
>>> Memory devices
>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>> index d8a116ba99..1bab9e2dac 100644
>>> --- a/include/system/ramblock.h
>>> +++ b/include/system/ramblock.h
>>> @@ -22,6 +22,10 @@
>>> #include "exec/cpu-common.h"
>>> #include "qemu/rcu.h"
>>> #include "exec/ramlist.h"
>>> +#include "system/hostmem.h"
>>> +
>>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes"
>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES)
>>> struct RAMBlock {
>>> struct rcu_head rcu;
>>> @@ -91,4 +95,21 @@ struct RAMBlock {
>>> ram_addr_t postcopy_length;
>>> };
>>> +struct RamBlockAttributes {
>>> + Object parent;
>>> +
>>> + RAMBlock *ram_block;
>>> +
>>> +    /* A set bit in the bitmap means the page is populated (shared) */
>>> + unsigned bitmap_size;
>>> + unsigned long *bitmap;
>>> +
>>> + QLIST_HEAD(, RamDiscardListener) rdl_list;
>>> +};
>>> +
>>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block);
>>> +void ram_block_attributes_destroy(RamBlockAttributes *attr);
>>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>>> uint64_t offset,
>>> + uint64_t size, bool to_discard);
>>> +
>>> #endif
>>> diff --git a/system/meson.build b/system/meson.build
>>> index c2f0082766..2747dbde80 100644
>>> --- a/system/meson.build
>>> +++ b/system/meson.build
>>> @@ -17,6 +17,7 @@ libsystem_ss.add(files(
>>> 'dma-helpers.c',
>>> 'globals.c',
>>> 'ioport.c',
>>> + 'ram-block-attributes.c',
>>> 'memory_mapping.c',
>>> 'memory.c',
>>> 'physmem.c',
>>> diff --git a/system/ram-block-attributes.c b/system/ram-block-attributes.c
>>> new file mode 100644
>>> index 0000000000..514252413f
>>> --- /dev/null
>>> +++ b/system/ram-block-attributes.c
>>> @@ -0,0 +1,480 @@
>>> +/*
>>> + * QEMU ram block attributes
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + * Chenyi Qiang <chenyi.qi...@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/error-report.h"
>>> +#include "system/ramblock.h"
>>> +#include "trace.h"
>>> +
>>> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes,
>>> + ram_block_attributes,
>>> + RAM_BLOCK_ATTRIBUTES,
>>> + OBJECT,
>>> + { TYPE_RAM_DISCARD_MANAGER },
>>> + { })
>>> +
>>> +static size_t
>>> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr)
>>> +{
>>> +    /*
>>> +     * Page conversion requests can be as small as a single 4K-aligned
>>> +     * page, so use the host page size as the granularity to track the
>>> +     * memory attributes.
>>> +     */
>>> + g_assert(attr && attr->ram_block);
>>> + g_assert(attr->ram_block->page_size == qemu_real_host_page_size());
>>> + return attr->ram_block->page_size;
>>> +}
>>> +
>>> +
>>> +static bool
>>> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm,
>>> + const MemoryRegionSection
>>> *section)
>>> +{
>>> + const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +    const uint64_t first_bit = section->offset_within_region / block_size;
>>> +    const uint64_t last_bit =
>>> +        first_bit + int128_get64(section->size) / block_size - 1;
>>> + unsigned long first_discarded_bit;
>>> +
>>> +    first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                             first_bit);
>>> + return first_discarded_bit > last_bit;
>>> +}
>>> +
>>> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s,
>>> + void *arg);
>>> +
>>> +static int
>>> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section,
>>> + void *arg)
>>> +{
>>> + RamDiscardListener *rdl = arg;
>>> +
>>> + return rdl->notify_populate(rdl, section);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section,
>>> + void *arg)
>>> +{
>>> + RamDiscardListener *rdl = arg;
>>> +
>>> + rdl->notify_discard(rdl, section);
>>> + return 0;
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_for_each_populated_section(
>>> +    const RamBlockAttributes *attr, MemoryRegionSection *section,
>>> +    void *arg, ram_block_attributes_section_cb cb)
>>> +{
>>> + unsigned long first_bit, last_bit;
>>> + uint64_t offset, size;
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> + int ret = 0;
>>> +
>>> + first_bit = section->offset_within_region / block_size;
>>> + first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> + first_bit);
>>> +
>>> + while (first_bit < attr->bitmap_size) {
>>> + MemoryRegionSection tmp = *section;
>>> +
>>> + offset = first_bit * block_size;
>>> + last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>>> + first_bit + 1) - 1;
>>> + size = (last_bit - first_bit + 1) * block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            break;
>>> +        }
>>> +
>>> + ret = cb(&tmp, arg);
>>> + if (ret) {
>>> +            error_report("%s: Failed to notify RAM discard listener: %s",
>>> +                         __func__, strerror(-ret));
>>> + break;
>>> + }
>>> +
>>> + first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> + last_bit + 2);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_for_each_discarded_section(
>>> +    const RamBlockAttributes *attr, MemoryRegionSection *section,
>>> +    void *arg, ram_block_attributes_section_cb cb)
>>> +{
>>> + unsigned long first_bit, last_bit;
>>> + uint64_t offset, size;
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> + int ret = 0;
>>> +
>>> + first_bit = section->offset_within_region / block_size;
>>> + first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
>>> + first_bit);
>>> +
>>> + while (first_bit < attr->bitmap_size) {
>>> + MemoryRegionSection tmp = *section;
>>> +
>>> + offset = first_bit * block_size;
>>> + last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
>>> + first_bit + 1) - 1;
>>> + size = (last_bit - first_bit + 1) * block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            break;
>>> +        }
>>> +
>>> + ret = cb(&tmp, arg);
>>> + if (ret) {
>>> +            error_report("%s: Failed to notify RAM discard listener: %s",
>>> +                         __func__, strerror(-ret));
>>> + break;
>>> + }
>>> +
>>> + first_bit = find_next_zero_bit(attr->bitmap,
>>> + attr->bitmap_size,
>>> + last_bit + 2);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static uint64_t
>>> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager *rdm,
>>> +                                             const MemoryRegion *mr)
>>> +{
>>> + const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +
>>> + g_assert(mr == attr->ram_block->mr);
>>> + return ram_block_attributes_get_block_size(attr);
>>> +}
>>> +
>>> +static void
>>> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm,
>>> +                                           RamDiscardListener *rdl,
>>> +                                           MemoryRegionSection *section)
>>> +{
>>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> + int ret;
>>> +
>>> + g_assert(section->mr == attr->ram_block->mr);
>>> + rdl->section = memory_region_section_new_copy(section);
>>> +
>>> + QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
>>> +
>>> +    ret = ram_block_attributes_for_each_populated_section(
>>> +        attr, section, rdl, ram_block_attributes_notify_populate_cb);
>>> + if (ret) {
>>> + error_report("%s: Failed to register RAM discard listener: %s",
>>> + __func__, strerror(-ret));
>>> + exit(1);
>>> + }
>>> +}
>>> +
>>> +static void
>>> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm,
>>> + RamDiscardListener *rdl)
>>> +{
>>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> + int ret;
>>> +
>>> + g_assert(rdl->section);
>>> + g_assert(rdl->section->mr == attr->ram_block->mr);
>>> +
>>> + if (rdl->double_discard_supported) {
>>> + rdl->notify_discard(rdl, rdl->section);
>>> + } else {
>>> +        ret = ram_block_attributes_for_each_populated_section(
>>> +            attr, rdl->section, rdl,
>>> +            ram_block_attributes_notify_discard_cb);
>>> + if (ret) {
>>> +            error_report("%s: Failed to unregister RAM discard listener: %s",
>>> +                         __func__, strerror(-ret));
>>> + exit(1);
>>> + }
>>> + }
>>> +
>>> + memory_region_section_free_copy(rdl->section);
>>> + rdl->section = NULL;
>>> + QLIST_REMOVE(rdl, next);
>>> +}
>>> +
>>> +typedef struct RamBlockAttributesReplayData {
>>> + ReplayRamDiscardState fn;
>>> + void *opaque;
>>> +} RamBlockAttributesReplayData;
>>> +
>>> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection *section,
>>> +                                              void *arg)
>>> +{
>>> + RamBlockAttributesReplayData *data = arg;
>>> +
>>> + return data->fn(section, data->opaque);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm,
>>> + MemoryRegionSection *section,
>>> +                                          ReplayRamDiscardState replay_fn,
>>> + void *opaque)
>>> +{
>>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    RamBlockAttributesReplayData data = { .fn = replay_fn,
>>> +                                          .opaque = opaque };
>>> +
>>> + g_assert(section->mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_for_each_populated_section(
>>> +        attr, section, &data, ram_block_attributes_rdm_replay_cb);
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm,
>>> + MemoryRegionSection *section,
>>> +                                          ReplayRamDiscardState replay_fn,
>>> + void *opaque)
>>> +{
>>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm);
>>> +    RamBlockAttributesReplayData data = { .fn = replay_fn,
>>> +                                          .opaque = opaque };
>>> +
>>> + g_assert(section->mr == attr->ram_block->mr);
>>> +    return ram_block_attributes_for_each_discarded_section(
>>> +        attr, section, &data, ram_block_attributes_rdm_replay_cb);
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr,
>>> +                                    uint64_t offset, uint64_t size)
>>> +{
>>> + MemoryRegion *mr = attr->ram_block->mr;
>>> +
>>> + g_assert(mr);
>>> +
>>> + uint64_t region_size = memory_region_size(mr);
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> +
>>> + if (!QEMU_IS_ALIGNED(offset, block_size) ||
>>> + !QEMU_IS_ALIGNED(size, block_size)) {
>>> + return false;
>>> + }
>>> + if (offset + size <= offset) {
>>> + return false;
>>> + }
>>> + if (offset + size > region_size) {
>>> + return false;
>>> + }
>>> + return true;
>>> +}
>>> +
>>> +static void ram_block_attributes_notify_discard(RamBlockAttributes *attr,
>>> +                                                uint64_t offset,
>>> +                                                uint64_t size)
>>> +{
>>> + RamDiscardListener *rdl;
>>> +
>>> + QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> + MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> + rdl->notify_discard(rdl, &tmp);
>>> + }
>>> +}
>>> +
>>> +static int
>>> +ram_block_attributes_notify_populate(RamBlockAttributes *attr,
>>> + uint64_t offset, uint64_t size)
>>> +{
>>> + RamDiscardListener *rdl;
>>> + int ret = 0;
>>> +
>>> + QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> + MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>>> +            continue;
>>> +        }
>>> + ret = rdl->notify_populate(rdl, &tmp);
>>> + if (ret) {
>>> + break;
>>> + }
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_range_populated(RamBlockAttributes *attr,
>>> +                                        uint64_t offset, uint64_t size)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> + const unsigned long first_bit = offset / block_size;
>>> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> + unsigned long found_bit;
>>> +
>>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1, first_bit);
>>> + return found_bit > last_bit;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr,
>>> + uint64_t offset, uint64_t size)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> + const unsigned long first_bit = offset / block_size;
>>> + const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> + unsigned long found_bit;
>>> +
>>> + found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>>> + return found_bit > last_bit;
>>> +}
>>> +
>>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>>> + uint64_t offset, uint64_t size,
>>> + bool to_discard)
>>> +{
>>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>>> + const unsigned long first_bit = offset / block_size;
>>> + const unsigned long nbits = size / block_size;
>>> + bool is_range_discarded, is_range_populated;
>>
>> Can be reduced to "discarded" and "populated".
>>
>>> + const uint64_t end = offset + size;
>>> + unsigned long bit;
>>> + uint64_t cur;
>>> + int ret = 0;
>>> +
>>> + if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> + __func__, offset, size);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + is_range_discarded =
>>> ram_block_attributes_is_range_discarded(attr, offset,
>>> + size);
>>
>> See - needlessly long line.
>>
>>> + is_range_populated =
>>> ram_block_attributes_is_range_populated(attr, offset,
>>> + size);
>>
>> If ram_block_attributes_is_range_populated() returned
>> (found_bit*block_size), you could tell from a single call if it is
>> populated (found_bit == size) or discarded (found_bit == 0), otherwise
>> it is a mix (and dump just this number in the tracepoint below).
>>
>> And then ditch ram_block_attributes_is_range_discarded() which is
>> practically cut-n-paste. And then open code
>> ram_block_attributes_is_range_populated().
>
> oops, cannot just drop find_next_bit(), my bad, need both
> find_next_bit() and find_next_zero_bit(). My point still stands though -
Yes, a single call is insufficient, as found_bit == 0 can also occur in
the mixed case.
> if this is coded right here without helpers - it will look simpler. Thanks,
Do you mean open-coding the helper functions? Since there are no other
callers, you are right, that would be simpler.