On 25.02.25 03:00, Chenyi Qiang wrote:
On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
On 2/21/2025 4:09 PM, David Hildenbrand wrote:
On 21.02.25 03:25, Chenyi Qiang wrote:
On 2/21/2025 3:39 AM, David Hildenbrand wrote:
On 20.02.25 17:13, Jean-Philippe Brucker wrote:
For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
notifiers and create e.g. VFIO mappings. The default VFIO discard
notifier isn't sufficient for CCA because the DMA addresses need a
translation (even without vIOMMU).
At the moment:
* guest_memfd_state_change() calls the populate() notifier
* the populate() notifier calls the IOMMU notifiers
* the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
* it calls ram_discard_manager_is_populated() which fails.
guest_memfd_state_change() only changes the section's state after
calling the populate() notifier. We can't easily invert the order of
operations because it uses the old state bitmap to know which pages need
the populate() notifier.
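
To make the ordering concrete (a simplified sketch of the call chain;
"Manager", notify_populate() and BLOCK_SIZE are placeholder names, not
the actual QEMU symbols):

static int state_change(Manager *mgr, uint64_t offset, uint64_t size)
{
    /*
     * 1. The old bitmap decides which blocks need the populate()
     *    notifier.
     * 2. populate() -> IOMMU notifier -> memory_get_xlat_addr()
     *    -> ram_discard_manager_is_populated(), which still sees the
     *    old "discarded" state and fails, because ...
     */
    int ret = notify_populate(mgr, offset, size);

    /* 3. ... the bitmap is only flipped here, after the notifiers. */
    if (!ret) {
        bitmap_set(mgr->shared_bitmap, offset / BLOCK_SIZE,
                   size / BLOCK_SIZE);
    }
    return ret;
}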
I assume we talk about this code: [1]
[1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qi...@intel.com
+static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+                                         uint64_t size, bool shared_to_private)
+{
+    int block_size = memory_attribute_manager_get_block_size(mgr);
+    int ret = 0;
+
+    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+
+    if (shared_to_private) {
+        memory_attribute_notify_discard(mgr, offset, size);
+    } else {
+        ret = memory_attribute_notify_populate(mgr, offset, size);
+    }
+
+    if (!ret) {
+        unsigned long first_bit = offset / block_size;
+        unsigned long nbits = size / block_size;
+
+        g_assert((first_bit + nbits) <= mgr->bitmap_size);
+
+        if (shared_to_private) {
+            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+        } else {
+            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+        }
+
+        return 0;
+    }
+
+    return ret;
+}
Then, in memory_attribute_notify_populate(), we walk the bitmap again.
Why? We just checked that it's all in the expected state, no?

virtio-mem doesn't handle it that way, so I'm curious why we would have
to do it here?
I was concerned about the case where the guest issues a request in which
only part of the range is in the desired state.
I think the main problem is the policy for the guest conversion request.
My current handling is:
1. When a conversion request is made for a range already in the desired
state, the helper simply returns success.
Yes.
2. For requests involving a range partially in the desired state, only
the necessary segments are converted, ensuring the entire range
complies with the request efficiently.
Ah, now I get:
+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+
We're not failing if the range is already partially in the other state.
3. In scenarios where a conversion request is declined by other systems,
such as a failure from VFIO during notify_populate(), the helper will
roll back the request, maintaining consistency.
And the policy of virtio-mem is to refuse the state change if not all
blocks are in the opposite state.
Yes.
Actually, this part is still uncertain to me.
IIUC, the problem does not exist if we only convert a single page at a
time.
Is there a known use case where such partial conversions could happen?
I don't see such a case yet. Actually, I'm trying to follow the behavior
of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
doesn't reject the request if the whole range isn't in the opposite
state. It just uses xa_store() to update it. Also, I don't see the spec
say how to handle such a case. To be robust, I just allow this special
case.
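
(Roughly, from memory, the KVM side looks like the following; a
simplification of kvm_vm_set_mem_attributes(), not the verbatim kernel
code:)

/*
 * Each gfn's entry is overwritten unconditionally via xa_store(), so a
 * range in mixed state is simply converted rather than rejected.
 */
void *entry = attributes ? xa_mk_value(attributes) : NULL;
gfn_t gfn;
int r;

for (gfn = start; gfn < end; gfn++) {
    r = xa_err(xa_store(&kvm->mem_attr_array, gfn, entry,
                        GFP_KERNEL_ACCOUNT));
    if (r)
        break;
}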
BTW, regarding the status/bitmap tracking, virtio-mem also changes the
bitmap after the plug/unplug notifier. This is the same, correct?
Right. But because we reject these partial requests, we don't have to
traverse the bitmap and could just adjust the bitmap operations.
Yes, if we treat it as a guest error/bug, we can adjust it.
Hi David, which option do you think is better? If you prefer to reject
the partial requests, I'll change it in my next version.
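
(For reference, the rejecting variant would be a small change on top of
the helper above; a rough sketch, untested, reusing first_bit/nbits as
computed there:)

    /*
     * Sketch: treat a mixed-state range as a guest error and reject it,
     * so the bitmap update never needs a rollback walk.
     */
    if (shared_to_private) {
        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
            return 0;       /* already private */
        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
            return -EINVAL; /* unexpected mixture: reject */
        }
        memory_attribute_notify_discard(mgr, offset, size);
        bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
        return 0;
    }
    /* ... and the mirrored handling for private -> shared ... */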
Hi,
still scratching my head. Having to work around it as in this patch here is
suboptimal.
Could we simplify the whole thing while still allowing for (unexpected) partial
conversions?
Essentially: if states are mixed, fall back to a "1 block at a time" handling.
The only problem is: what to do if we fail halfway through? Well, we can
only have such partial completions for "populate", not for discard.
Option a) Just leave it as "partially completed populate" and return the
error. The bitmap and the notifiers are consistent.
Option b) Just discard everything: someone tried to convert something
"partially shared" to "shared". So maybe, if anything goes wrong, we can
just have "all private".
The question is also what the expectation from the caller is: can the
caller even make progress on failure, or do we have to retry until it
works? Both options would be compatible with "the caller must retry
either way until it works".
a) is probably easiest. Something like the following on top of your code:
From 40c001a57c3c1350f3a44288ccb013d903d300cf Mon Sep 17 00:00:00 2001
From: David Hildenbrand <da...@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <da...@redhat.com>
---
system/memory-attribute-manager.c | 66 +++++++++++++++++--------------
1 file changed, 36 insertions(+), 30 deletions(-)
diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..31960e4a54 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-
-        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
-                                                    memory_attribute_notify_discard_cb);
+        rdl->notify_discard(rdl, &tmp);
     }
 }
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-
-        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
-                                                          memory_attribute_notify_populate_cb);
+        ret = rdl->notify_populate(rdl, &tmp);
         if (ret) {
             break;
         }
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
             if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
-
-            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
-                                                        memory_attribute_notify_discard_cb);
+            rdl2->notify_discard(rdl2, &tmp);
         }
     }
     return ret;
@@ -348,7 +342,10 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
 static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
                                          uint64_t size, bool shared_to_private)
 {
-    int block_size = memory_attribute_manager_get_block_size(mgr);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    uint64_t cur_offset;
     int ret = 0;
 
     if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +354,41 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
         return -1;
     }
 
-    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
-        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
-        return 0;
-    }
-
     if (shared_to_private) {
-        memory_attribute_notify_discard(mgr, offset, size);
-    } else {
-        ret = memory_attribute_notify_populate(mgr, offset, size);
-    }
-
-    if (!ret) {
-        unsigned long first_bit = offset / block_size;
-        unsigned long nbits = size / block_size;
-
-        g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
-        if (shared_to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private. */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: not completely shared. */
+            for (cur_offset = offset; cur_offset < offset + size;
+                 cur_offset += block_size) {
+                memory_attribute_state_change(mgr, cur_offset, block_size,
+                                              true);
+            }
+        } else {
+            /* Completely shared. */
             bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared. */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: not completely private. */
+            for (cur_offset = offset; cur_offset < offset + size;
+                 cur_offset += block_size) {
+                ret = memory_attribute_state_change(mgr, cur_offset,
+                                                    block_size, false);
+                if (ret) {
+                    break;
+                }
+            }
         } else {
+            /* Completely private. */
             bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
         }
-
-        return 0;
     }
-
     return ret;
 }
--
2.48.1
--
Cheers,
David / dhildenb