On 27.02.25 04:26, Chenyi Qiang wrote:


On 2/26/2025 8:43 PM, Chenyi Qiang wrote:


On 2/25/2025 5:41 PM, David Hildenbrand wrote:
On 25.02.25 03:00, Chenyi Qiang wrote:


On 2/21/2025 6:04 PM, Chenyi Qiang wrote:


On 2/21/2025 4:09 PM, David Hildenbrand wrote:
On 21.02.25 03:25, Chenyi Qiang wrote:


On 2/21/2025 3:39 AM, David Hildenbrand wrote:
On 20.02.25 17:13, Jean-Philippe Brucker wrote:
For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
notifiers and create e.g. VFIO mappings. The default VFIO discard
notifier isn't sufficient for CCA because the DMA addresses need a
translation (even without vIOMMU).

At the moment:
* guest_memfd_state_change() calls the populate() notifier
* the populate() notifier calls the IOMMU notifiers
* the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
* it calls ram_discard_manager_is_populated() which fails.

guest_memfd_state_change() only changes the section's state after
calling the populate() notifier. We can't easily invert the order of
operations because it uses the old state bitmap to know which pages
need the populate() notifier.
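
To make the ordering concrete, here is a tiny standalone sketch (made-up
names, not the actual QEMU code) of why the notifier's check fails:

/* Standalone illustration only; names are hypothetical, not the QEMU API. */
#include <stdbool.h>
#include <stdio.h>

static bool range_populated;          /* stands in for the shared/populated bitmap */

static int populate_notifier(void)    /* plays the role of the VFIO/IOMMU notifier */
{
    /* memory_get_xlat_addr() -> ram_discard_manager_is_populated() sees the
     * *old* state, because the bitmap is only flipped after the notifier. */
    if (!range_populated) {
        fprintf(stderr, "xlat failed: range still marked discarded\n");
        return -1;
    }
    return 0;                         /* would create the DMA mapping here */
}

int main(void)
{
    int ret = populate_notifier();    /* 1. notifier runs first ...               */
    if (!ret) {
        range_populated = true;       /* 2. ... state is only updated afterwards  */
    }
    return ret ? 1 : 0;
}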

I assume we're talking about this code: [1]

[1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qi...@intel.com


+static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
+                                         uint64_t size, bool shared_to_private)
+{
+    int block_size = memory_attribute_manager_get_block_size(mgr);
+    int ret = 0;
+
+    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+
+    if (shared_to_private) {
+        memory_attribute_notify_discard(mgr, offset, size);
+    } else {
+        ret = memory_attribute_notify_populate(mgr, offset, size);
+    }
+
+    if (!ret) {
+        unsigned long first_bit = offset / block_size;
+        unsigned long nbits = size / block_size;
+
+        g_assert((first_bit + nbits) <= mgr->bitmap_size);
+
+        if (shared_to_private) {
+            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+        } else {
+            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+        }
+
+        return 0;
+    }
+
+    return ret;
+}

Then, in memory_attribute_notify_populate(), we walk the bitmap again.

Why?

We just checked that it's all in the expected state, no?


virtio-mem doesn't handle it that way, so I'm curious why we would have
to do it here?

I was concerned about the case where the guest issues a request where
only part of the range is in the desired state.
I think the main problem is the policy for the guest conversion
request.
My current handling is:

1. When a conversion request is made for a range already in the desired
     state, the helper simply returns success.

Yes.

2. For requests involving a range partially in the desired state, only
     the necessary segments are converted, ensuring the entire range
     complies with the request efficiently.


Ah, now I get:

+    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
+        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
+        return 0;
+    }
+

We're not failing if it might already partially be in the other state.

3. In scenarios where a conversion request is declined by other systems,
     such as a failure from VFIO during notify_populate(), the helper will
     roll back the request, maintaining consistency.

And the policy of virtio-mem is to refuse the state change if not all
blocks are in the opposite state.

Yes.
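
For comparison, a minimal standalone sketch of that virtio-mem-style
policy (toy bitmap and made-up names, not the QEMU code), where mixed
requests are refused outright:

/* Toy model of the "refuse mixed requests" policy; not the QEMU implementation. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <errno.h>

static uint64_t shared_bitmap;        /* bit i set => block i is currently shared */

static bool blocks_all_set(unsigned first, unsigned n, bool want_set)
{
    for (unsigned i = first; i < first + n; i++) {
        if (!!(shared_bitmap & (1ULL << i)) != want_set) {
            return false;
        }
    }
    return true;
}

static int state_change(unsigned first, unsigned n, bool shared_to_private)
{
    /* Refuse unless every block is in the opposite state (treat mixed
     * ranges as a guest error), the way virtio-mem handles plug/unplug. */
    if (!blocks_all_set(first, n, shared_to_private)) {
        return -EINVAL;
    }
    /* ... notify listeners, then flip the whole range in one go ... */
    if (shared_to_private) {
        shared_bitmap &= ~(((1ULL << n) - 1) << first);
    } else {
        shared_bitmap |= ((1ULL << n) - 1) << first;
    }
    return 0;
}

int main(void)
{
    shared_bitmap = 0x3;                       /* blocks 0-1 shared, rest private */
    printf("%d\n", state_change(0, 4, false)); /* mixed state -> -EINVAL, refused */
    printf("%d\n", state_change(2, 2, false)); /* all private -> 0, converted     */
    return 0;
}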


Actually, this part is still uncertain to me.


IIUC, the problem does not exist if we only convert a single page at a
time.

Is there a known use case where such partial conversions could happen?

I don't see such a case yet. Actually, I'm trying to follow the behavior
of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
doesn't reject the request if the whole range isn't in the opposite
state. It just uses xa_store() to update it. Also, I don't see the spec
say how to handle such a case. To be robust, I just allow this special
case.
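
For reference, roughly how that ioctl looks from userspace (sketch only;
vm_fd and the helper name are made-up, assuming a kernel whose UAPI
provides KVM_SET_MEMORY_ATTRIBUTES):

/* Minimal sketch of the attribute-update ioctl the conversion path mirrors. */
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

static int set_range_private(int vm_fd, uint64_t gpa, uint64_t size)
{
    struct kvm_memory_attributes attrs = {
        .address    = gpa,
        .size       = size,
        .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,  /* 0 here would mean shared */
        .flags      = 0,
    };

    /*
     * KVM simply stores the new attributes for the whole range (xa_store());
     * it does not require the range to currently be entirely in the
     * opposite state.
     */
    return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}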


BTW, regarding the status/bitmap tracking, virtio-mem also changes the
bitmap after the plug/unplug notifier. This is the same, correct?
Right. But because we reject these partial requests, we don't have to
traverse the bitmap and could just adjust the bitmap operations.

Yes, if we treat it as a guest error/bug, we can adjust it.

Hi David, which option do you think is better? If you prefer to reject
partial requests, I'll change it in my next version.

Hi,

still scratching my head. Having to work around it as in this patch here is
suboptimal.

Could we simplify the whole thing while still allowing for (unexpected)
partial conversions?

Essentially: if states are mixed, fall back to a "1 block at a time"
handling.

The only problem is: what to do if we fail halfway through? Well, we can
only have such partial completions for "populate", not for discard.

Option a) Just leave it as "partially completed populate" and return the
error. The bitmap and the notifiers are consistent.

Option b) Just discard everything: someone tried to convert something
"partially shared" to "shared". So maybe, if anything goes wrong, we can
just end up with "all private".

The question is also what the expectation from the caller is: can the
caller even make progress on failure, or do we have to retry until it
works?

Yes, that's the key problem.

For the core mm side conversion, the caller (guest) handles three cases:
success, failure and retry. The guest can continue on failure but will
keep the memory in its original attribute and trigger a call trace. On
the QEMU side, it would stop the VM if kvm_set_memory_attributes() failed.

As for the VFIO conversion, at present, we allow it to fail and don't
return an error code to the guest as long as we undo the conversion. It
only causes the device to not work in the guest.

I think if we view the attribute mismatch between core mm and the IOMMU
as a fatal error, we can stop the VM or let the guest retry until the
conversion succeeds.


Just thinking more about the options for the failure case handling,
theoretically, as we haven't hit such a state_change() failure yet:

1. Undo + return invalid error
Pros: The guest can make progress.
Cons: Complicated undo operations. Option a) is not applicable, because
it leaves the range as a partially completed populate, but the guest
thinks the operation has failed.
We also need to add the undo for set_memory_attribute() after
state_change() fails. Maybe also apply the attribute bitmap to the
set_memory_attribute() operation to handle the mixed request case.

2. Undo in VFIO and no undo for set_memory_attribute() + return success
(current approach in my series)
Pros: The guest can make progress although the device doesn't work.
Cons: The attribute bitmap only tracks the status in the IOMMU.

Right, we should avoid that. Bitmap + notifiers should stay in sync.


3. No undo + return retry
Pros: It keeps the attribute bitmap aligned between core mm and the IOMMU.
Cons: The guest doesn't know how to handle the retry; it would cause an
infinite loop.

4. No undo + no return. Just stop the VM.
Pros: Simple.
Cons: Maybe overkill.

Maybe option 1 or 4 is better?

Well, we can get proper undo working using a temporary bitmap when
converting to shared and we run into this "mixed" scenario.

Something like this on top of your work:


From f36e3916596ed5952f15233eb7747c65a6424949 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <da...@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <da...@redhat.com>
---
 system/memory-attribute-manager.c | 95 +++++++++++++++++++++----------
 1 file changed, 65 insertions(+), 30 deletions(-)

diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..e98e7367c1 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-
-        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
-                                                    memory_attribute_notify_discard_cb);
+        rdl->notify_discard(rdl, &tmp);
     }
 }
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-
-        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
-                                                          memory_attribute_notify_populate_cb);
+        ret = rdl->notify_populate(rdl, &tmp);
         if (ret) {
             break;
         }
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
             if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
-
-            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
-                                                        memory_attribute_notify_discard_cb);
+            rdl2->notify_discard(rdl2, &tmp);
         }
     }
     return ret;
@@ -348,7 +342,12 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
 static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
                                          uint64_t size, bool shared_to_private)
 {
-    int block_size = memory_attribute_manager_get_block_size(mgr);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
     int ret = 0;
 
     if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +356,68 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
         return -1;
     }
 
-    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
-        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
-        return 0;
-    }
-
     if (shared_to_private) {
-        memory_attribute_notify_discard(mgr, offset, size);
-    } else {
-        ret = memory_attribute_notify_populate(mgr, offset, size);
-    }
-
-    if (!ret) {
-        unsigned long first_bit = offset / block_size;
-        unsigned long nbits = size / block_size;
-
-        g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
-        if (shared_to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private. */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                clear_bit(bit, mgr->shared_bitmap);
+                memory_attribute_notify_discard(mgr, cur, block_size);
+            }
+        } else {
+            /* Completely shared. */
             bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared. */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            unsigned long *modified_bitmap = bitmap_new(nbits);
+
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                set_bit(bit, mgr->shared_bitmap);
+                ret = memory_attribute_notify_populate(mgr, cur, block_size);
+                if (!ret) {
+                    set_bit(bit - first_bit, modified_bitmap);
+                    continue;
+                }
+                clear_bit(bit, mgr->shared_bitmap);
+                break;
+            }
+            if (ret) {
+                /*
+                 * Very unexpected: something went wrong. Revert to the old
+                 * state, marking only the blocks as private that we converted
+                 * to shared.
+                 */
+                for (cur = offset; cur < end; cur += block_size) {
+                    bit = cur / block_size;
+                    if (!test_bit(bit - first_bit, modified_bitmap))
+                        continue;
+                    assert(test_bit(bit, mgr->shared_bitmap));
+                    clear_bit(bit, mgr->shared_bitmap);
+                    memory_attribute_notify_discard(mgr, cur, block_size);
+                }
+            }
+            g_free(modified_bitmap);
         } else {
+            /* Completely private. */
             bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
         }
-
-        return 0;
     }
-
     return ret;
 }
--
2.48.1


--
Cheers,

David / dhildenb

