On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
>
>
> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>>> IOMMU
>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>>> translation (even without vIOMMU).
>>>>>>>>
>>>>>>>> At the moment:
>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>> * the populate notifier() calls IOMMU notifiers
>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get
>>>>>>>> a VA
>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>
>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>>> operation because it uses the old state bitmap to know which
>>>>>>>> pages need
>>>>>>>> the populate() notifier.
>>>>>>>
>>>>>>> I assume we talk about this code: [1]
>>>>>>>
>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qi...@intel.com
>>>>>>>
>>>>>>>
>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>>>>>>> +                                         uint64_t offset, uint64_t size,
>>>>>>> +                                         bool shared_to_private)
>>>>>>> +{
>>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>>> +    int ret = 0;
>>>>>>> +
>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>>> +                     __func__, offset, size);
>>>>>>> +        return -1;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if ((shared_to_private &&
>>>>>>> +         memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>>>> +        (!shared_to_private &&
>>>>>>> +         memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (shared_to_private) {
>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>> +    } else {
>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (!ret) {
>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>> +
>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>> +
>>>>>>> +        if (shared_to_private) {
>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>> +        } else {
>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>>
>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>>> again.
>>>>>>>
>>>>>>> Why?
>>>>>>>
>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>
>>>>>>>
>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>> have to do it here?
>>>>>>
>>>>>> I was concerned about the case where the guest issues a request and
>>>>>> only part of the range is in the desired state.
>>>>>> I think the main problem is the policy for the guest conversion
>>>>>> request.
>>>>>> My current handling is:
>>>>>>
>>>>>> 1. When a conversion request is made for a range already in the
>>>>>> desired state, the helper simply returns success.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> 2. For requests involving a range partially in the desired state, only
>>>>>> the necessary segments are converted, ensuring the entire range
>>>>>> complies with the request efficiently.
>>>>>
>>>>>
>>>>> Ah, now I get:
>>>>>
>>>>> + if ((shared_to_private &&
>>>>> +      memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>> +     (!shared_to_private &&
>>>>> +      memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>> +     return 0;
>>>>> + }
>>>>> +
>>>>> +
>>>>>
>>>>> We're not failing if it might already partially be in the other state.
>>>>>
>>>>>> 3. In scenarios where a conversion request is declined by other
>>>>>> systems, such as a failure from VFIO during notify_populate(), the
>>>>>> helper will roll back the request, maintaining consistency.
>>>>>>
>>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>>> blocks are in the opposite state.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Actually, this part is still uncertain to me.
>>>>>>
>>>>>
>>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>>> time.
>>>>>
>>>>> Is there a known use case where such partial conversions could happen?
>>>>
>>>> I don't see such a case yet. Actually, I'm trying to follow the
>>>> behavior of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page
>>>> conversion. In KVM, it doesn't reject the request if the whole range
>>>> isn't in the opposite state; it just uses xa_store() to update it. I
>>>> also don't see the spec saying how to handle such a case. To be
>>>> robust, I just allow this special case.
>>>>
>>>>>
>>>>>> BTW, per the status/bitmap tracking, virtio-mem also changes the
>>>>>> bitmap after the plug/unplug notifier. This is the same, correct?
>>>>> Right. But because we reject these partial requests, we don't have to
>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>
>>>> Yes, if we treat it as a guest error/bug, we can adjust it.
>>>
>>> Hi David, which option do you think is better? If you prefer to reject
>>> the partial requests, I'll change it in my next version.
>>
>> Hi,
>>
>> still scratching my head. Having to work around it as in this patch here is
>> suboptimal.
>>
>> Could we simplify the whole thing while still allowing for (unexpected)
>> partial conversions?
>>
>> Essentially: if states are mixed, fall back to a "1 block at a time"
>> handling.
>>
>> The only problem is: what to do if we fail halfway through? Well, we
>> can only have such partial completions for "populate", not for discard.
>>
>> Option a) Just leave it as "partially completed populate" and return
>> the error. The bitmap and the notifiers are consistent.
>>
>> Option b) Just discard everything: someone tried to convert something
>> "partially shared" to "shared". So maybe, if anything goes wrong, we
>> can just have "all private".
>>
>> The question is also, what the expectation from the caller is: can the
>> caller even make progress on failure, or do we have to retry until it
>> works?
>
> Yes, that's the key problem.
>
> For the core mm side conversion, the caller (guest) handles three cases:
> success, failure, and retry. The guest can continue on failure, but it
> will keep the memory in its original attribute and trigger a call trace.
> On the QEMU side, it would stop the VM if kvm_set_memory_attributes()
> failed.
>
> As for the VFIO conversion, at present, we allow it to fail and don't
> return an error code to the guest as long as we undo the conversion. It
> only causes the device to not work in the guest.
>
> I think if we view the attribute mismatch between core mm and the IOMMU
> as a fatal error, we can stop the VM or let the guest retry until the
> conversion succeeds.
>
Thinking more about the options for handling the failure case
theoretically, as we haven't hit such a state_change() failure yet:

1. Undo + return an invalid error
   Pros: The guest can make progress.
   Cons: Complicated undo operations. Option a) is not applicable,
   because it leaves the range as a partially completed populate while
   the guest thinks the operation has failed.
   We would also need to undo set_memory_attribute() after
   state_change() fails, and maybe apply the attribute bitmap to the
   set_memory_attribute() operation as well to handle the mixed
   request case.

2. Undo in VFIO and no undo for set_memory_attribute() + return success
   (current approach in my series)
   Pros: The guest can make progress, although the device doesn't work.
   Cons: The attribute bitmap only tracks the status in the IOMMU.

3. No undo + return retry
   Pros: Keeps the attribute bitmap aligned between core mm and the
   IOMMU.
   Cons: The guest doesn't know how to handle the retry; it could loop
   forever.

4. No undo + no error return, just stop the VM
   Pros: Simple.
   Cons: Maybe overkill.

Maybe option 1 or 4 is better?
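To make option 1 more concrete, below is a minimal, self-contained C
sketch of the "1 block at a time" fallback with rollback. All names
(Mgr, convert_to_shared, notify_populate_one, fail_at, ...) are
hypothetical stand-ins for illustration only, not the actual QEMU
helpers or my patch; the real code would operate on mgr->shared_bitmap
and the registered RamDiscardListener notifiers instead:

```c
/* Hypothetical sketch of option 1: on a mixed-state request, convert
 * one block at a time; if a later block fails, undo only the blocks
 * this call populated, so the bitmap and the notifiers stay
 * consistent and the caller can also undo set_memory_attribute(). */
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8

typedef struct {
    bool shared[NBLOCKS];   /* stand-in for mgr->shared_bitmap */
} Mgr;

/* Test hook: index of the block whose populate notifier fails
 * (e.g. a VFIO DMA map failure), or -1 for no failure. */
static int fail_at = -1;

static int notify_populate_one(Mgr *mgr, int blk)
{
    if (blk == fail_at) {
        return -1;
    }
    mgr->shared[blk] = true;
    return 0;
}

static void notify_discard_one(Mgr *mgr, int blk)
{
    mgr->shared[blk] = false;
}

/* Convert blocks [first, first + n) to shared, block by block.
 * Blocks that are already shared (partial request) are skipped and
 * are NOT discarded on rollback, since this call didn't touch them. */
static int convert_to_shared(Mgr *mgr, int first, int n)
{
    bool was_shared[NBLOCKS];

    for (int i = first; i < first + n; i++) {
        was_shared[i] = mgr->shared[i];
        if (was_shared[i]) {
            continue;   /* already in the desired state */
        }
        if (notify_populate_one(mgr, i) < 0) {
            /* Roll back only what this call changed. */
            for (int j = first; j < i; j++) {
                if (!was_shared[j]) {
                    notify_discard_one(mgr, j);
                }
            }
            return -1;  /* caller undoes set_memory_attribute() too */
        }
    }
    return 0;
}
```

With this shape, a failure halfway through leaves the bitmap exactly as
it was before the call, which is what option 1 needs before reporting
the invalid error back to the guest.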
>