On 2/27/2025 7:27 PM, David Hildenbrand wrote:
> On 27.02.25 04:26, Chenyi Qiang wrote:
>>
>>
>> On 2/26/2025 8:43 PM, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/25/2025 5:41 PM, David Hildenbrand wrote:
>>>> On 25.02.25 03:00, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>>>>> IOMMU notifiers and create e.g. VFIO mappings. The default VFIO
>>>>>>>>>> discard notifier isn't sufficient for CCA because the DMA addresses
>>>>>>>>>> need a translation (even without vIOMMU).
>>>>>>>>>>
>>>>>>>>>> At the moment:
>>>>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>>>>> * the populate() notifier calls the IOMMU notifiers
>>>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>>>>
>>>>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>>>>> operations because it uses the old state bitmap to know which pages
>>>>>>>>>> need the populate() notifier.
>>>>>>>>>
>>>>>>>>> I assume we talk about this code: [1]
>>>>>>>>>
>>>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qi...@intel.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>>>>>>>>> +                                         uint64_t offset, uint64_t size,
>>>>>>>>> +                                         bool shared_to_private)
>>>>>>>>> +{
>>>>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>>>>> +    int ret = 0;
>>>>>>>>> +
>>>>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>>>>> +                     __func__, offset, size);
>>>>>>>>> +        return -1;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if ((shared_to_private &&
>>>>>>>>> +         memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>>>>>> +        (!shared_to_private &&
>>>>>>>>> +         memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>>>>>> +        return 0;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if (shared_to_private) {
>>>>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>>>>> +    } else {
>>>>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if (!ret) {
>>>>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>>>>> +        unsigned long nbits = size / block_size;
>>>>>>>>> +
>>>>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>>>>> +
>>>>>>>>> +        if (shared_to_private) {
>>>>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>>> +        } else {
>>>>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>>>>> +        }
>>>>>>>>> +
>>>>>>>>> +        return 0;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return ret;
>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> Why?
>>>>>>>>>
>>>>>>>>> We just checked that it's all in the expected state, no?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>>>>> have to do it here?
>>>>>>>>
>>>>>>>> I was concerned about the case where the guest issues a request for
>>>>>>>> which only part of the range is in the desired state.
>>>>>>>> I think the main problem is the policy for the guest conversion
>>>>>>>> request. My current handling is:
>>>>>>>>
>>>>>>>> 1. When a conversion request is made for a range already in the
>>>>>>>> desired state, the helper simply returns success.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> 2. For requests involving a range partially in the desired state,
>>>>>>>> only the necessary segments are converted, ensuring the entire
>>>>>>>> range efficiently complies with the request.
>>>>>>>
>>>>>>>
>>>>>>> Ah, now I get:
>>>>>>>
>>>>>>> +    if ((shared_to_private &&
>>>>>>> +         memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>>>> +        (!shared_to_private &&
>>>>>>> +         memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>>
>>>>>>> We're not failing if it might already partially be in the other
>>>>>>> state.
>>>>>>>
>>>>>>>> 3. In scenarios where a conversion request is declined by other
>>>>>>>> systems, such as a failure from VFIO during notify_populate(), the
>>>>>>>> helper will roll back the request, maintaining consistency.
>>>>>>>>
>>>>>>>> And the policy of virtio-mem is to refuse the state change if not
>>>>>>>> all blocks are in the opposite state.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> Actually, this part is still uncertain to me.
>>>>>>>>
>>>>>>>
>>>>>>> IIUC, the problem does not exist if we only convert a single page
>>>>>>> at a time.
>>>>>>>
>>>>>>> Is there a known use case where such partial conversions could happen?
>>>>>>
>>>>>> I don't see such a case yet. Actually, I'm trying to follow the
>>>>>> behavior of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page
>>>>>> conversion. In KVM, it doesn't reject the request if the whole range
>>>>>> isn't in the opposite state; it just uses xa_store() to update it.
>>>>>> Also, I don't see the spec saying how to handle such a case. To be
>>>>>> robust, I just allow this special case.
>>>>>>
>>>>>>>
>>>>>>>> BTW, per the status/bitmap tracking, virtio-mem also changes the
>>>>>>>> bitmap after the plug/unplug notifier. This is the same, correct?
>>>>>>> Right. But because we reject these partial requests, we don't have to
>>>>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>>>>
>>>>>> Yes, if we treat it as a guest error/bug, we can adjust it.
>>>>>
>>>>> Hi David, which option do you think is better? If you prefer to
>>>>> reject the partial requests, I'll change it in my next version.
>>>>
>>>> Hi,
>>>>
>>>> still scratching my head. Having to work around it as in this patch
>>>> here is suboptimal.
>>>>
>>>> Could we simplify the whole thing while still allowing for (unexpected)
>>>> partial conversions?
>>>>
>>>> Essentially: if states are mixed, fall back to a "1 block at a time"
>>>> handling.
>>>>
>>>> The only problem is: what to do if we fail halfway through? Well, we
>>>> can only have such partial completions for "populate", not for discard.
>>>>
>>>> Option a) Just leave it as a "partially completed populate" and return
>>>> the error. The bitmap and the notifiers are consistent.
>>>>
>>>> Option b) Just discard everything: someone tried to convert something
>>>> "partially shared" to "shared". So maybe, if anything goes wrong, we
>>>> can just have "all private".
>>>>
>>>> The question is also what the expectation from the caller is: can the
>>>> caller even make progress on failure, or do we have to retry until it
>>>> works?
>>>
>>> Yes, that's the key problem.
>>>
>>> For the core mm side conversion, the caller (guest) handles three
>>> cases: success, failure, and retry. The guest can continue on failure
>>> but will keep the memory in its original attribute and trigger some
>>> call trace. On the QEMU side, it would cause a VM stop if
>>> kvm_set_memory_attributes() failed.
>>>
>>> As for the VFIO conversion, at present we allow it to fail and don't
>>> return an error code to the guest as long as we undo the conversion.
>>> It only causes the device to not work in the guest.
>>>
>>> I think if we view the attribute mismatch between core mm and IOMMU as
>>> a fatal error, we can stop the VM or let the guest retry until it
>>> converts successfully.
>>>
>>
>> Just thinking more about the options for the failure-case handling,
>> theoretically, as we haven't hit such a state_change() failure yet:
>>
>> 1. Undo + return invalid error
>> Pros: The guest can make progress.
>> Cons: Complicated undo operations. Option a) is not applicable because
>> it leaves a partially completed populate while the guest thinks the
>> operation has failed.
>> We also need to add an undo for set_memory_attribute() after
>> state_change() fails. Maybe also apply the attribute bitmap to the
>> set_memory_attribute() operation to handle the mixed request case.
>>
>> 2. Undo in VFIO and no undo for set_memory_attribute() + return success
>> (current approach in my series)
>> Pros: The guest can make progress although the device doesn't work.
>> Cons: The attribute bitmap only tracks the status in the IOMMU.
>
> Right, we should avoid that. Bitmap + notifiers should stay in sync.
>
>>
>> 3. No undo + return retry
>> Pros: It keeps the attribute bitmap aligned between core mm and the IOMMU.
>> Cons: The guest doesn't know how to handle the retry; it could cause an
>> infinite loop.
>>
>> 4. No undo + no return; just stop the VM.
>> Pros: Simple.
>> Cons: Maybe overkill.
>>
>> Maybe option 1 or 4 is better?
>
> Well, we can get proper undo working by using a temporary bitmap when
> converting to shared and we run into this "mixed" scenario.
>
> Something like this on top of your work:
LGTM. I'll choose option 1 and add the change in my next version. Thanks
a lot!
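
P.S. To make the temporary-bitmap idea concrete for myself, here is a
stand-alone toy model of option 1 (convert one block at a time, undo on
failure). All names (ToyManager, toy_notify_populate, ...) are made up for
illustration; this is not the actual QEMU patch or the diff above:

```c
#include <stdbool.h>

/* Toy model: convert private -> shared one block at a time, recording
 * progress in a temporary bitmap so a failure halfway through can be
 * fully undone.  Hypothetical names, not QEMU's MemoryAttributeManager. */

#define TOY_NBLOCKS 8

typedef struct {
    bool shared[TOY_NBLOCKS];   /* stands in for mgr->shared_bitmap */
} ToyManager;

/* Stand-in for the VFIO/IOMMU populate notifier; fails for blocks
 * flagged in fail_at so the rollback path can be exercised. */
static int toy_notify_populate(int block, const bool *fail_at)
{
    return fail_at[block] ? -1 : 0;
}

/* Stand-in for the discard notifier. */
static void toy_notify_discard(int block)
{
    (void)block;
}

/* Convert [first, first + n) to shared.  Blocks already shared are
 * skipped (the "mixed" case).  On failure, discard only the blocks this
 * call populated, so the bitmap and the notifiers stay in sync. */
static int toy_convert_to_shared(ToyManager *mgr, int first, int n,
                                 const bool *fail_at)
{
    bool touched[TOY_NBLOCKS] = { false };  /* temporary bitmap */

    for (int i = first; i < first + n; i++) {
        if (mgr->shared[i]) {
            continue;                       /* already in desired state */
        }
        if (toy_notify_populate(i, fail_at) != 0) {
            /* Roll back exactly what this call changed, nothing more. */
            for (int j = first; j < i; j++) {
                if (touched[j]) {
                    toy_notify_discard(j);
                    mgr->shared[j] = false;
                }
            }
            return -1;
        }
        touched[i] = true;
        mgr->shared[i] = true;
    }
    return 0;
}
```

With a pre-shared block in the middle of the range and a populate failure
later on, the rollback clears only the blocks this call populated, so the
bitmap ends up exactly as it was before the call.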