On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>
>
> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>> IOMMU
>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>> translation (even without vIOMMU).
>>>>>
>>>>> At the moment:
>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>> * the populate() notifier calls the IOMMU notifiers
>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>>> * it calls ram_discard_manager_is_populated(), which fails.
>>>>>
>>>>> guest_memfd_state_change() only changes the section's state after
>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>> operations because it uses the old state bitmap to know which pages need
>>>>> the populate() notifier.
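
(A minimal sketch of the failing check, using QEMU's RamDiscardManager
API; the handler name and the error value are made up, this is not the
actual CCA code:)

static int cca_populate_notify(RamDiscardListener *rdl,
                               MemoryRegionSection *section)
{
    RamDiscardManager *rdm =
        memory_region_get_ram_discard_manager(section->mr);

    /*
     * guest_memfd_state_change() has not flipped its bitmap for this
     * range yet, so the manager still reports the range as discarded
     * even though we are being asked to populate exactly this range.
     */
    if (!ram_discard_manager_is_populated(rdm, section)) {
        return -EBUSY;  /* the failure memory_get_xlat_addr() runs into */
    }

    /* ... translate the DMA address and create the VFIO mapping ... */
    return 0;
}
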
>>>>
>>>> I assume we talk about this code: [1]
>>>>
>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qi...@intel.com
>>>>
>>>>
>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr,
>>>> +                                         uint64_t offset, uint64_t size,
>>>> +                                         bool shared_to_private)
>>>> +{
>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>> +    int ret = 0;
>>>> +
>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>> +                     __func__, offset, size);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if ((shared_to_private &&
>>>> +         memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>> +        (!shared_to_private &&
>>>> +         memory_attribute_is_range_populated(mgr, offset, size))) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    if (shared_to_private) {
>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>> +    } else {
>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>> +    }
>>>> +
>>>> +    if (!ret) {
>>>> +        unsigned long first_bit = offset / block_size;
>>>> +        unsigned long nbits = size / block_size;
>>>> +
>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>> +
>>>> +        if (shared_to_private) {
>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>> +        } else {
>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>> +        }
>>>> +
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>>
>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>>
>>>> Why?
>>>>
>>>> We just checked that it's all in the expected state, no?
>>>>
>>>>
>>>> virtio-mem doesn't handle it that way, so I'm curious why we would have
>>>> to do it here?
>>>
>>> I was concerned about the case where the guest issues a request where
>>> only part of the range is in the desired state.
>>> I think the main problem is the policy for the guest conversion request.
>>> My current handling is:
>>>
>>> 1. When a conversion request is made for a range already in the desired
>>> state, the helper simply returns success.
>>
>> Yes.
>>
>>> 2. For requests involving a range partially in the desired state, only
>>> the necessary segments are converted, so that the entire range ends up
>>> in the requested state.
>>
>>
>> Ah, now I get:
>>
>> +    if ((shared_to_private &&
>> +         memory_attribute_is_range_discarded(mgr, offset, size)) ||
>> +        (!shared_to_private &&
>> +         memory_attribute_is_range_populated(mgr, offset, size))) {
>> +        return 0;
>> +    }
>> +
>>
>> We don't fail if the range is already partially in the requested state.
>>
>>> 3. In scenarios where a conversion request is declined by other systems,
>>> such as a failure from VFIO during notify_populate(), the helper will
>>> roll back the request, maintaining consistency.
>>>
>>> And the policy of virtio-mem is to refuse the state change if not all
>>> blocks are in the opposite state.
>>
>> Yes.
>>
>>>
>>> Actually, this part is still uncertain to me.
>>>
>>
>> IIUC, the problem does not exist if we only convert a single page at a
>> time.
>>
>> Is there a known use case where such partial conversions could happen?
>
> I don't see such a case yet. Actually, I'm trying to follow the behavior
> of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it
> doesn't reject the request if the whole range isn't in the opposite
> state; it just uses xa_store() to update it. Also, I don't see the spec
> saying how to handle such a case. To be robust, I just allow this special
> case.
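
A rough sketch of that "accept partially converted ranges" policy, reusing
the helper names from the patch above (notify_populate_one_block() is a
made-up per-block helper and the rollback is simplified):

static int sketch_notify_populate(MemoryAttributeManager *mgr,
                                  uint64_t offset, uint64_t size)
{
    int block_size = memory_attribute_manager_get_block_size(mgr);
    unsigned long first_bit = offset / block_size;
    unsigned long nbits = size / block_size;
    unsigned long bit;
    int ret;

    for (bit = first_bit; bit < first_bit + nbits; bit++) {
        if (test_bit(bit, mgr->shared_bitmap)) {
            continue;               /* already shared, nothing to notify */
        }
        ret = notify_populate_one_block(mgr, bit * block_size, block_size);
        if (ret) {
            /*
             * Roll back to keep the state consistent; a real version
             * would only discard the blocks populated by this loop.
             */
            if (bit > first_bit) {
                memory_attribute_notify_discard(mgr, offset,
                                                bit * block_size - offset);
            }
            return ret;
        }
    }
    return 0;
}

This is why notify_populate() has to walk the bitmap: with partial requests
allowed, it cannot assume every block in the range still needs converting.
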
>
>>
>>> BTW, regarding the status/bitmap tracking, virtio-mem also changes the
>>> bitmap after the plug/unplug notifier. This is the same, correct?
>> Right. But because we reject these partial requests, we don't have to
>> traverse the bitmap and could just adjust the bitmap operations.
>
> Yes, if we treat it as a guest error/bug, we can adjust it.
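
For comparison, a rough sketch of the strict (virtio-mem-style) variant,
again reusing the helper names from the quoted patch; the "already fully
in the requested state" early return from the patch would still run
before this, and -EINVAL is just a placeholder error value:

static int sketch_state_change_strict(MemoryAttributeManager *mgr,
                                      uint64_t offset, uint64_t size,
                                      bool shared_to_private)
{
    int block_size = memory_attribute_manager_get_block_size(mgr);
    unsigned long first_bit = offset / block_size;
    unsigned long nbits = size / block_size;
    int ret;

    if (shared_to_private) {
        /* refuse unless the whole range is currently shared */
        if (!memory_attribute_is_range_populated(mgr, offset, size)) {
            return -EINVAL;
        }
        memory_attribute_notify_discard(mgr, offset, size);
        bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
        return 0;
    }

    /* refuse unless the whole range is currently private */
    if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
        return -EINVAL;
    }
    ret = memory_attribute_notify_populate(mgr, offset, size);
    if (!ret) {
        bitmap_set(mgr->shared_bitmap, first_bit, nbits);
    }
    return ret;
}

With this policy the notifiers can assume the whole range flips, so
notify_populate() wouldn't need to walk the bitmap at all.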

Hi David, which option do you think is better? If you prefer to reject the
partial requests, I'll change it in my next version.