On 08/09/2023 07:11, Duan, Zhenzhong wrote:
> Hi Joao,
>
> On 6/23/2023 5:48 AM, Joao Martins wrote:
>> Currently, device dirty page tracking with vIOMMU is not supported,
>> and a blocker is added and the migration is prevented.
>>
>> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
>> requesting by the vIOMMU. These IOVA ranges can potentially be mapped
>> anywhere in the vIOMMU IOVA space as advertised by the VMM.
>>
>> To support device dirty tracking when vIOMMU enabled instead create the
>> dirty ranges based on the vIOMMU provided limits, which leads to the
>> tracking of the whole IOVA space regardless of what devices use.
>>
>> Signed-off-by: Avihai Horon <avih...@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> ---
>>  include/hw/vfio/vfio-common.h |  1 +
>>  hw/vfio/common.c              | 58 +++++++++++++++++++++++++++++------
>>  hw/vfio/pci.c                 |  7 +++++
>>  3 files changed, 56 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index f41860988d6b..c4bafad084b4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>>  typedef struct VFIOAddressSpace {
>>      AddressSpace *as;
>>      bool no_dma_translation;
>> +    hwaddr max_iova;
>>      QLIST_HEAD(, VFIOContainer) containers;
>>      QLIST_ENTRY(VFIOAddressSpace) list;
>>  } VFIOAddressSpace;
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ecfb9afb3fb6..85fddef24026 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>>      return false;
>>  }
>>
>> +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>> +{
>> +    VFIOAddressSpace *space;
>> +
>> +    *max_iova = 0;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        if (space->as == &address_space_memory) {
>> +            continue;
>> +        }
>
> Just curious why address_space_memory is bypassed?
>
The address_space_memory part is already handled by the memory listener,
but I see your point conceptually on not considering it here.

> Imagine two vfio devices linked to two host bridge, one has bypass_iommu
> set and the other not. Don't we need to include the guest memory ranges
> in address_space_memory?

I was probably making a bad assumption that the vIOMMU maximum address
space is a superset of the CPU address space. But as you just reminded me,
there is a use case where all it takes is one bypass_iommu=true device on
a 2T guest setup with aw-bits=39. I'll rework this to consider that case.

>> +
>> +        if (*max_iova < space->max_iova) {
>> +            *max_iova = space->max_iova;
>> +        }
>> +    }
>> +
>> +    return *max_iova == 0;
>> +}
>> +
>>  int vfio_block_giommu_migration(Error **errp)
>>  {
>>      int ret;
>> @@ -1464,10 +1483,11 @@ static const MemoryListener vfio_dirty_tracking_listener = {
>>      .region_add = vfio_listener_dirty_tracking_update,
>>  };
>>
>> -static void vfio_dirty_tracking_init(VFIOContainer *container,
>> +static int vfio_dirty_tracking_init(VFIOContainer *container,
>>                                       VFIODirtyRanges *ranges)
>>  {
>>      VFIODirtyRangesListener dirty;
>> +    int ret;
>>
>>      memset(&dirty, 0, sizeof(dirty));
>>      dirty.ranges.min32 = UINT32_MAX;
>> @@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
>>      dirty.listener = vfio_dirty_tracking_listener;
>>      dirty.container = container;
>>
>> -    memory_listener_register(&dirty.listener,
>> -                             container->space->as);
>> +    if (vfio_viommu_preset()) {
>> +        hwaddr iommu_max_iova;
>> +
>> +        ret = vfio_viommu_get_max_iova(&iommu_max_iova);
>> +        if (ret) {
>> +            return -EINVAL;
>> +        }
>> +
>> +        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
>> +    } else {
>> +        memory_listener_register(&dirty.listener,
>> +                                 container->space->as);
>> +        /*
>> +         * The memory listener is synchronous, and used to calculate the range
>> +         * to dirty tracking. Unregister it after we are done as we are not
>> +         * interested in any follow-up updates.
>> +         */
>> +        memory_listener_unregister(&dirty.listener);
>> +    }
>>
>>      *ranges = dirty.ranges;
>>
>> -    /*
>> -     * The memory listener is synchronous, and used to calculate the range
>> -     * to dirty tracking. Unregister it after we are done as we are not
>> -     * interested in any follow-up updates.
>> -     */
>> -    memory_listener_unregister(&dirty.listener);
>> +    return 0;
>>  }
>>
>>  static void vfio_devices_dma_logging_stop(VFIOContainer *container)
>> @@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>>      VFIOGroup *group;
>>      int ret = 0;
>>
>> -    vfio_dirty_tracking_init(container, &ranges);
>> +    ret = vfio_dirty_tracking_init(container, &ranges);
>> +    if (ret) {
>> +        error_report("Failed to init DMA logging ranges, err %d",
>> +                     ret);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>>      feature = vfio_device_feature_dma_logging_start_create(container,
>>                                                              &ranges);
>
> No clear how much dirty range size could impact device dirty tracking.
> Maybe some devices linking to vIOMMU with small aw_bits or bypassing vIOMMU

So, my intended use case with this series starts with DMA_TRANSLATION=off,
where the vIOMMU is restricted to interrupt remapping and the guest only
uses it for that. Right now this is only supported by intel-iommu, but my
understanding is that it is also possible with the AMD IOMMU (haven't
checked smmu-v3). This unblocks guests with more than 255 vCPUs. I have
another patch, separate from this, that hopefully relaxes the vIOMMU
migration blocker for these older guests.
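Something along these lines is roughly what I mean for the interrupt
remapping only case (an untested sketch, reusing the no_dma_translation
flag from the header change above; the helper name is just for
illustration):

/*
 * Untested sketch: report whether any vIOMMU-backed address space can
 * actually translate DMA.  If all of them have DMA translation disabled
 * (interrupt remapping only), the guest cannot program IOVAs, so the
 * existing listener-based range calculation in vfio_dirty_tracking_init()
 * stays sufficient and the vIOMMU-limit path isn't needed.
 */
static bool vfio_viommu_translates_dma(void)
{
    VFIOAddressSpace *space;

    QLIST_FOREACH(space, &vfio_address_spaces, list) {
        if (space->as != &address_space_memory &&
            !space->no_dma_translation) {
            return true;
        }
    }

    return false;
}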
Now with real emulated vIOMMU DMA translation usage, intel-iommu only
offers 39, 48 or 57 bits of address space width (mimicking the page table
levels). And it's true that only the first value is supportable, which is
somewhat limiting as you say. virtio-iommu has more going for it there, as
you can limit the input IOVA range to the size of the CPU address space.
Eric's latest series [0] is actually nice in fixing that issue (the
alternative I had was to introduce a gaw-bits equivalent, but it's better
done like [0]).

[0] https://lore.kernel.org/qemu-devel/20230904080451.424731-1-eric.au...@redhat.com/

> with small guest memory could benefit if we use per address space's dirty
> range

Yeah, I'll need to fix that for the case where there's big guest memory
and a small vIOMMU address space. Hopefully I picked up on your comment
right :)
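To make that concrete, roughly what I have in mind for
vfio_dirty_tracking_init() is something like the below (untested sketch,
reusing the helpers from this patch): always let the listener pick up the
guest memory ranges, which also covers the bypass_iommu=true case, and
then stretch the ranges up to the vIOMMU limit when one is present.

static int vfio_dirty_tracking_init(VFIOContainer *container,
                                    VFIODirtyRanges *ranges)
{
    VFIODirtyRangesListener dirty;

    memset(&dirty, 0, sizeof(dirty));
    dirty.ranges.min32 = UINT32_MAX;
    dirty.ranges.min64 = UINT64_MAX;
    dirty.listener = vfio_dirty_tracking_listener;
    dirty.container = container;

    /*
     * The listener is synchronous, so registering it is enough to walk
     * the current memory ranges; guest RAM is always accounted for.
     */
    memory_listener_register(&dirty.listener, container->space->as);
    memory_listener_unregister(&dirty.listener);

    if (vfio_viommu_preset()) {
        hwaddr iommu_max_iova;

        if (vfio_viommu_get_max_iova(&iommu_max_iova)) {
            return -EINVAL;
        }

        /* Also cover the whole vIOMMU-advertised IOVA space. */
        vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
    }

    *ranges = dirty.ranges;

    return 0;
}

That way a bypass_iommu=true device on a big-memory guest keeps being
covered by the guest RAM ranges, and the pure vIOMMU case degrades to what
this patch already does.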