>-----Original Message-----
>From: Eric Auger <eric.au...@redhat.com>
>Sent: Wednesday, July 5, 2023 2:20 PM
>Subject: Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device
>assignment
>
>Hi Zhenzhong,
>On 7/5/23 06:52, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.au...@redhat.com>
>>> Sent: Tuesday, July 4, 2023 7:15 PM
>>> Subject: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device
>>> assignment
>>>
>>> When running on a 64kB page size host and protecting a VFIO device
>>> with the virtio-iommu, qemu crashes with this kind of message:
>>>
>>> qemu-kvm: virtio-iommu page mask 0xfffffffffffff000 is incompatible
>>> with mask 0x20010000
>> Does 0x20010000 mean that only 512MB and 64KB (super page) mappings
>> are supported by the host iommu hw, and that 4KB mappings are not
>> supported?
>Yes, that's correct. In that case the host has 64kB pages and 4kB is not
>supported.
>>
>> There is a check on the guest kernel side hinting that only 4KB is
>> supported. With this patch we force viommu->pgsize_bitmap to 0x20010000
>> and fail the check below? Does this device work in the guest?
>> Please correct me if I understand this wrong.
>My guest also has 64kB pages, so the check hereafter succeeds.
>Effectively, in case your host has a larger page size than your guest,
>it fails with
>[ 2.147031] virtio-pci 0000:00:01.0: granule 0x10000 larger than
>system page size 0x1000
>[ 7.231128] ixgbevf 0000:00:02.0: granule 0x10000 larger than system
>page size 0x1000

Oh, I see, I took PAGE_SIZE as 4KB for granted.
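Just to make the arithmetic explicit for myself (a standalone
illustration, not QEMU code): the granule is the lowest set bit of a
page size mask, and the two masks from the error above resolve to
different granules:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t guest_mask = 0xfffffffffffff000ULL; /* default: 4kB and up */
    uint64_t host_mask  = 0x20010000ULL;         /* host: 64kB + 512MB  */

    /* ctz64() in QEMU; __builtin_ctzll() is its plain-C equivalent */
    int guest_granule = __builtin_ctzll(guest_mask); /* 12 -> 4kB  */
    int host_granule  = __builtin_ctzll(host_mask);  /* 16 -> 64kB */

    printf("guest granule: %dkB\n", (1 << guest_granule) / 1024);
    printf("host granule:  %dkB\n", (1 << host_granule) / 1024);
    /* 12 != 16: once the mask is frozen, the 64kB device is rejected */
    return 0;
}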
>
>>
>> static int viommu_domain_finalise(struct viommu_endpoint *vdev,
>>                                   struct iommu_domain *domain)
>> {
>>         ...
>>         viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap);
>>         if (viommu_page_size > PAGE_SIZE) {
>>                 dev_err(vdev->dev,
>>                         "granule 0x%lx larger than system page size 0x%lx\n",
>>                         viommu_page_size, PAGE_SIZE);
>>                 return -ENODEV;
>>         }
>>
>> Another question is: presume 0x20010000 does mean that only 512MB and
>> 64KB are supported. Is the host hva->hpa mapping ensured to be
>> compatible with at least 64KB mappings? If the host mmu only supports
>> 4KB mappings, or something else incompatible with 0x20010000, will
>> vfio_dma_map() fail?
>The page size mask is retrieved with the VFIO_IOMMU_GET_INFO uapi,
>which returns the host vfio_iommu_type1 vfio_iommu->pgsize_bitmap. The
>latter is initialized to the host PAGE_MASK and later restricted on
>vfio_iommu_type1_attach_group through the vfio_update_pgsize_bitmap()
>calls, so yes, both IOMMU and CPU page sizes are guaranteed to be
>compatible.

Understood, thanks for your analysis.
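For completeness, the same lookup is visible from user space; a minimal
sketch (assuming container_fd is an open /dev/vfio/vfio container on
which VFIO_SET_IOMMU has already selected the type1 IOMMU):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void print_pgsizes(int container_fd)
{
    struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

    if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info) < 0) {
        perror("VFIO_IOMMU_GET_INFO");
        return;
    }
    if (info.flags & VFIO_IOMMU_INFO_PGSIZES) {
        /* e.g. 0x20010000 on a 64kB page size host */
        printf("iova_pgsizes: 0x%llx\n",
               (unsigned long long)info.iova_pgsizes);
    }
}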
>
>>
>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>
>>> This is due to the fact the IOMMU MR corresponding to the VFIO device
>>> is enabled very late on domain attach, after the machine init.
>>> The device reports a minimal 64kB page size but it is too late to be
>>> applied. virtio_iommu_set_page_size_mask() fails and this causes
>>> vfio_listener_region_add() to end up with hw_error();
>>>
>>> To work around this issue, we transiently enable the IOMMU MR on
>>> machine init to collect the page size requirements and then restore
>>> the bypass state.
>>>
>>> Fixes: 90519b9053 ("virtio-iommu: Add bypass mode support to assigned
>>> device")
>>> Signed-off-by: Eric Auger <eric.au...@redhat.com>
>>> ---
>>>  include/hw/virtio/virtio-iommu.h |  2 ++
>>>  hw/virtio/virtio-iommu.c         | 30 ++++++++++++++++++++++++++++--
>>>  hw/virtio/trace-events           |  1 +
>>>  3 files changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/hw/virtio/virtio-iommu.h b/include/hw/virtio/virtio-
>>> iommu.h
>>> index 2ad5ee320b..a93fc5383e 100644
>>> --- a/include/hw/virtio/virtio-iommu.h
>>> +++ b/include/hw/virtio/virtio-iommu.h
>>> @@ -61,6 +61,8 @@ struct VirtIOIOMMU {
>>>      QemuRecMutex mutex;
>>>      GTree *endpoints;
>>>      bool boot_bypass;
>>> +    Notifier machine_done;
>>> +    bool granule_frozen;
>>>  };
>>>
>>>  #endif
>>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>>> index 1cd258135d..1eaf81bab5 100644
>>> --- a/hw/virtio/virtio-iommu.c
>>> +++ b/hw/virtio/virtio-iommu.c
>>> @@ -24,6 +24,7 @@
>>>  #include "hw/virtio/virtio.h"
>>>  #include "sysemu/kvm.h"
>>>  #include "sysemu/reset.h"
>>> +#include "sysemu/sysemu.h"
>>>  #include "qapi/error.h"
>>>  #include "qemu/error-report.h"
>>>  #include "trace.h"
>>> @@ -1106,12 +1107,12 @@ static int
>>> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>>>      }
>>>
>>>      /*
>>> -     * After the machine is finalized, we can't change the mask anymore. If by
>>> +     * Once the granule is frozen we can't change the mask anymore. If by
>>>       * chance the hotplugged device supports the same granule, we can still
>>>       * accept it. Having a different masks is possible but the guest will use
>>>       * sub-optimal block sizes, so warn about it.
>>>       */
>>> -    if (phase_check(PHASE_MACHINE_READY)) {
>>> +    if (s->granule_frozen) {
>>>          int new_granule = ctz64(new_mask);
>>>          int cur_granule = ctz64(cur_mask);
>>>
>>> @@ -1146,6 +1147,27 @@ static void virtio_iommu_system_reset(void
>>> *opaque)
>>>
>>>  }
>>>
>>> +static void virtio_iommu_freeze_granule(Notifier *notifier, void *data)
>>> +{
>>> +    VirtIOIOMMU *s = container_of(notifier, VirtIOIOMMU, machine_done);
>>> +    bool boot_bypass = s->config.bypass;
>>> +    int granule;
>>> +
>>> +    /*
>>> +     * Transient IOMMU MR enable to collect page_size_mask requirement
>>> +     * through memory_region_iommu_set_page_size_mask() called by
>>> +     * VFIO region_add() callback
>>> +     */
>>> +    s->config.bypass = 0;
>>> +    virtio_iommu_switch_address_space_all(s);
>>> +    /* restore default */
>>> +    s->config.bypass = boot_bypass;
>>> +    virtio_iommu_switch_address_space_all(s);
>>> +    s->granule_frozen = true;
>>> +    granule = ctz64(s->config.page_size_mask);
>>> +    trace_virtio_iommu_freeze_granule(BIT(granule));
>>> +}
>> It looks a bit heavy just in order to get the page_size_mask from the
>> host side. But maybe this is the only way with the current interface.
>
>The problem comes from the fact that the regions are aliased due to the
>bypass, and vfio_listener_region_add() does not get a chance to be
>called until the actual domain attach. So I do not see any other way
>than to transiently enable the region.

Agree.

>At least I could check if boot bypass is set before doing that dance. I
>will respin with that.

Makes sense.

Thanks
Zhenzhong
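PS: if it helps, this is roughly how I pictured the boot-bypass check,
purely as a sketch against the patch above (the actual respin may of
course differ):

static void virtio_iommu_freeze_granule(Notifier *notifier, void *data)
{
    VirtIOIOMMU *s = container_of(notifier, VirtIOIOMMU, machine_done);
    int granule;

    if (s->config.bypass) {
        /*
         * Transient IOMMU MR enable to collect the page_size_mask
         * requirements through memory_region_iommu_set_page_size_mask(),
         * called by the VFIO region_add() callback. Only needed when
         * bypass is on: otherwise the IOMMU MR is already enabled and
         * region_add() has already run.
         */
        s->config.bypass = 0;
        virtio_iommu_switch_address_space_all(s);
        /* restore the boot bypass default */
        s->config.bypass = 1;
        virtio_iommu_switch_address_space_all(s);
    }
    s->granule_frozen = true;
    granule = ctz64(s->config.page_size_mask);
    trace_virtio_iommu_freeze_granule(BIT(granule));
}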