On Thu, Jan 08, 2026 at 10:38:04AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 07, 2026 at 07:36:44PM -0800, Alex Mastro wrote:
> > This was inspired by QEMU's hw/vfio/region.c which also does this rounding up
> > of size to the next power of two [1].
> > 
> > I'm now realizing that's only necessary for regions with
> > VFIO_REGION_INFO_CAP_SPARSE_MMAP where there are multiple mmaps per region, and
> > each mmap's size is less than the size of the BAR. Here, since we're mapping the
> > entire BAR which must be pow2, it shouldn't be necessary.
> 
> You only need to do this dance if you care about having large PTEs
> under the VMAs, so it is probably worth testing both scenarios.

Yep, makes sense. The test takes a long time to run without this due to
potentially faulting a 128G BAR region 4K at a time during VFIO_IOMMU_MAP_DMA.
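
For a rough sense of scale, the arithmetic can be sketched as below. This is
only an illustration: mappings_needed() is a hypothetical helper, and it assumes
one fault per PTE-sized chunk, which simplifies what the kernel actually does.

```c
#include <assert.h>
#include <stdint.h>

#define KiB (1ULL << 10)
#define MiB (1ULL << 20)
#define GiB (1ULL << 30)

/* Hypothetical helper: chunks of chunk_size needed to cover bar_size. */
static uint64_t mappings_needed(uint64_t bar_size, uint64_t chunk_size)
{
    return (bar_size + chunk_size - 1) / chunk_size;
}

/* Assumes one fault per chunk; these are counts, not measurements. */
static void mappings_needed_examples(void)
{
    assert(mappings_needed(128 * GiB, 4 * KiB) == 33554432); /* 2^25 */
    assert(mappings_needed(128 * GiB, 2 * MiB) == 65536);    /* 2^16 */
    assert(mappings_needed(128 * GiB, 1 * GiB) == 128);
}
```

Under that assumption, 4K PTEs need about 32M faults for a 128G BAR, versus 128
with 1G PTEs.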

> 
> > The intent of QEMU's mmap alignment code is imperfect in the SPARSE_MMAP case?
> > After a hole, the next mmap'able range could be some arbitrary page-aligned
> > offset into the region. It's not helpful to mmap some region offset which is
> > maximally 4K-aligned at a 1G-aligned vaddr.
> > 
> > I think to be optimal, QEMU should be attempting to align the vaddr for bar
> > mmaps such that
> > 
> > vaddr % {2M,1G} == region_offset % {2M,1G}
> > 
> > Would love someone to sanity check me on this. Kind of a diversion.
> 
> What you write is correct. Ankit recently discovered this bug in
> qemu. It happens not just with SPARSE_MMAP but also when mmapping
> around the MSI-X hole.

Is my mental model broken? I thought MSI-X holes in a VFIO-exposed BAR region
implied SPARSE_MMAP? I didn't think there was another way for the uapi to
express hole-yness.

> 
> I also advocated for what you write here that qemu should ensure:
> 
>   vaddr % region_size == region_offset % region_size

Why region_size out of curiosity? Assuming perfect knowledge of kernel internals
I would have expected something like this:

diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index ca75ab1be4..1d8595e808 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -238,6 +238,18 @@ static void vfio_subregion_unmap(VFIORegion *region, int index)
     region->mmaps[index].mmap = NULL;
 }
 
+/*
+ * Return the next value greater than or equal to `input` such that
+ * (value % align) == offset.
+ */
+static size_t align_offset(size_t input, size_t offset, size_t align)
+{
+    size_t remainder = input % align;
+    size_t delta = (align + offset - remainder) % align;
+
+    return input + delta;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
     int i, ret, prot = 0;
@@ -252,7 +264,11 @@ int vfio_region_mmap(VFIORegion *region)
     prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
     for (i = 0; i < region->nr_mmaps; i++) {
-        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
+        size_t size = region->mmaps[i].size;
+        size_t offs = region->mmaps[i].offset;
+        size_t align = size >= GiB ? GiB :
+                       size >= 2 * MiB ? 2 * MiB :
+                       getpagesize();
         void *map_base, *map_align;
 
         /*
@@ -275,7 +291,7 @@ int vfio_region_mmap(VFIORegion *region)
 
         fd = vfio_device_get_region_fd(region->vbasedev, region->nr);
 
-        map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
+        map_align = (void *)align_offset((size_t)map_base, offs % align, align);
         munmap(map_base, map_align - map_base);
         munmap(map_align + region->mmaps[i].size,
                align - (map_align - map_base));
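
To sanity check the helper outside of QEMU, a standalone sketch could look like
the following. It mirrors align_offset() from the diff but uses uint64_t so the
example addresses fit regardless of the host's size_t, and it assumes
offset < align (which holds when the caller passes offs % align). The self-test
function and the addresses in it are made up for illustration and are not part
of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest value >= input such that (value % align) == offset.
 * Assumes offset < align.
 */
static uint64_t align_offset(uint64_t input, uint64_t offset, uint64_t align)
{
    uint64_t remainder = input % align;
    uint64_t delta = (align + offset - remainder) % align;

    return input + delta;
}

/* Hypothetical self-check with made-up addresses. */
static void align_offset_selftest(void)
{
    const uint64_t MiB = 1ULL << 20;
    const uint64_t align = 2 * MiB;
    const uint64_t offs = 0x3000;              /* region offset after a hole */
    const uint64_t base = 0x7f1234567000ULL;   /* pretend mmap() result */
    uint64_t v = align_offset(base, offs % align, align);

    assert(v >= base);                /* never moves backwards */
    assert(v - base < align);         /* stays within the slack */
    assert(v % align == offs % align);/* vaddr and offset are congruent */

    /* Already congruent input is returned unchanged. */
    assert(align_offset(0x203000, 0x3000, align) == 0x203000);
    /* Input past the target remainder advances to the next period. */
    assert(align_offset(0x205000, 0x3000, align) == 0x403000);
    /* offset == 0 degenerates to a plain round-up. */
    assert(align_offset(0x1000, 0, align) == 0x200000);
}
```

Since the delta is always smaller than align, reserving size + align bytes is
still enough slack for the adjusted mapping.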
