On Thu, Jan 08, 2026 at 10:38:04AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 07, 2026 at 07:36:44PM -0800, Alex Mastro wrote:
> > This was inspired by QEMU's hw/vfio/region.c which also does this rounding
> > up
> > of size to the next power of two [1].
> >
> > I'm now realizing that's only necessary for regions with
> > VFIO_REGION_INFO_CAP_SPARSE_MMAP where there are multiple mmaps per region,
> > and
> > each mmap's size is less than the size of the BAR. Here, since we're
> > mapping the
> > entire BAR which must be pow2, it shouldn't be necessary.
>
> You only need to do this dance if you care about having large PTEs
> under the VMAs, which is probably something worth testing both
> scenarios.
Yep, makes sense. The test takes a long time to run without this due potentially
faulting a 128G BAR region 4K at a time during VFIO_IOMMU_MAP_DMA.
>
> > The intent of QEMU's mmap alignment code is imperfect in the SPARE_MMAP
> > case?
> > After a hole, the next mmap'able range could be some arbitrary page-aligned
> > offset into the region. It's not helpful mmap some region offset which is
> > maximally 4K-aligned at a 1G-aligned vaddr.
> >
> > I think to be optimal, QEMU should be attempting to align the vaddr for bar
> > mmaps such that
> >
> > vaddr % {2M,1G} == region_offset % {2M,1G}
> >
> > Would love someone to sanity check me on this. Kind of a diversion.
>
> What you write is correct. Ankit recently discovered this bug in
> qemu. It happens not just with SPARSE_MMAP but also when mmmaping
> around the MSI-X hole..
Is my mental model broken? I thought MSI-X holes in a VFIO-exposed BAR region
implied SPARSE_MMAP? I didn't think there was another way for the uapi to
express hole-yness.
>
> I also advocated for what you write here that qemu should ensure:
>
> vaddr % region_size == region_offset % region_size
Why region_size out of curiosity? Assuming perfect knowledge of kernel internals
I would have expected something like this:
diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index ca75ab1be4..1d8595e808 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -238,6 +238,18 @@ static void vfio_subregion_unmap(VFIORegion *region, int
index)
region->mmaps[index].mmap = NULL;
}
+/*
+ * Return the next value greater than or equal to `input` such that
+ * (value % align) == offset.
+ */
+static size_t align_offset(size_t input, size_t offset, size_t align)
+{
+ size_t remainder = input % align;
+ size_t delta = (align + offset - remainder) % align;
+
+ return input + delta;
+}
+
int vfio_region_mmap(VFIORegion *region)
{
int i, ret, prot = 0;
@@ -252,7 +264,11 @@ int vfio_region_mmap(VFIORegion *region)
prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
for (i = 0; i < region->nr_mmaps; i++) {
- size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
+ size_t size = region->mmaps[i].size;
+ size_t offs = region->mmaps[i].offset;
+ size_t align = size >= GiB ? GiB :
+ size >= 2 * MiB ? 2 * MiB :
+ getpagesize();
void *map_base, *map_align;
/*
@@ -275,7 +291,7 @@ int vfio_region_mmap(VFIORegion *region)
fd = vfio_device_get_region_fd(region->vbasedev, region->nr);
- map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
+ map_align = (void *)align_offset((size_t)map_base, offs % align,
align);
munmap(map_base, map_align - map_base);
munmap(map_align + region->mmaps[i].size,
align - (map_align - map_base));