From: Stanislav Kinsburskii <[email protected]> Sent: Friday,
January 2, 2026 3:35 PM
>
> On Fri, Jan 02, 2026 at 09:13:31PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <[email protected]> Sent:
> > Friday, January 2, 2026 12:03 PM
> > >
> > > On Fri, Jan 02, 2026 at 06:04:56PM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <[email protected]> Sent:
> > > > Friday, January 2, 2026 9:43 AM
> > > > >
> > > > > On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > > > > > From: Stanislav Kinsburskii <[email protected]>
> > > > > > Sent: Tuesday, December 23, 2025 8:26 AM
> > > > > > >
> > > > > > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > > > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > > > > > >
> > > > > > > > [snip]
> > > > > > > > >
> > > > > > > > > Separately, in looking at this, I spotted another potential
> > > > > > > > > problem with 2 Meg mappings that somewhat depends on
> > > > > > > > > hypervisor behavior that I'm not clear on. To create a new
> > > > > > > > > region, the user space VMM issues the MSHV_GET_GUEST_MEMORY
> > > > > > > > > ioctl, specifying the userspace address, the size, and the
> > > > > > > > > guest PFN. The only requirement on these values is that the
> > > > > > > > > userspace address and size be page aligned. But suppose a
> > > > > > > > > 4 Meg region is specified where the userspace address and the
> > > > > > > > > guest PFN have different offsets modulo 2 Meg. The userspace
> > > > > > > > > address range gets populated first, and may contain a 2 Meg
> > > > > > > > > large page. Then when mshv_chunk_stride() detects a 2 Meg
> > > > > > > > > aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told to
> > > > > > > > > create a 2 Meg mapping for the guest, the corresponding
> > > > > > > > > system PFN in the page array may not be 2 Meg aligned. What
> > > > > > > > > does the hypervisor do in this case? It can't create a 2 Meg
> > > > > > > > > mapping, right? So does it silently fall back to creating 4K
> > > > > > > > > mappings, or does it return an error? Returning an error
> > > > > > > > > would seem to be problematic for movable pages because the
> > > > > > > > > error wouldn't occur until the guest VM is running and takes
> > > > > > > > > a range fault on the region. Silently falling back to
> > > > > > > > > creating 4K mappings has performance implications, though I
> > > > > > > > > guess it would work. My question is whether the
> > > > > > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and
> > > > > > > > > return an error immediately.
> > > > > > > > >
> > > > > > > >
> > > > > > > > In thinking about this more, I can answer my own question
> > > > > > > > about the hypervisor behavior. When HV_MAP_GPA_LARGE_PAGE is
> > > > > > > > set, the full list of 4K system PFNs is not provided as an
> > > > > > > > input to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor
> > > > > > > > cannot silently fall back to 4K mappings. Assuming sequential
> > > > > > > > PFNs would be wrong, so it must return an error if a system
> > > > > > > > PFN isn't on a 2 Meg boundary.
> > > > > > > >
> > > > > > > > For a pinned region, this error happens in mshv_region_map()
> > > > > > > > as called from mshv_prepare_pinned_region(), so will propagate
> > > > > > > > back to the ioctl. But the error happens only if
> > > > > > > > pin_user_pages_fast() allocates one or more 2 Meg pages. So
> > > > > > > > creating a pinned region where the guest PFN and userspace
> > > > > > > > address have different offsets modulo 2 Meg might or might
> > > > > > > > not succeed.
> > > > > > > >
> > > > > > > > For a movable region, the error probably can't occur.
> > > > > > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > > > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > > > > > determines the corresponding userspace addr, which won't be on
> > > > > > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > > > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > > > > > always do 4K mappings and will succeed. The downside is that a
> > > > > > > > movable region with a guest PFN and userspace address with
> > > > > > > > different offsets never gets any 2 Meg pages or mappings.
> > > > > > > >
> > > > > > > > My conclusion is the same -- such misalignment should not be
> > > > > > > > allowed when creating a region that has the potential to use
> > > > > > > > 2 Meg pages. Regions less than 2 Meg in size could be excluded
> > > > > > > > from such a requirement if there is benefit in doing so. It's
> > > > > > > > possible to have regions up to (but not including) 4 Meg where
> > > > > > > > the alignment prevents having a 2 Meg page, and those could
> > > > > > > > also be excluded from the requirement.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not sure I understand the problem.
> > > > > > > There are three cases to consider:
> > > > > > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > > > > > 2. Host mapping, where page sizes are controlled by the host.
> > > > > >
> > > > > > And by "host", you mean specifically the Linux instance running in
> > > > > > the
> > > > > > root partition. It hosts the VMM processes and creates the memory
> > > > > > regions for each guest.
> > > > > >
> > > > > > > 3. Hypervisor mapping, where page sizes are controlled by the
> > > > > > > hypervisor.
> > > > > > >
> > > > > > > The first case is not relevant here and is included for
> > > > > > > completeness.
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > >
> > > > > > > The second and third cases (host and hypervisor) share the memory
> > > > > > > layout,
> > > > > >
> > > > > > Right. More specifically, they are both operating on the same
> > > > > > set of physical memory pages, and hence "share" a set of what
> > > > > > I've referred to as "system PFNs" (to distinguish from guest
> > > > > > PFNs, or GFNs).
> > > > > >
> > > > > > > but it is up to each entity to decide which page sizes to use.
> > > > > > > For example, the host might map the proposed 4M region with
> > > > > > > only 4K pages, even if a 2M page is available in the middle.
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > > In this case, the host will map the memory as represented by
> > > > > > > 4K pages, but the hypervisor can still discover the 2M page in
> > > > > > > the middle and adjust its page tables to use a 2M page.
> > > > > >
> > > > > > Yes, that's possible, but subject to significant requirements.
> > > > > > A 2M page can be used only if the underlying physical memory is
> > > > > > a physically contiguous 2M chunk. Furthermore, that contiguous
> > > > > > 2M chunk must start on a physical 2M boundary, and the virtual
> > > > > > address to which it is being mapped must be on a 2M boundary.
> > > > > > In the case of the host, that virtual address is the user space
> > > > > > address in the user space process. In the case of the hypervisor,
> > > > > > that "virtual address" is the location in guest physical address
> > > > > > space; i.e., the guest PFN left-shifted by PAGE_SHIFT (12) to
> > > > > > form a guest physical address.
> > > > > >
> > > > > > These requirements are from the physical processor and its
> > > > > > requirements on page table formats as specified by the hardware
> > > > > > architecture. Whereas the page table entry for a 4K page contains
> > > > > > the entire PFN, the page table entry for a 2M page omits the low
> > > > > > order 9 bits of the PFN -- those bits must be zero, which is
> > > > > > equivalent to requiring that the PFN be on a 2M boundary. These
> > > > > > requirements apply to both host and hypervisor mappings.
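> > > > > >
> > > > > > To put that rule in code form (just an illustration -- the helper
> > > > > > name is made up and isn't anything in the series):
> > > > > >
> > > > > > #include <linux/types.h>
> > > > > > #include <linux/align.h>
> > > > > > #include <linux/pgtable.h>	/* PTRS_PER_PMD */
> > > > > >
> > > > > > /*
> > > > > >  * A PFN (system or guest) can be the base of a 2M mapping only
> > > > > >  * if its low order 9 bits are zero, i.e. it is a multiple of
> > > > > >  * 512 4K pages (PTRS_PER_PMD on x86_64).
> > > > > >  */
> > > > > > static bool pfn_is_2m_mapping_base(unsigned long pfn)
> > > > > > {
> > > > > > 	return IS_ALIGNED(pfn, PTRS_PER_PMD);	/* (pfn & 0x1ff) == 0 */
> > > > > > }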
> > > > > >
> > > > > > When MSHV code in the host creates a new pinned region via the
> > > > > > ioctl, it first allocates memory for the region using
> > > > > > pin_user_pages_fast(), which returns the system PFN for each page
> > > > > > of physical memory that is allocated. If the host, at its
> > > > > > discretion, allocates a 2M page, then a series of 512 sequential
> > > > > > 4K PFNs is returned for that 2M page, and the first of the 512
> > > > > > sequential PFNs must have its low order 9 bits be zero.
> > > > > >
> > > > > > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > > > > > the hypervisor to map the allocated memory into the guest
> > > > > > physical address space at a particular guest PFN. If the
> > > > > > allocated memory contains a 2M page, mshv_chunk_stride() will see
> > > > > > a folio order of 9 for the 2M page, causing the
> > > > > > HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that the
> > > > > > hypervisor do that mapping as a 2M large page. The hypercall does
> > > > > > not have the option of dropping back to 4K page mappings in this
> > > > > > case. If the 2M alignment of the system PFN is different from the
> > > > > > 2M alignment of the target guest PFN, it's not possible to create
> > > > > > the mapping and the hypercall fails.
> > > > > >
> > > > > > The core problem is that the same 2M of physical memory wants to
> > > > > > be mapped by the host as a 2M page and by the hypervisor as a 2M
> > > > > > page. That can't be done unless the host alignment (in the VMM
> > > > > > virtual address space) and the guest physical address (i.e., the
> > > > > > target guest PFN) alignment match and are both on 2M boundaries.
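> > > > > >
> > > > > > As a concrete illustration (the numbers and the helper name are
> > > > > > made up, not taken from anything real):
> > > > > >
> > > > > > #include <linux/mm.h>		/* PAGE_SHIFT */
> > > > > > #include <linux/pgtable.h>	/* PMD_SIZE */
> > > > > >
> > > > > > /*
> > > > > >  * userspace_addr = 0x7f0000100000 -> offset 0x100000 into its 2M region
> > > > > >  * guest_pfn      = 0x100000       -> GPA 0x100000000, offset 0x0
> > > > > >  *
> > > > > >  * The offsets modulo 2M differ, so no 2M chunk of physical memory
> > > > > >  * can be 2M aligned in the VMM virtual address space and in the
> > > > > >  * guest physical address space at the same time.
> > > > > >  */
> > > > > > static bool region_offsets_match_2m(unsigned long userspace_addr,
> > > > > > 				    u64 guest_pfn)
> > > > > > {
> > > > > > 	return (userspace_addr & (PMD_SIZE - 1)) ==
> > > > > > 	       ((guest_pfn << PAGE_SHIFT) & (PMD_SIZE - 1));
> > > > > > }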
> > > > > >
> > > > >
> > > > > But why is it a problem? If both the host and the hypervisor can
> > > > > map a huge page, but the guest can't, it's still a win, no?
> > > > > In other words, if the VMM passes a host huge-page-aligned region
> > > > > as guest-unaligned, it's a VMM problem, not a hypervisor problem.
> > > > > And I don't understand why we would want to prevent such cases.
> > > > >
> > > >
> > > > Fair enough -- mostly. If you want to allow the misaligned case and
> > > > live with not getting the 2M mapping in the guest, that works except
> > > > in the situation that I described above, where the
> > > > HVCALL_MAP_GPA_PAGES hypercall fails when creating a pinned region.
> > > >
> > > > The failure is flaky in that if the Linux instance in the root
> > > > partition does not map any of the region as a 2M page, the hypercall
> > > > succeeds and the MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the
> > > > root partition happens to map any of the region as a 2M page, the
> > > > hypercall will fail, and the MSHV_GET_GUEST_MEMORY ioctl will fail.
> > > > Presumably such flaky behavior is bad for the VMM.
> > > >
> > > > One solution is that mshv_chunk_stride() must return a stride > 1 only
> > > > if both the gfn (which it currently checks) AND the corresponding
> > > > userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
> > > > hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
> > > > misaligned case, and the failure won't occur.
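> > > >
> > > > Roughly like this (a sketch only, in the same kernel context as
> > > > mshv_chunk_stride(), with a made-up function name and signature --
> > > > it assumes the userspace address corresponding to the chunk is
> > > > available at this point, which the real code would have to arrange):
> > > >
> > > > static int chunk_stride_sketch(unsigned int page_order, u64 gfn,
> > > > 			       u64 page_count, unsigned long userspace_addr)
> > > > {
> > > > 	if (page_order &&
> > > > 	    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> > > > 	    IS_ALIGNED(page_count, PTRS_PER_PMD) &&
> > > > 	    IS_ALIGNED(userspace_addr, PMD_SIZE))
> > > > 		return 1 << page_order;
> > > > 	return 1;
> > > > }
> > > >
> > > > (Checking the alignment of the system PFN of the page directly would
> > > > catch the same situation without needing the userspace address here.)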
> > > >
> > >
> > > I think I see your point, but I also think this issue doesn't exist,
> > > because mshv_chunk_stride() returns a huge page stride iff:
> > > 1. the folio order is PMD_ORDER, and
> > > 2. the GFN is huge page aligned, and
> > > 3. the number of 4K pages is huge page aligned.
> > >
> > > In other words, a host huge page won't be mapped as huge if the page
> > > can't be mapped as huge in the guest.
> >
> > OK, I'm not seeing how what you say can be true. For pinned regions,
> > the memory is allocated and mapped into the host userspace address
> > first, as done by mshv_prepare_pinned_region() calling mshv_region_pin(),
> > which calls pin_user_pages_fast(). This is all done without considering
> > the GFN or GFN alignment. So one or more 2M pages might be allocated
> > and mapped in the host before any guest mapping is looked at. Agreed?
> >
>
> Agreed.
>
> > Then mshv_prepare_pinned_region() calls mshv_region_map() to do the
> > guest mapping. This eventually gets down to mshv_chunk_stride(). In
> > mshv_chunk_stride() when your conditions #2 and #3 are met, the
> > corresponding struct page argument to mshv_chunk_stride() may be a
> > 4K page that is in the middle of a 2M page instead of at the beginning
> > (if the region is mis-aligned). But the key point is that the 4K page in
> > the middle is part of a folio that will return a folio order of PMD_ORDER.
> > I.e., a folio order of PMD_ORDER is not sufficient to ensure that the
> > struct page arg is at the *start* of a 2M-aligned physical memory range
> > that can be mapped into the guest as a 2M page.
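> >
> > A small sketch of the distinction, using only generic mm helpers (the
> > function name is mine, not something in the driver):
> >
> > #include <linux/mm.h>
> > #include <linux/pgtable.h>	/* PMD_ORDER, PTRS_PER_PMD */
> >
> > static bool page_starts_2m_mappable_chunk(struct page *page)
> > {
> > 	/*
> > 	 * folio_order() reports PMD_ORDER for *every* 4K page of a 2M
> > 	 * folio, including pages in the middle. Only a page whose PFN is
> > 	 * 2M aligned (the folio's head page for a 2M THP) can start a 2M
> > 	 * mapping.
> > 	 */
> > 	return folio_order(page_folio(page)) == PMD_ORDER &&
> > 	       IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD);
> > }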
> >
>
> I'm trying to understand how this can even happen, so please bear with
> me.
> In other words (and AFAIU), what you are saying is the following:
>
> 1. VMM creates a mapping with huge page(s) (this implies that the
> virtual address is huge page aligned, the size is huge page aligned,
> and the physical pages are consecutive).
> 2. VMM tries to create a region via the ioctl, but instead of passing
> the start of the region, it passes an offset into one of the region's
> huge pages, while at the same time the base GFN and the size are huge
> page aligned (to meet the #2 and #3 conditions).
> 3. mshv_chunk_stride() sees a folio order of PMD_ORDER, and tries to map
> the corresponding pages as huge, which will be rejected by the
> hypervisor.
>
> Is this accurate?
Yes, pretty much. In Step 1, the VMM may just allocate some virtual
address space, and not do anything to populate it with physical pages.
So populating with any 2M pages may not happen until Step 2 when
the ioctl calls pin_user_pages_fast(). But *when* the virtual address
space gets populated with physical pages doesn't really matter. We
just know that it happens before the ioctl tries to map the memory
into the guest -- i.e., mshv_prepare_pinned_region() calls
mshv_region_map().
And yes, the problem is what you call out in Step 2: as input to the
ioctl, the fields "userspace_addr" and "guest_pfn" in struct
mshv_user_mem_region could have different alignments modulo 2M
boundaries. When they are different, that's what I'm calling a
"mis-aligned region" (referring to a struct mshv_mem_region that is
created and set up by the ioctl).
> A subsequent question: if it is accurate, why does the driver need to
> support this case? It looks like a VMM bug to me.
I don't know if the driver needs to support this case. That's a question
for the VMM people to answer. I wouldn't necessarily assume that the
VMM always allocates virtual address space with exactly the size and
alignment that matches the regions it creates with the ioctl. The
kernel ioctl doesn't care how the VMM allocates and manages its
virtual address space, so the VMM is free to do whatever it wants
in that regard, as long as it meets the requirements of the ioctl. So
the requirements of the ioctl in this case are something to be
negotiated with the VMM.
> Also, how should it support it? By rejecting such requests in the ioctl?
Rejecting requests to create a mis-aligned region is certainly one option
if the VMM agrees that's OK. The ioctl currently requires only that
"userspace_addr" and "size" be page aligned, so those requirements
could be tightened.
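For instance, the tightened check could look something like this (a
sketch only -- the field names are the ones from struct
mshv_user_mem_region mentioned above, but the variable name and the
exact placement of the check are just guesses):

	/*
	 * Reject a region whose userspace_addr and guest_pfn have
	 * different offsets modulo 2M. Whether small regions should be
	 * exempted is part of what would need to be agreed with the VMM.
	 */
	if ((mem->userspace_addr & (PMD_SIZE - 1)) !=
	    ((mem->guest_pfn << PAGE_SHIFT) & (PMD_SIZE - 1)))
		return -EINVAL;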
The other approach is to fix mshv_chunk_stride() to handle the
mis-aligned case. Doing so is even easier than I first envisioned.
I think this works:
@@ -49,7 +49,8 @@ static int mshv_chunk_stride(struct page *page,
 	 */
 	if (page_order &&
 	    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
-	    IS_ALIGNED(page_count, PTRS_PER_PMD))
+	    IS_ALIGNED(page_count, PTRS_PER_PMD) &&
+	    IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD))
 		return 1 << page_order;
 	return 1;
But as we discussed earlier, this fix means never getting 2M mappings
in the guest for a region that is mis-aligned.
Michael
>
> Thanks,
> Stanislav
>
> > The problem does *not* happen with a movable region, but the reasoning
> > is different. hmm_range_fault() is always called with a 2M range aligned
> > to the GFN, which in a mis-aligned region means that the host userspace
> > address is never 2M aligned. So hmm_range_fault() is never able to allocate
> > and map a 2M page. mshv_chunk_stride() will never get a folio order > 1,
> > and the hypercall is never asked to do a 2M mapping. Both host and guest
> > mappings will always be 4K and everything works.
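> >
> > Sketched with made-up names (not functions from the driver), the
> > arithmetic behind that is:
> >
> > #include <linux/align.h>
> > #include <linux/mm.h>
> > #include <linux/pgtable.h>
> >
> > static unsigned long chunk_uaddr(unsigned long region_uaddr,
> > 				 u64 region_start_gfn, u64 fault_gfn)
> > {
> > 	/* The faulting GFN is rounded down to a 2M-aligned chunk... */
> > 	u64 chunk_gfn = ALIGN_DOWN(fault_gfn, PTRS_PER_PMD);
> >
> > 	/* ...and the matching userspace address keeps the region's offset. */
> > 	return region_uaddr + ((chunk_gfn - region_start_gfn) << PAGE_SHIFT);
> > }
> >
> > The result modulo 2M equals the difference between the region's
> > userspace offset and its GFN offset modulo 2M, which is non-zero for
> > a mis-aligned region -- so hmm_range_fault() never sees a 2M-aligned
> > start, and the host never installs a 2M page.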
> >
> > Michael
> >
> > > And this function is called for both movable and pinned regions, so
> > > the hypercall should never fail due to a huge page alignment issue.
> > >
> > > What am I missing here?
> > >
> > > Thanks,
> > > Stanislav
> > >
> > >
> > > > Michael
> > > >
> > > > >
> > > > > > Movable regions behave a bit differently because the memory for
> > > > > > the region is not allocated on the host "up front" when the
> > > > > > region is created. The memory is faulted in as the guest runs,
> > > > > > and the vagaries of the current MSHV code in Linux are such that
> > > > > > 2M pages are never created on the host if the alignments don't
> > > > > > match. HV_MAP_GPA_LARGE_PAGE is never passed to the
> > > > > > HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > > > > > mappings, which works even with the misalignment.
> > > > > >
> > > > > > >
> > > > > > > This adjustment happens at runtime. Could this be the missing
> > > > > > > detail here?
> > > > > >
> > > > > > Adjustments at runtime are a different topic from the issue I'm
> > > > > > raising, though eventually there's some relationship. My issue
> > > > > > occurs in the creation of a new region, and the setting up of the
> > > > > > initial hypervisor mapping. I haven't thought through the details
> > > > > > of adjustments at runtime.
> > > > > >
> > > > > > My usual caveats apply -- this is all "thought experiment". If I
> > > > > > had the means to do some runtime testing to confirm, I would.
> > > > > > It's possible the hypervisor is playing some trick I haven't
> > > > > > envisioned, but I'm skeptical of that given the basics of how
> > > > > > physical processors work with page tables.
> > > > > >
> > > > > > Michael