From: Stanislav Kinsburskii <[email protected]> Sent: Friday, January 2, 2026 9:43 AM
>
> On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <[email protected]> Sent: Tuesday, December 23, 2025 8:26 AM
> > >
> > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > >
> > > > > [snip]
> > > > >
> > > > > Separately, in looking at this, I spotted another potential problem with 2 Meg mappings that somewhat depends on hypervisor behavior that I'm not clear on. To create a new region, the user space VMM issues the MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the size, and the guest PFN. The only requirement on these values is that the userspace address and size be page aligned. But suppose a 4 Meg region is specified where the userspace address and the guest PFN have different offsets modulo 2 Meg. The userspace address range gets populated first, and may contain a 2 Meg large page. Then when mshv_chunk_stride() detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told to create a 2 Meg mapping for the guest, the corresponding system PFN in the page array may not be 2 Meg aligned. What does the hypervisor do in this case? It can't create a 2 Meg mapping, right? So does it silently fall back to creating 4K mappings, or does it return an error? Returning an error would seem to be problematic for movable pages because the error wouldn't occur until the guest VM is running and takes a range fault on the region. Silently falling back to creating 4K mappings has performance implications, though I guess it would work. My question is whether the MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an error immediately.
> > > >
> > > > In thinking about this more, I can answer my own question about the hypervisor behavior. When HVCALL_MAP_GPA_PAGES is invoked with HV_MAP_GPA_LARGE_PAGE set, the full list of 4K system PFNs is not provided as an input to the hypercall, so the hypervisor cannot silently fall back to 4K mappings. Assuming sequential PFNs would be wrong, so it must return an error if a system PFN isn't aligned on a 2 Meg boundary.
> > > >
> > > > For a pinned region, this error happens in mshv_region_map() as called from mshv_prepare_pinned_region(), so it will propagate back to the ioctl. But the error happens only if pin_user_pages_fast() pins one or more 2 Meg pages. So creating a pinned region where the guest PFN and userspace address have different offsets modulo 2 Meg might or might not succeed.
> > > >
> > > > For a movable region, the error probably can't occur. mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk around the faulting guest PFN. mshv_region_range_fault() then determines the corresponding userspace addr, which won't be on a 2 Meg boundary, so the allocated memory won't contain a 2 Meg page. With no 2 Meg pages, mshv_region_remap_pages() will always do 4K mappings and will succeed. The downside is that a movable region with a guest PFN and userspace address with different offsets never gets any 2 Meg pages or mappings.
> > > >
> > > > My conclusion is the same -- such misalignment should not be allowed when creating a region that has the potential to use 2 Meg pages. Regions less than 2 Meg in size could be excluded from such a requirement if there is benefit in doing so. It's possible to have regions up to (but not including) 4 Meg where the alignment prevents having a 2 Meg page, and those could also be excluded from the requirement.
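To make the misaligned case concrete, here is a made-up example (the numbers are purely illustrative, not taken from any code): a 4 Meg region at userspace address 0x7f0000100000 (offset 1 Meg modulo 2 Meg) targeting guest PFN 0x100000 (guest physical address 0x100000000, offset 0 modulo 2 Meg). The only 2 Meg aligned host chunk fully inside the userspace range is 0x7f0000200000 - 0x7f00003fffff. If the host happens to back that chunk with a 2 Meg page, the page covers guest PFNs 0x100100 - 0x1002ff and straddles the guest 2 Meg boundary at guest PFN 0x100200. The system PFN backing guest PFN 0x100200 then sits in the middle of the host 2 Meg page and is not 2 Meg aligned -- exactly the case where a 2 Meg guest mapping cannot be created.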
> > >
> > > I'm not sure I understand the problem.
> > > There are three cases to consider:
> > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > 2. Host mapping, where page sizes are controlled by the host.
> >
> > And by "host", you mean specifically the Linux instance running in the root partition. It hosts the VMM processes and creates the memory regions for each guest.
> >
> > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > >
> > > The first case is not relevant here and is included for completeness.
> >
> > Agreed.
> >
> > > The second and third cases (host and hypervisor) share the memory layout,
> >
> > Right. More specifically, they are both operating on the same set of physical memory pages, and hence "share" a set of what I've referred to as "system PFNs" (to distinguish them from guest PFNs, or GFNs).
> >
> > > but it is up to each entity to decide which page sizes to use. For example, the host might map the proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> >
> > Agreed.
> >
> > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> >
> > Yes, that's possible, but subject to significant requirements. A 2M page can be used only if the underlying physical memory is a physically contiguous 2M chunk. Furthermore, that contiguous 2M chunk must start on a physical 2M boundary, and the virtual address to which it is being mapped must be on a 2M boundary. In the case of the host, that virtual address is the user space address in the user space process. In the case of the hypervisor, that "virtual address" is the location in guest physical address space; i.e., the guest PFN left-shifted 12 to form a guest physical address.
> >
> > These requirements come from the physical processor and its requirements on page table formats as specified by the hardware architecture. Whereas the page table entry for a 4K page contains the entire PFN, the page table entry for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero, which is equivalent to requiring that the PFN be on a 2M boundary. These requirements apply to both host and hypervisor mappings.
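As a minimal sketch of that constraint (illustrative names only, nothing from the actual mshv code): 2M / 4K = 512 = 2^9, so a PFN is 2M aligned exactly when its low order 9 bits are zero, and the same has to hold on both sides of the mapping -- the backing system PFN and the PFN-granular address it is mapped at, whether that is a host virtual PFN or a guest PFN.

#include <linux/align.h>
#include <linux/types.h>

/*
 * Illustrative helper, not actual mshv code: a 2M page table entry has
 * no field for the low 9 bits of the PFN, so a 2M mapping is possible
 * only when both the backing system PFN and the PFN-granular address
 * being mapped (host virtual PFN or guest PFN) are 2M aligned.
 */
static bool can_use_2m_entry(u64 sys_pfn, u64 mapped_at_pfn)
{
        return IS_ALIGNED(sys_pfn, 512) &&      /* 2M / 4K = 512 pages */
               IS_ALIGNED(mapped_at_pfn, 512);
}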
> >
> > When MSHV code in the host creates a new pinned region via the ioctl, it first populates and pins the memory for the region using pin_user_pages_fast(), which returns the system PFN for each page of physical memory backing the region. If the host, at its discretion, uses a 2M page, then a series of 512 sequential 4K PFNs is returned for that 2M page, and the first of those 512 PFNs has its low order 9 bits equal to zero.
> >
> > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for the hypervisor to map the allocated memory into the guest physical address space at a particular guest PFN. If the allocated memory contains a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page, causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that the hypervisor do that mapping as a 2M large page. The hypercall does not have the option of dropping back to 4K page mappings in this case. If the system PFN and the target guest PFN have different offsets modulo 2M, it's not possible to create the mapping and the hypercall fails.
> >
> > The core problem is that the same 2M of physical memory wants to be mapped by the host as a 2M page and by the hypervisor as a 2M page. That can't be done unless the host alignment (in the VMM virtual address space) and the guest physical address (i.e., the target guest PFN) alignment match and are both on 2M boundaries.
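If the conclusion above were adopted (disallow the misalignment at region-creation time), the check might look something like the sketch below. This is illustrative only; the function name, parameters, and placement in the ioctl handler are assumptions, not actual mshv code.

#include <linux/errno.h>
#include <linux/mm.h>           /* PAGE_SHIFT */
#include <linux/sizes.h>
#include <linux/types.h>

/*
 * Sketch only: reject a new region whose userspace address and guest
 * PFN have different offsets modulo 2M, unless the region is too
 * small to ever contain a 2M page.
 */
static int check_region_2m_alignment(unsigned long userspace_addr,
                                     u64 guest_pfn, u64 size)
{
        u64 uaddr_off = userspace_addr & (SZ_2M - 1);
        u64 gpa_off = (guest_pfn << PAGE_SHIFT) & (SZ_2M - 1);

        if (size < SZ_2M)       /* can never contain a 2M page */
                return 0;

        return uaddr_off == gpa_off ? 0 : -EINVAL;
}

Requiring only that the two offsets match, rather than that both be zero, should still allow interior 2M pages when the region itself starts at an unaligned address, since the guest-aligned 2M chunks then line up with host-aligned 2M chunks.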
>
> But why is it a problem? If both the host and the hypervisor can map a huge page, but the guest can't, it's still a win, no?
> In other words, if the VMM passes a host huge-page-aligned region at an unaligned guest address, it's a VMM problem, not a hypervisor problem. And I don't understand why we would want to prevent such cases.
>

Fair enough -- mostly. If you want to allow the misaligned case and live with not getting the 2M mapping in the guest, that works except in the situation that I described above, where the HVCALL_MAP_GPA_PAGES hypercall fails when creating a pinned region. The failure is flaky in that if Linux in the root partition does not map any of the region as a 2M page, the hypercall succeeds and the MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition happens to map any of the region as a 2M page, the hypercall will fail, and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such flaky behavior is bad for the VMM. One solution is for mshv_chunk_stride() to return a stride > 1 only if both the gfn (which it currently checks) AND the corresponding userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the misaligned case, and the failure won't occur.

Michael

> > Movable regions behave a bit differently because the memory for the region is not allocated on the host "up front" when the region is created. The memory is faulted in as the guest runs, and the vagaries of the current MSHV code in Linux are such that 2M pages are never created on the host if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K mappings, which works even with the misalignment.
> > >
> > > This adjustment happens at runtime. Could this be the missing detail here?
> >
> > Adjustments at runtime are a different topic from the issue I'm raising, though eventually there's some relationship. My issue occurs in the creation of a new region and the setting up of the initial hypervisor mapping. I haven't thought through the details of adjustments at runtime.
> >
> > My usual caveats apply -- this is all "thought experiment". If I had the means to do some runtime testing to confirm, I would. It's possible the hypervisor is playing some trick I haven't envisioned, but I'm skeptical of that given the basics of how physical processors work with page tables.
> >
> > Michael
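For what it's worth, a rough sketch of the mshv_chunk_stride() change suggested above. The real function's signature isn't quoted in this thread, so the name, parameters, and helper below are guesses and purely illustrative:

#include <linux/align.h>
#include <linux/sizes.h>
#include <linux/types.h>

#define PFNS_PER_2M     (SZ_2M / SZ_4K)         /* 512 */

/*
 * Illustrative sketch, not the actual mshv_chunk_stride(): report a
 * 2M stride (so HV_MAP_GPA_LARGE_PAGE gets requested) only when the
 * backing folio is at least 2M AND both the guest PFN and the
 * corresponding userspace address are 2M aligned.  Otherwise fall
 * back to a stride of 1, i.e. 4K mappings.
 */
static u64 chunk_stride_sketch(unsigned int folio_order, u64 gfn,
                               unsigned long userspace_addr)
{
        if (folio_order >= 9 &&
            IS_ALIGNED(gfn, PFNS_PER_2M) &&
            IS_ALIGNED(userspace_addr, SZ_2M))
                return PFNS_PER_2M;

        return 1;
}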
