On Mon, Dec 22, 2025 at 06:25:02PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <[email protected]> Sent: Friday,
> December 19, 2025 2:54 PM
> >
> > On Thu, Dec 18, 2025 at 07:41:24PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <[email protected]> Sent:
> > > Tuesday, December 16, 2025 4:41 PM
> > > >
> > > > Ensure that a stride larger than 1 (huge page) is only used when
> > > > both the guest frame number (gfn) and the operation size
> > > > (page_count) are aligned to the huge page size (PTRS_PER_PMD).
> > > > This matches the hypervisor requirement that map/unmap operations
> > > > for huge pages must be guest-aligned and cover a full huge page.
> > > >
> > > > Add mshv_chunk_stride() to encapsulate this alignment and
> > > > page-order validation, and plumb a huge_page flag into the region
> > > > chunk handlers. This prevents issuing large-page map/unmap/share
> > > > operations that the hypervisor would reject due to misaligned
> > > > guest mappings.
> > >
> > > This code looks good to me on the surface. But I can only make an
> > > educated guess as to the hypervisor behavior in certain situations,
> > > and if my guess is correct there's still a flaw in one case.
> > >
> > > Consider the madvise() DONTNEED experiment that I previously called
> > > out. [1] I surmise that the intent of this patch is to make that
> > > case work correctly. When the .invalidate callback is made for the
> > > 32 Kbyte range embedded in a previously mapped 2 Meg page, this new
> > > code detects that case. It calls the hypervisor to remap the 32
> > > Kbyte range for no access, and clears the 8 corresponding entries
> > > in the struct page array attached to the mshv region. The call to
> > > the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
> > > Since the mapping was originally done *with* the
> > > HV_MAP_GPA_LARGE_PAGE flag, my guess is that the hypervisor is
> > > smart enough to handle this case by splitting the 2 Meg mapping it
> > > created, setting the 32 Kbyte range to no access, and returning
> > > "success". If my guess is correct, there's no problem here.
> > >
> > > But then there's a second .invalidate callback for the entire 2 Meg
> > > page. Here's the call stack:
> > >
> > > [ 194.259337] dump_stack+0x14/0x20
> > > [ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
> > > [ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
> > > [ 194.259347] __split_huge_pmd+0x14f/0x170
> > > [ 194.259349] unmap_page_range+0x104d/0x1a00
> > > [ 194.259358] unmap_single_vma+0x7d/0xc0
> > > [ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
> > > [ 194.259363] madvise_vma_behavior+0xb01/0xc00
> > > [ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
> > > [ 194.259375] do_madvise+0xc7/0x170
> > > [ 194.259380] __x64_sys_madvise+0x2f/0x40
> > > [ 194.259382] x64_sys_call+0x1d77/0x21b0
> > > [ 194.259385] do_syscall_64+0x56/0x640
> > > [ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > In __split_huge_pmd(), the .invalidate callback is made *before*
> > > the 2 Meg page is actually split by the root partition. So
> > > mshv_chunk_stride() returns "9" for the stride, and the hypervisor
> > > is called with HV_MAP_GPA_LARGE_PAGE set. My guess is that the
> > > hypervisor returns an error because it has already split the
> > > mapping. The whole point of this patch set is to avoid passing
> > > HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor mapping
> > > is not a large page mapping, but this looks like a case where it
> > > still happens.
> > >
> > > My concern is solely from looking at the code and thinking about
> > > the problem, as I don't have an environment where I can test root
> > > partition interactions with the hypervisor. So maybe I'm missing
> > > something. Lemme know what you think .....
> > >
> >
> > Yeah, I see your point: according to this stack, once a part of the
> > page is invalidated, the folio order remains the same until another
> > invocation of the same callback - this time for the whole huge page -
> > is made. Thus, the stride is still reported as the huge page size,
> > even though a part of the page has already been unmapped.
> >
> > This indeed looks like a flaw in the current approach, but it's
> > actually not. The reason is that upon the invalidation callback, the
> > driver simply remaps the whole huge page with no access (in this
> > case, the PFNs provided to the hypervisor are zero), and it's fine as
> > the hypervisor simply drops all the pages from the previous mapping
> > and marks this page as inaccessible. The only check the hypervisor
> > makes in this case is that both the GFN and mapping size are huge
> > page aligned (which they are in this case).
> >
> > I hope this clarifies the situation. Please let me know if you have
> > any other questions.
>
> Thanks. Yes, this clarifies. My guess about the hypervisor behavior
> was wrong. Based on what you've said about what the hypervisor does,
> and further studying MSHV code, here's my recap of the
> HV_MAP_GPA_LARGE_PAGE flag:
>
> 1. The hypervisor uses the flag to determine the granularity (4K or
> 2M) of the mapping HVCALL_MAP_GPA_PAGES or HVCALL_UNMAP_GPA_PAGES will
> create/remove. As such, the hypercall "repcount" is in this
> granularity. GFNs, such as the target_gpa_base input parameter and
> GFNs in the pfn_array, are always 4K GFNs, but if the flag is set, a
> GFN is treated as the first 4K GFN in a contiguous 2M range. If the
> flag is set, the target_gpa_base GFN must be 2M aligned.
>
> 2. The hypervisor doesn't care whether any existing mapping is 4K or
> 2M. It always removes an existing mapping, including splitting any 2M
> mappings if necessary. Then if the operation is to create/re-create a
> mapping, it creates an appropriate new mapping.
>
> My error was in thinking that the flag had to match any existing
> mapping. But the behavior you've clarified is certainly better. It
> handles the vagaries of the Linux "mm" subsystem, which in one case in
> my original experiment (madvise) invalidates the small range, then the
> 2M range, but the other case (mprotect) invalidates the 2M range, then
> the small range.
>
> Since there's no documentation for these root partition hypercalls, it
> sure would be nice if this info could be captured in code comments for
> some future developer to benefit from. If that's not something you
> want to worry about, I could submit a patch later to add the code
> comments (subject to your review, of course).
>
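For anyone digging through the archives later, the stride selection we
have been discussing reduces to two alignment tests. A simplified
sketch of the idea, illustrative only and not the exact code from the
patch (per the discussion above, the real helper expresses the stride
as a page order, returning 9 for a 2M chunk):

	/*
	 * Illustrative sketch, not the actual driver code: a huge-page
	 * order may be used only when both the guest frame number and
	 * the operation size cover whole, aligned 2M pages.
	 */
	static unsigned int mshv_chunk_stride(u64 gfn, u64 page_count,
					      bool huge_page)
	{
		if (huge_page &&
		    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
		    IS_ALIGNED(page_count, PTRS_PER_PMD))
			return HPAGE_PMD_ORDER;	/* order 9: 2M chunks */

		return 0;			/* order 0: 4K chunks */
	}

Anything failing either test goes to the hypervisor as plain 4K
operations, without HV_MAP_GPA_LARGE_PAGE.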
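As for capturing this in code comments: your two numbered points above
are exactly the information worth preserving. As a rough starting
point, with wording that is entirely tentative, a comment near the
map/unmap hypercall wrappers could read:

	/*
	 * HV_MAP_GPA_LARGE_PAGE (no public spec; behavior as observed):
	 *
	 * 1. The flag selects the granularity (4K or 2M) at which
	 *    HVCALL_MAP_GPA_PAGES / HVCALL_UNMAP_GPA_PAGES create or
	 *    remove mappings, and the hypercall rep count is in that
	 *    granularity.  target_gpa_base and the pfn_array entries
	 *    are always 4K GFNs; with the flag set, each is treated as
	 *    the first 4K GFN of a contiguous 2M range, and
	 *    target_gpa_base must be 2M aligned.
	 *
	 * 2. The hypervisor does not care whether an existing mapping
	 *    is 4K or 2M.  It always removes the existing mapping,
	 *    splitting a 2M mapping if necessary, and then creates the
	 *    requested new mapping, if any.
	 */
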
Please feel free to add the comments you see fit. I think you'll do it
better, as you have a better understanding of what needs to be
documented.

Thanks,
Stanislav

> Separately, in looking at this, I spotted another potential problem
> with 2 Meg mappings that somewhat depends on hypervisor behavior that
> I'm not clear on. To create a new region, the user space VMM issues
> the MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address,
> the size, and the guest PFN. The only requirement on these values is
> that the userspace address and size be page aligned. But suppose a 4
> Meg region is specified where the userspace address and the guest PFN
> have different offsets modulo 2 Meg. The userspace address range gets
> populated first, and may contain a 2 Meg large page. Then when
> mshv_chunk_stride() detects a 2 Meg aligned guest PFN so
> HVCALL_MAP_GPA_PAGES can be told to create a 2 Meg mapping for the
> guest, the corresponding system PFN in the page array may not be 2
> Meg aligned. What does the hypervisor do in this case? It can't
> create a 2 Meg mapping, right? So does it silently fall back to
> creating 4K mappings, or does it return an error? Returning an error
> would seem to be problematic for movable pages because the error
> wouldn't occur until the guest VM is running and takes a range fault
> on the region. Silently falling back to creating 4K mappings has
> performance implications, though I guess it would work. My question
> is whether the MSHV_GET_GUEST_MEMORY ioctl should detect this case
> and return an error immediately.
>
> Michael
> >
> > > [1] https://lore.kernel.org/linux-hyperv/sn6pr02mb4157978dfaa6c2584d0678e1d4...@sn6pr02mb4157.namprd02.prod.outlook.com/
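
For concreteness, the up-front check you are suggesting could be as
small as comparing the two offsets modulo 2M at region-creation time.
A minimal sketch, with invented function and parameter names rather
than the actual uapi or driver fields:

	/*
	 * Hypothetical helper for the MSHV_GET_GUEST_MEMORY path: if
	 * the userspace address and the guest address disagree modulo
	 * 2M, no 2M guest mapping can ever be backed by a 2M host page.
	 */
	static int mshv_region_hugepage_compatible(u64 userspace_addr,
						   u64 guest_pfn)
	{
		u64 gpa = guest_pfn << PAGE_SHIFT;

		if ((userspace_addr ^ gpa) & (PMD_SIZE - 1))
			return -EINVAL;	/* offsets differ modulo 2M */

		return 0;
	}

Whether to fail the ioctl outright like this, or merely to disable 2M
mappings for such a region, is of course exactly the policy question
you raise.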
