From: Stanislav Kinsburskii <[email protected]> Sent: Friday, December 19, 2025 2:54 PM
>
> On Thu, Dec 18, 2025 at 07:41:24PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <[email protected]> Sent: Tuesday, December 16, 2025 4:41 PM
> > >
> > > Ensure that a stride larger than 1 (huge page) is only used when both
> > > the guest frame number (gfn) and the operation size (page_count) are
> > > aligned to the huge page size (PTRS_PER_PMD). This matches the
> > > hypervisor requirement that map/unmap operations for huge pages must
> > > be guest-aligned and cover a full huge page.
> > >
> > > Add mshv_chunk_stride() to encapsulate this alignment and page-order
> > > validation, and plumb a huge_page flag into the region chunk handlers.
> > > This prevents issuing large-page map/unmap/share operations that the
> > > hypervisor would reject due to misaligned guest mappings.
> >
> > This code looks good to me on the surface. But I can only make an
> > educated guess as to the hypervisor behavior in certain situations, and
> > if my guess is correct there's still a flaw in one case.
> >
> > Consider the madvise() DONTNEED experiment that I previously called
> > out. [1] I surmise that the intent of this patch is to make that case
> > work correctly. When the .invalidate callback is made for the 32 Kbyte
> > range embedded in a previously mapped 2 Meg page, this new code detects
> > that case. It calls the hypervisor to remap the 32 Kbyte range for no
> > access, and clears the 8 corresponding entries in the struct page array
> > attached to the mshv region. The call to the hypervisor is made
> > *without* the HV_MAP_GPA_LARGE_PAGE flag. Since the mapping was
> > originally done *with* the HV_MAP_GPA_LARGE_PAGE flag, my guess is
> > that the hypervisor is smart enough to handle this case by splitting
> > the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
> > and returning "success".
> > If my guess is correct, there's no problem here.
> >
> > But then there's a second .invalidate callback for the entire 2 Meg
> > page. Here's the call stack:
> >
> > [ 194.259337] dump_stack+0x14/0x20
> > [ 194.259339] mhktest_invalidate+0x2a/0x40  [my dummy invalidate callback]
> > [ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
> > [ 194.259347] __split_huge_pmd+0x14f/0x170
> > [ 194.259349] unmap_page_range+0x104d/0x1a00
> > [ 194.259358] unmap_single_vma+0x7d/0xc0
> > [ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
> > [ 194.259363] madvise_vma_behavior+0xb01/0xc00
> > [ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
> > [ 194.259375] do_madvise+0xc7/0x170
> > [ 194.259380] __x64_sys_madvise+0x2f/0x40
> > [ 194.259382] x64_sys_call+0x1d77/0x21b0
> > [ 194.259385] do_syscall_64+0x56/0x640
> > [ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > In __split_huge_pmd(), the .invalidate callback is made *before* the
> > 2 Meg page is actually split by the root partition. So
> > mshv_chunk_stride() returns "9" for the stride, and the hypervisor is
> > called with HV_MAP_GPA_LARGE_PAGE set. My guess is that the hypervisor
> > returns an error because it has already split the mapping. The whole
> > point of this patch set is to avoid passing HV_MAP_GPA_LARGE_PAGE to
> > the hypervisor when the hypervisor mapping is not a large page mapping,
> > but this looks like a case where it still happens.
> >
> > My concern is solely from looking at the code and thinking about the
> > problem, as I don't have an environment where I can test root partition
> > interactions with the hypervisor. So maybe I'm missing something. Let
> > me know what you think.
>
> Yeah, I see your point: according to this stack, once a part of the page
> is invalidated, the folio order remains the same until another
> invocation of the same callback - this time for the whole huge page -
> is made.
> Thus, the stride is still reported as the huge page size, even though a
> part of the page has already been unmapped.
>
> This indeed looks like a flaw in the current approach, but it's actually
> not. The reason is that upon the invalidation callback, the driver
> simply remaps the whole huge page with no access (in this case, the PFNs
> provided to the hypervisor are zero), and it's fine as the hypervisor
> simply drops all the pages from the previous mapping and marks this page
> as inaccessible. The only check the hypervisor makes in this case is
> that both the GFN and the mapping size are huge page aligned (which they
> are in this case).
>
> I hope this clarifies the situation. Please let me know if you have any
> other questions.
Thanks. Yes, this clarifies. My guess about the hypervisor behavior was
wrong.

Based on what you've said about what the hypervisor does, and on further
studying the MSHV code, here's my recap of the HV_MAP_GPA_LARGE_PAGE flag:

1. The hypervisor uses the flag to determine the granularity (4K or 2M)
   of the mapping that HVCALL_MAP_GPA_PAGES or HVCALL_UNMAP_GPA_PAGES
   will create/remove. As such, the hypercall "repcount" is in this
   granularity. GFNs, such as the target_gpa_base input parameter and
   the GFNs in the pfn_array, are always 4K GFNs, but if the flag is
   set, a GFN is treated as the first 4K GFN in a contiguous 2M range.
   If the flag is set, the target_gpa_base GFN must be 2M aligned.

2. The hypervisor doesn't care whether any existing mapping is 4K or 2M.
   It always removes an existing mapping, including splitting any 2M
   mappings if necessary. Then, if the operation is to create/re-create
   a mapping, it creates an appropriate new mapping.

My error was in thinking that the flag had to match any existing mapping.
But the behavior you've clarified is certainly better. It handles the
vagaries of the Linux "mm" subsystem, which in one case in my original
experiment (madvise) invalidates the small range and then the 2M range,
while in the other case (mprotect) it invalidates the 2M range and then
the small range.

Since there's no documentation for these root partition hypercalls, it
sure would be nice if this info could be captured in code comments for
some future developer to benefit from. If that's not something you want
to worry about, I could submit a patch later to add the code comments
(subject to your review, of course).

Separately, in looking at this, I spotted another potential problem with
2 Meg mappings that somewhat depends on hypervisor behavior I'm not clear
on. To create a new region, the user space VMM issues the
MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the size,
and the guest PFN.
The only requirement on these values is that the userspace address and
size be page aligned. But suppose a 4 Meg region is specified where the
userspace address and the guest PFN have different offsets modulo 2 Meg.
The userspace address range gets populated first, and may contain a 2 Meg
large page. Then when mshv_chunk_stride() detects a 2 Meg aligned guest
PFN, so that HVCALL_MAP_GPA_PAGES can be told to create a 2 Meg mapping
for the guest, the corresponding system PFN in the page array may not be
2 Meg aligned.

What does the hypervisor do in this case? It can't create a 2 Meg
mapping, right? So does it silently fall back to creating 4K mappings,
or does it return an error? Returning an error would seem to be
problematic for movable pages because the error wouldn't occur until the
guest VM is running and takes a range fault on the region. Silently
falling back to creating 4K mappings has performance implications,
though I guess it would work. My question is whether the
MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an error
immediately.

Michael

> >
> > [1] https://lore.kernel.org/linux-hyperv/sn6pr02mb4157978dfaa6c2584d0678e1d4...@sn6pr02mb4157.namprd02.prod.outlook.com/
