From: Stanislav Kinsburskii <[email protected]> Sent: Tuesday, December 16, 2025 4:41 PM
>
> Ensure that a stride larger than 1 (huge page) is only used when both
> the guest frame number (gfn) and the operation size (page_count) are
> aligned to the huge page size (PTRS_PER_PMD). This matches the
> hypervisor requirement that map/unmap operations for huge pages must be
> guest-aligned and cover a full huge page.
>
> Add mshv_chunk_stride() to encapsulate this alignment and page-order
> validation, and plumb a huge_page flag into the region chunk handlers.
> This prevents issuing large-page map/unmap/share operations that the
> hypervisor would reject due to misaligned guest mappings.
This code looks good to me on the surface. But I can only make an educated
guess as to the hypervisor behavior in certain situations, and if my guess
is correct, there's still a flaw in one case.

Consider the madvise() DONTNEED experiment that I previously called out. [1]
I surmise that the intent of this patch is to make that case work correctly.
When the .invalidate callback is made for the 32 Kbyte range embedded in a
previously mapped 2 Meg page, this new code detects that case. It calls the
hypervisor to remap the 32 Kbyte range for no access, and clears the 8
corresponding entries in the struct page array attached to the mshv region.
The call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE flag,
my guess is that the hypervisor is smart enough to handle this case by
splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no
access, and returning "success". If my guess is correct, there's no problem
here.

But then there's a second .invalidate callback for the entire 2 Meg page.
Here's the call stack:

[ 194.259337] dump_stack+0x14/0x20
[ 194.259339] mhktest_invalidate+0x2a/0x40     [my dummy invalidate callback]
[ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
[ 194.259347] __split_huge_pmd+0x14f/0x170
[ 194.259349] unmap_page_range+0x104d/0x1a00
[ 194.259358] unmap_single_vma+0x7d/0xc0
[ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
[ 194.259363] madvise_vma_behavior+0xb01/0xc00
[ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
[ 194.259375] do_madvise+0xc7/0x170
[ 194.259380] __x64_sys_madvise+0x2f/0x40
[ 194.259382] x64_sys_call+0x1d77/0x21b0
[ 194.259385] do_syscall_64+0x56/0x640
[ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e

In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
page is actually split by the root partition. So mshv_chunk_stride() sees
PMD_ORDER (9) and returns a 512-page stride, and the hypervisor is called
with HV_MAP_GPA_LARGE_PAGE set. My guess is that the hypervisor returns an
error because it has already split the mapping. The whole point of this
patch set is to avoid passing HV_MAP_GPA_LARGE_PAGE to the hypervisor when
the hypervisor mapping is not a large page mapping, but this looks like a
case where it still happens.
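To make the arithmetic concrete, here's a quick user-space sketch of just
the alignment checks for the two callbacks. It mirrors mshv_chunk_stride()
but is *not* the driver code; PMD_ORDER == 9 (x86-64 with 4K base pages)
and the gfn value are assumptions I picked for illustration:

#include <stdio.h>

#define PMD_ORDER        9
#define PTRS_PER_PMD     (1 << PMD_ORDER)              /* 512 */
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Same checks as mshv_chunk_stride(), with folio_order() as a parameter */
static int chunk_stride(unsigned int page_order, unsigned long long gfn,
                        unsigned long long page_count)
{
        if (page_order && page_order != PMD_ORDER)
                return -1;
        if (page_order && IS_ALIGNED(gfn, PTRS_PER_PMD) &&
            IS_ALIGNED(page_count, PTRS_PER_PMD))
                return 1 << page_order;
        return 1;
}

int main(void)
{
        /*
         * First callback: the 32 Kbyte (8 page) range, still backed by the
         * 2 Meg folio. page_count isn't a multiple of 512, so the stride
         * is 1 and the chunk handlers are called with huge_page == false.
         */
        printf("32K invalidate: stride = %d\n",
               chunk_stride(PMD_ORDER, 0x80000, 8));

        /*
         * Second callback from __split_huge_pmd(): the full, aligned 2 Meg
         * range, before the folio is actually split. Both alignment checks
         * pass, so the stride is 512 and HV_MAP_GPA_LARGE_PAGE gets set
         * even though the hypervisor mapping may already have been split.
         */
        printf("2M invalidate:  stride = %d\n",
               chunk_stride(PMD_ORDER, 0x80000, 512));
        return 0;
}

The first call yields a stride of 1 (no LARGE_PAGE flag); the second yields
512, which is the case I'm worried about.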
My concern is solely from looking at the code and thinking about the
problem, as I don't have an environment where I can test root partition
interactions with the hypervisor. So maybe I'm missing something.

Lemme know what you think .....

Michael

[1] https://lore.kernel.org/linux-hyperv/sn6pr02mb4157978dfaa6c2584d0678e1d4...@sn6pr02mb4157.namprd02.prod.outlook.com/

>
> Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
> Signed-off-by: Stanislav Kinsburskii <[email protected]>
> ---
>  drivers/hv/mshv_regions.c |   94 ++++++++++++++++++++++++++++++---------------
>  1 file changed, 63 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 30bacba6aec3..29776019bcde 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -19,6 +19,42 @@
>
>  #define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
>
> +/**
> + * mshv_chunk_stride - Compute stride for mapping guest memory
> + * @page      : The page to check for huge page backing
> + * @gfn       : Guest frame number for the mapping
> + * @page_count: Total number of pages in the mapping
> + *
> + * Determines the appropriate stride (in pages) for mapping guest memory.
> + * Uses huge page stride if the backing page is huge and the guest mapping
> + * is properly aligned; otherwise falls back to single page stride.
> + *
> + * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + */
> +static int mshv_chunk_stride(struct page *page,
> +                             u64 gfn, u64 page_count)
> +{
> +        unsigned int page_order;
> +
> +        page_order = folio_order(page_folio(page));
> +        /* The hypervisor only supports 4K and 2M page sizes */
> +        if (page_order && page_order != PMD_ORDER)
> +                return -EINVAL;
> +
> +        /*
> +         * Default to a single page stride. If page_order is set and both
> +         * the guest frame number (gfn) and page_count are huge-page
> +         * aligned (PTRS_PER_PMD), use a larger stride so the mapping can
> +         * be backed by a huge page in both guest and hypervisor.
> +         */
> +        if (page_order &&
> +            IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> +            IS_ALIGNED(page_count, PTRS_PER_PMD))
> +                return 1 << page_order;
> +
> +        return 1;
> +}
> +
>  /**
>   * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
>   *                             in a region.
> @@ -45,25 +81,23 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>                                        int (*handler)(struct mshv_mem_region *region,
>                                                       u32 flags,
>                                                       u64 page_offset,
> -                                                     u64 page_count))
> +                                                     u64 page_count,
> +                                                     bool huge_page))
>  {
> -        u64 count, stride;
> -        unsigned int page_order;
> +        u64 gfn = region->start_gfn + page_offset;
> +        u64 count;
>          struct page *page;
> -        int ret;
> +        int stride, ret;
>
>          page = region->pages[page_offset];
>          if (!page)
>                  return -EINVAL;
>
> -        page_order = folio_order(page_folio(page));
> -        /* The hypervisor only supports 4K and 2M page sizes */
> -        if (page_order && page_order != PMD_ORDER)
> -                return -EINVAL;
> -
> -        stride = 1 << page_order;
> +        stride = mshv_chunk_stride(page, gfn, page_count);
> +        if (stride < 0)
> +                return stride;
>
> -        /* Start at stride since the first page is validated */
> +        /* Start at stride since the first stride is validated */
>          for (count = stride; count < page_count; count += stride) {
>                  page = region->pages[page_offset + count];
>
> @@ -71,12 +105,13 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>                  if (!page)
>                          break;
>
> -                /* Break if page size changes */
> -                if (page_order != folio_order(page_folio(page)))
> +                /* Break if stride size changes */
> +                if (stride != mshv_chunk_stride(page, gfn + count,
> +                                                page_count - count))
>                          break;
>          }
>
> -        ret = handler(region, flags, page_offset, count);
> +        ret = handler(region, flags, page_offset, count, stride > 1);
>          if (ret)
>                  return ret;
>
> @@ -108,7 +143,8 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
>                                       int (*handler)(struct mshv_mem_region *region,
>                                                      u32 flags,
>                                                      u64 page_offset,
> -                                                    u64 page_count))
> +                                                    u64 page_count,
> +                                                    bool huge_page))
>  {
>          long ret;
>
> @@ -162,11 +198,10 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>
>  static int mshv_region_chunk_share(struct mshv_mem_region *region,
>                                     u32 flags,
> -                                   u64 page_offset, u64 page_count)
> +                                   u64 page_offset, u64 page_count,
> +                                   bool huge_page)
>  {
> -        struct page *page = region->pages[page_offset];
> -
> -        if (PageHuge(page) || PageTransCompound(page))
> +        if (huge_page)
>                  flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
>          return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -188,11 +223,10 @@ int mshv_region_share(struct mshv_mem_region *region)
>
>  static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
>                                       u32 flags,
> -                                     u64 page_offset, u64 page_count)
> +                                     u64 page_offset, u64 page_count,
> +                                     bool huge_page)
>  {
> -        struct page *page = region->pages[page_offset];
> -
> -        if (PageHuge(page) || PageTransCompound(page))
> +        if (huge_page)
>                  flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
>          return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -212,11 +246,10 @@ int mshv_region_unshare(struct mshv_mem_region *region)
>
>  static int mshv_region_chunk_remap(struct mshv_mem_region *region,
>                                     u32 flags,
> -                                   u64 page_offset, u64 page_count)
> +                                   u64 page_offset, u64 page_count,
> +                                   bool huge_page)
>  {
> -        struct page *page = region->pages[page_offset];
> -
> -        if (PageHuge(page) || PageTransCompound(page))
> +        if (huge_page)
>                  flags |= HV_MAP_GPA_LARGE_PAGE;
>
>          return hv_call_map_gpa_pages(region->partition->pt_id,
> @@ -295,11 +328,10 @@ int mshv_region_pin(struct mshv_mem_region *region)
>
>  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
>                                     u32 flags,
> -                                   u64 page_offset, u64 page_count)
> +                                   u64 page_offset, u64 page_count,
> +                                   bool huge_page)
>  {
> -        struct page *page = region->pages[page_offset];
> -
> -        if (PageHuge(page) || PageTransCompound(page))
> +        if (huge_page)
>                  flags |= HV_UNMAP_GPA_LARGE_PAGE;
>
>          return hv_call_unmap_gpa_pages(region->partition->pt_id,
>
>
