On Thu, Jan 08, 2026 at 11:56:03AM +1100, Balbir Singh wrote:
> On 1/8/26 08:03, Zi Yan wrote:
> > On 7 Jan 2026, at 16:15, Matthew Brost wrote:
> >
> >> On Wed, Jan 07, 2026 at 03:38:35PM -0500, Zi Yan wrote:
> >>> On 7 Jan 2026, at 15:20, Zi Yan wrote:
> >>>
> >>>> +THP folks
> >>>
> >>> +willy, since he commented in another thread.
> >>>
> >>>>
> >>>> On 16 Dec 2025, at 15:10, Francois Dugast wrote:
> >>>>
> >>>>> From: Matthew Brost <[email protected]>
> >>>>>
> >>>>> Introduce migrate_device_split_page() to split a device page into
> >>>>> lower-order pages. Used when a folio allocated as higher-order is freed
> >>>>> and later reallocated at a smaller order by the driver memory manager.
> >>>>>
> >>>>> Cc: Andrew Morton <[email protected]>
> >>>>> Cc: Balbir Singh <[email protected]>
> >>>>> Cc: [email protected]
> >>>>> Cc: [email protected]
> >>>>> Signed-off-by: Matthew Brost <[email protected]>
> >>>>> Signed-off-by: Francois Dugast <[email protected]>
> >>>>> ---
> >>>>>  include/linux/huge_mm.h |  3 +++
> >>>>>  include/linux/migrate.h |  1 +
> >>>>>  mm/huge_memory.c        |  6 ++---
> >>>>>  mm/migrate_device.c     | 49 +++++++++++++++++++++++++++++++++++++++++
> >>>>>  4 files changed, 56 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>>> index a4d9f964dfde..6ad8f359bc0d 100644
> >>>>> --- a/include/linux/huge_mm.h
> >>>>> +++ b/include/linux/huge_mm.h
> >>>>> @@ -374,6 +374,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> >>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> >>>>>  unsigned int min_order_for_split(struct folio *folio);
> >>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
> >>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
> >>>>> +                struct page *split_at, struct xa_state *xas,
> >>>>> +                struct address_space *mapping, enum split_type split_type);
> >>>>>  int folio_check_splittable(struct folio *folio, unsigned int new_order,
> >>>>>                             enum split_type split_type);
> >>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> >>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>>>> index 26ca00c325d9..ec65e4fd5f88 100644
> >>>>> --- a/include/linux/migrate.h
> >>>>> +++ b/include/linux/migrate.h
> >>>>> @@ -192,6 +192,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> >>>>>                          unsigned long npages);
> >>>>>  void migrate_device_finalize(unsigned long *src_pfns,
> >>>>>                          unsigned long *dst_pfns, unsigned long npages);
> >>>>> +int migrate_device_split_page(struct page *page);
> >>>>>
> >>>>> #endif /* CONFIG_MIGRATION */
> >>>>>
> >>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>>> index 40cf59301c21..7ded35a3ecec 100644
> >>>>> --- a/mm/huge_memory.c
> >>>>> +++ b/mm/huge_memory.c
> >>>>> @@ -3621,9 +3621,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> >>>>>   * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
> >>>>>   * split but not to @new_order, the caller needs to check)
> >>>>>   */
> >>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
> >>>>> -                struct page *split_at, struct xa_state *xas,
> >>>>> -                struct address_space *mapping, enum split_type split_type)
> >>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
> >>>>> +                struct page *split_at, struct xa_state *xas,
> >>>>> +                struct address_space *mapping, enum split_type split_type)
> >>>>>  {
> >>>>>          const bool is_anon = folio_test_anon(folio);
> >>>>>          int old_order = folio_order(folio);
> >>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >>>>> index 23379663b1e1..eb0f0e938947 100644
> >>>>> --- a/mm/migrate_device.c
> >>>>> +++ b/mm/migrate_device.c
> >>>>> @@ -775,6 +775,49 @@ int migrate_vma_setup(struct migrate_vma *args)
> >>>>> EXPORT_SYMBOL(migrate_vma_setup);
> >>>>>
> >>>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> >>>>> +/**
> >>>>> + * migrate_device_split_page() - Split device page
> >>>>> + * @page: Device page to split
> >>>>> + *
> >>>>> + * Splits a device page into smaller pages. Typically called when
> >>>>> + * reallocating a folio to a smaller size. Inherently racy; only safe if
> >>>>> + * the caller ensures mutual exclusion within the page's folio (i.e., no
> >>>>> + * other threads are using pages within the folio). Expected to be called
> >>>>> + * on a free device page and restores all split-out pages to a free state.
> >>>>> + */
> >>>
> >>> Do you mind explaining why __split_unmapped_folio() is needed for a free
> >>> device page? A free page is not supposed to be a large folio, at least
> >>> from a core MM point of view. __split_unmapped_folio() is intended to work
> >>> on large folios (or compound pages), even if the input folio has
> >>> refcount == 0 (because it is frozen).
> >>>
> >>
> >> Well, then maybe this is a bug in core MM where the freed page is still
> >> a THP. Let me explain the scenario and why this is needed from my POV.
> >>
> >> Our VRAM allocator in Xe (and several other DRM drivers) is DRM buddy.
> >> This is a shared pool between traditional DRM GEMs (buffer objects) and
> >> SVM allocations (pages). It doesn’t have any view of the page backing—it
> >> basically just hands back a pointer to VRAM space that we allocate from.
> >> From that, if it’s an SVM allocation, we can derive the device pages.
> >>
> >> What I see happening is: a 2M buddy allocation occurs, we make the
> >> backing device pages a large folio, and sometime later the folio
> >> refcount goes to zero and we free the buddy allocation. Later, the buddy
> >> allocation is reused for a smaller allocation (e.g., 4K or 64K), but the
> >> backing pages are still a large folio. Here is where we need to split
> >> the folio.
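
To make that sequence concrete, the rough lifecycle on the driver side looks
something like the sketch below. This is illustrative only; xe_buddy_alloc()
is a placeholder for the DRM buddy allocation path, not the actual Xe code:

/* Illustrative sketch only; xe_buddy_alloc() is a placeholder helper. */
static void svm_realloc_example(void)
{
        struct page *page;

        /* 2M SVM allocation backed by an order-9 device folio. */
        page = xe_buddy_alloc(SZ_2M);
        zone_device_page_init(page, HPAGE_PMD_ORDER);

        /*
         * ... folio is used; when the last reference is dropped, the
         * driver's folio_free() callback returns the VRAM block to the
         * buddy allocator ...
         */
        folio_put(page_folio(page));

        /*
         * The same VRAM block is later reused for a 4K allocation, but the
         * backing struct pages still carry the order-9 compound metadata,
         * so they must be split (or the metadata torn down at free time)
         * before re-initializing at the smaller order.
         */
        page = xe_buddy_alloc(SZ_4K);
        zone_device_page_init(page, 0);
}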
> >
> > I agree with you that it might be a bug in free_zone_device_folio(), based
> > on my understanding: zone_device_page_init() calls prep_compound_page()
> > for orders > 0, but free_zone_device_folio() never reverses the process.
> >
> > Balbir and Alistair might be able to help here.
>
> I agree it's an API limitation
>
> >
> > I cherry-picked the code from __free_frozen_pages() to reverse the process.
> > Can you give it a try to see if it solves the above issue? Thanks.
> >
> > From 3aa03baa39b7e62ea079e826de6ed5aab3061e46 Mon Sep 17 00:00:00 2001
> > From: Zi Yan <[email protected]>
> > Date: Wed, 7 Jan 2026 16:49:52 -0500
> > Subject: [PATCH] mm/memremap: free device private folio fix
> > Content-Type: text/plain; charset="utf-8"
> >
> > Signed-off-by: Zi Yan <[email protected]>
> > ---
> > mm/memremap.c | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
> >
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 63c6ab4fdf08..483666ff7271 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -475,6 +475,21 @@ void free_zone_device_folio(struct folio *folio)
> >                  pgmap->ops->folio_free(folio);
> >                  break;
> >          }
> > +
> > +        if (nr > 1) {
> > +                struct page *head = folio_page(folio, 0);
> > +
> > +                head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > +                folio->_nr_pages = 0;
> > +#endif
> > +                for (i = 1; i < nr; i++) {
> > +                        (head + i)->mapping = NULL;
> > +                        clear_compound_head(head + i);
>
> I see that you're skipping the checks in free_tail_page_prepare()? IIUC, we
> should be able to invoke it even for zone device private pages.
>
> > +                }
> > +                folio->mapping = NULL;
>
> This is already done in free_zone_device_folio()
>
> > +                head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> I don't think this is required for zone device private folios, but I suppose
> it keeps the code generic.
>
Well, the above code doesn’t work, but I think it’s the right idea.
clear_compound_head() writes the compound_head field, which aliases pgmap in
struct page, and we don’t want pgmap ending up NULL. I believe the individual
pages likely need their flags cleared (?), and this step must be done before
calling folio_free() and include a barrier, as the page can be immediately
reallocated.

So here’s what I came up with, and it seems to work (for Xe, GPU SVM):

 mm/memremap.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/mm/memremap.c b/mm/memremap.c
index 63c6ab4fdf08..ac20abb6a441 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -448,6 +448,27 @@ void free_zone_device_folio(struct folio *folio)
             pgmap->type != MEMORY_DEVICE_GENERIC)
                 folio->mapping = NULL;
 
+        if (nr > 1) {
+                struct page *head = folio_page(folio, 0);
+
+                head[1].flags.f &= ~PAGE_FLAGS_SECOND;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+                folio->_nr_pages = 0;
+#endif
+                for (i = 1; i < nr; i++) {
+                        struct folio *new_folio = (struct folio *)(head + i);
+
+                        (head + i)->mapping = NULL;
+                        (head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+
+                        /* Overwrite compound_head with pgmap */
+                        new_folio->pgmap = pgmap;
+                }
+
+                head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+                smp_wmb(); /* Changes must be visible before freeing folio */
+        }
+
         switch (pgmap->type) {
         case MEMORY_DEVICE_PRIVATE:
         case MEMORY_DEVICE_COHERENT:
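
With that in place, the driver-side re-init path can sanity check that the
stale compound state is really gone before reusing a page at a smaller order.
Purely illustrative sketch (reinit_device_page() is a made-up helper, not
something we ship):

/* Illustrative only; page/new_order come from the driver's allocator. */
static void reinit_device_page(struct page *page, unsigned int new_order)
{
        /*
         * After the free path above tears down the compound metadata, a
         * reused device page must not still look like part of a large folio.
         */
        WARN_ON(PageCompound(page));
        zone_device_page_init(page, new_order);
}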
> > +        }
> >  }
> >
> > void zone_device_page_init(struct page *page, unsigned int order)
>
>
> Otherwise, it seems like the right way to solve the issue.
>
My question is: why isn’t Nouveau hitting this issue, or your Nvidia
out-of-tree driver? (Perhaps a lack of test coverage; Xe's test suite is
quite good at finding corner cases.)

Also, will this change in behavior break either of those drivers?
Matt
> Balbir