On 1/8/26 13:42, Matthew Brost wrote:
> On Thu, Jan 08, 2026 at 02:14:28PM +1100, Alistair Popple wrote:
>> On 2026-01-08 at 13:53 +1100, Zi Yan <[email protected]> wrote...
>>> On 7 Jan 2026, at 21:17, Matthew Brost wrote:
>>>
>>>> On Thu, Jan 08, 2026 at 11:56:03AM +1100, Balbir Singh wrote:
>>>>> On 1/8/26 08:03, Zi Yan wrote:
>>>>>> On 7 Jan 2026, at 16:15, Matthew Brost wrote:
>>>>>>
>>>>>>> On Wed, Jan 07, 2026 at 03:38:35PM -0500, Zi Yan wrote:
>>>>>>>> On 7 Jan 2026, at 15:20, Zi Yan wrote:
>>>>>>>>
>>>>>>>>> +THP folks
>>>>>>>>
>>>>>>>> +willy, since he commented in another thread.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16 Dec 2025, at 15:10, Francois Dugast wrote:
>>>>>>>>>
>>>>>>>>>> From: Matthew Brost <[email protected]>
>>>>>>>>>>
>>>>>>>>>> Introduce migrate_device_split_page() to split a device page into
>>>>>>>>>> lower-order pages. Used when a folio allocated as higher-order is
>>>>>>>>>> freed and later reallocated at a smaller order by the driver memory
>>>>>>>>>> manager.
>>>>>>>>>>
>>>>>>>>>> Cc: Andrew Morton <[email protected]>
>>>>>>>>>> Cc: Balbir Singh <[email protected]>
>>>>>>>>>> Cc: [email protected]
>>>>>>>>>> Cc: [email protected]
>>>>>>>>>> Signed-off-by: Matthew Brost <[email protected]>
>>>>>>>>>> Signed-off-by: Francois Dugast <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>  include/linux/huge_mm.h |  3 +++
>>>>>>>>>>  include/linux/migrate.h |  1 +
>>>>>>>>>>  mm/huge_memory.c        |  6 ++---
>>>>>>>>>>  mm/migrate_device.c     | 49 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>  4 files changed, 56 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>>>> index a4d9f964dfde..6ad8f359bc0d 100644
>>>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>>>> @@ -374,6 +374,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
>>>>>>>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>>>>>>>>>  unsigned int min_order_for_split(struct folio *folio);
>>>>>>>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>>>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>>>> +		struct page *split_at, struct xa_state *xas,
>>>>>>>>>> +		struct address_space *mapping, enum split_type split_type);
>>>>>>>>>>  int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		enum split_type split_type);
>>>>>>>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>>>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>>>>>>> index 26ca00c325d9..ec65e4fd5f88 100644
>>>>>>>>>> --- a/include/linux/migrate.h
>>>>>>>>>> +++ b/include/linux/migrate.h
>>>>>>>>>> @@ -192,6 +192,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
>>>>>>>>>>  			unsigned long npages);
>>>>>>>>>>  void migrate_device_finalize(unsigned long *src_pfns,
>>>>>>>>>>  			unsigned long *dst_pfns, unsigned long npages);
>>>>>>>>>> +int migrate_device_split_page(struct page *page);
>>>>>>>>>>
>>>>>>>>>>  #endif /* CONFIG_MIGRATION */
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>>>> index 40cf59301c21..7ded35a3ecec 100644
>>>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>>>> @@ -3621,9 +3621,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>>>>>>>>  * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
>>>>>>>>>>  * split but not to @new_order, the caller needs to check)
>>>>>>>>>>  */
>>>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>>>>> -		struct address_space *mapping, enum split_type split_type)
>>>>>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>>>> +		struct page *split_at, struct xa_state *xas,
>>>>>>>>>> +		struct address_space *mapping, enum split_type split_type)
>>>>>>>>>>  {
>>>>>>>>>>  	const bool is_anon = folio_test_anon(folio);
>>>>>>>>>>  	int old_order = folio_order(folio);
>>>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>>>> index 23379663b1e1..eb0f0e938947 100644
>>>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>>>> @@ -775,6 +775,49 @@ int migrate_vma_setup(struct migrate_vma *args)
>>>>>>>>>>  EXPORT_SYMBOL(migrate_vma_setup);
>>>>>>>>>>
>>>>>>>>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>>>>>>> +/**
>>>>>>>>>> + * migrate_device_split_page() - Split device page
>>>>>>>>>> + * @page: Device page to split
>>>>>>>>>> + *
>>>>>>>>>> + * Splits a device page into smaller pages. Typically called when reallocating a
>>>>>>>>>> + * folio to a smaller size. Inherently racy—only safe if the caller ensures
>>>>>>>>>> + * mutual exclusion within the page's folio (i.e., no other threads are using
>>>>>>>>>> + * pages within the folio). Expected to be called on a free device page;
>>>>>>>>>> + * restores all split-out pages to a free state.
>>>>>>>>>> + */
>>>>>>>>
>>>>>>>> Do you mind explaining why __split_unmapped_folio() is needed for a free device
>>>>>>>> page? A free page is not supposed to be a large folio, at least from a core
>>>>>>>> MM point of view. __split_unmapped_folio() is intended to work on large folios
>>>>>>>> (or compound pages), even if the input folio has refcount == 0 (because it is
>>>>>>>> frozen).
>>>>>>>>
>>>>>>>
>>>>>>> Well, then maybe this is a bug in core MM where the freed page is still
>>>>>>> a THP. Let me explain the scenario and why this is needed from my POV.
>>>>>>>
>>>>>>> Our VRAM allocator in Xe (and several other DRM drivers) is DRM buddy.
>>>>>>> This is a shared pool between traditional DRM GEMs (buffer objects) and
>>>>>>> SVM allocations (pages). It doesn’t have any view of the page backing—it
>>>>>>> basically just hands back a pointer to VRAM space that we allocate from.
>>>>>>> From that, if it’s an SVM allocation, we can derive the device pages.
>>>>>>>
>>>>>>> What I see happening is: a 2M buddy allocation occurs, we make the
>>>>>>> backing device pages a large folio, and sometime later the folio
>>>>>>> refcount goes to zero and we free the buddy allocation. Later, the buddy
>>>>>>> allocation is reused for a smaller allocation (e.g., 4K or 64K), but the
>>>>>>> backing pages are still a large folio. Here is where we need to split
>>>>>>
>>>>>> I agree with you that it might be a bug in free_zone_device_folio() based
>>>>>> on my understanding, since zone_device_page_init() calls prep_compound_page()
>>>>>> for >0 orders, but free_zone_device_folio() never reverses the process.
>>>>>>
>>>>>> Balbir and Alistair might be able to help here.
>>
>> Just catching up after the Christmas break.
>>
>
> I think everyone is, and scrambling for the release PR. :)
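To make the reuse scenario above concrete, here is roughly how I would expect a driver to use the proposed migrate_device_split_page() when a buddy block that used to back a 2M folio is handed out again at a smaller size. This is only an illustrative sketch against the API quoted above, not code from the series, and the helper name is made up:

#include <linux/err.h>
#include <linux/migrate.h>
#include <linux/page-flags.h>

/*
 * Illustrative sketch: prepare a device page for reuse after the VRAM
 * block backing it was freed and then reallocated at a smaller size.
 * my_reuse_device_page() is a made-up name for wherever the driver's
 * memory manager hands device pages back out.
 */
static struct page *my_reuse_device_page(struct page *page)
{
	int err;

	/* Still part of a large folio left over from a previous 2M user? */
	if (PageCompound(page)) {
		err = migrate_device_split_page(page);
		if (err)
			return ERR_PTR(err);
	}

	return page;
}

Of course, if the free_zone_device_folio() fix discussed below lands, splitting at reallocation time may become unnecessary.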
>
>>>>>
>>>>> I agree it's an API limitation
>>>
>>> I am not sure. If free_zone_device_folio() does not get rid of large folio
>>> metadata, there is no guarantee that a freed large device private folio will
>>> be reallocated as a large device private folio. And when mTHP support is
>>> added, the folio order might change too. That can cause issues when
>>> compound_head() is called on a tail page of a previously large folio, since
>>> compound_head() will return the old head page instead of the tail page
>>> itself.
>>
>> I agree that freeing the device folio should get rid of the large folio. That
>> would also keep it consistent with what we do for FS DAX for example.
>>
>
> +1
>
>>>>>
>>>>>>
>>>>>> I cherry picked the code from __free_frozen_pages() to reverse the
>>>>>> process. Can you give it a try to see if it solves the above issue? Thanks.
>>
>> It would be nice if this could be a common helper for freeing compound
>> ZONE_DEVICE pages. FS DAX already has this for example:
>>
>> static inline unsigned long dax_folio_put(struct folio *folio)
>> {
>> 	unsigned long ref;
>> 	int order, i;
>>
>> 	if (!dax_folio_is_shared(folio))
>> 		ref = 0;
>> 	else
>> 		ref = --folio->share;
>>
>> 	if (ref)
>> 		return ref;
>>
>> 	folio->mapping = NULL;
>> 	order = folio_order(folio);
>> 	if (!order)
>> 		return 0;
>> 	folio_reset_order(folio);
>>
>> 	for (i = 0; i < (1UL << order); i++) {
>> 		struct dev_pagemap *pgmap = page_pgmap(&folio->page);
>> 		struct page *page = folio_page(folio, i);
>> 		struct folio *new_folio = (struct folio *)page;
>>
>> 		ClearPageHead(page);
>> 		clear_compound_head(page);
>>
>> 		new_folio->mapping = NULL;
>> 		/*
>> 		 * Reset pgmap which was over-written by
>> 		 * prep_compound_page().
>> 		 */
>> 		new_folio->pgmap = pgmap;
>> 		new_folio->share = 0;
>> 		WARN_ON_ONCE(folio_ref_count(new_folio));
>> 	}
>>
>> 	return ref;
>> }
>>
>> Aside from the weird refcount checks that FS DAX needs to do at the start of
>> this function, I don't think there is anything specific to DEVICE_PRIVATE pages
>> there.
>>
>
> Thanks for the reference, Alistair. This looks roughly like what I
> hacked together in an effort to just get something working. I believe a
> common helper can be made to work. Let me churn on this tomorrow and put
> together a proper patch.
>
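For what it is worth, here is roughly the shape I would expect such a common helper to take: essentially the dax_folio_put() loop above with the FS DAX share handling dropped, meant to live in mm/memremap.c and be called from free_zone_device_folio(). The name is just a placeholder and this is untested, so the real patch may well look different:

/*
 * Untested sketch of a common helper for mm/memremap.c; placeholder name.
 * Tears down the compound/large-folio state of a ZONE_DEVICE folio being
 * freed so its pages can later be reused at a different order.
 */
static void zone_device_folio_reset_order(struct folio *folio)
{
	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
	unsigned int order = folio_order(folio);
	unsigned long i;

	if (!order)
		return;

	folio_reset_order(folio);

	for (i = 0; i < (1UL << order); i++) {
		struct page *page = folio_page(folio, i);
		struct folio *new_folio = (struct folio *)page;

		ClearPageHead(page);
		clear_compound_head(page);

		new_folio->mapping = NULL;
		/* clear_compound_head() zeroed the word that aliases ->pgmap. */
		new_folio->pgmap = pgmap;
		WARN_ON_ONCE(folio_ref_count(new_folio));
	}
}

Whatever flag clearing and write barrier turn out to be needed (see Matt's snippet below) would slot into the same loop.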
>>>>>>
>>>>>> From 3aa03baa39b7e62ea079e826de6ed5aab3061e46 Mon Sep 17 00:00:00 2001
>>>>>> From: Zi Yan <[email protected]>
>>>>>> Date: Wed, 7 Jan 2026 16:49:52 -0500
>>>>>> Subject: [PATCH] mm/memremap: free device private folio fix
>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>
>>>>>> Signed-off-by: Zi Yan <[email protected]>
>>>>>> ---
>>>>>>  mm/memremap.c | 15 +++++++++++++++
>>>>>>  1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>> index 63c6ab4fdf08..483666ff7271 100644
>>>>>> --- a/mm/memremap.c
>>>>>> +++ b/mm/memremap.c
>>>>>> @@ -475,6 +475,21 @@ void free_zone_device_folio(struct folio *folio)
>>>>>>  		pgmap->ops->folio_free(folio);
>>>>>>  		break;
>>>>>>  	}
>>>>>> +
>>>>>> +	if (nr > 1) {
>>>>>> +		struct page *head = folio_page(folio, 0);
>>>>>> +
>>>>>> +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
>>>>>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>>>>>> +		folio->_nr_pages = 0;
>>>>>> +#endif
>>>>>> +		for (i = 1; i < nr; i++) {
>>>>>> +			(head + i)->mapping = NULL;
>>>>>> +			clear_compound_head(head + i);
>>>>>
>>>>> I see that you're skipping the checks in free_tail_page_prepare()? IIUC, we
>>>>> should be able to invoke it even for zone device private pages
>>>
>>> I am not sure which parts of the compound page are also used in a device
>>> private folio. Yes, it is better to add the right checks.
>>>
>>>>>
>>>>>> +		}
>>>>>> +		folio->mapping = NULL;
>>>>>
>>>>> This is already done in free_zone_device_folio()
>>>>>
>>>>>> +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>
>>>>> I don't think this is required for zone device private folios, but I
>>>>> suppose it keeps the code generic
>>>>>
>>>>
>>>> Well, the above code doesn’t work, but I think it’s the right idea.
>>>> clear_compound_head aliases to pgmap, which we don’t want to be NULL. I
>>>
>>> Thank you for pointing it out. I am not familiar with device private page
>>> code.
>>>
>>>> believe the individual pages likely need their flags cleared (?), and
>>>
>>> Yes, I missed the tail page flag clearing part.
>>>
>
> I think the page head is the only thing that really needs to be cleared,
> though I could be wrong.
>
>>>> this step must be done before calling folio_free and include a barrier,
>>>> as the page can be immediately reallocated.
>>>>
>>>> So here’s what I came up with, and it seems to work (for Xe, GPU SVM):
>>>>
>>>>  mm/memremap.c | 21 +++++++++++++++++++++
>>>>  1 file changed, 21 insertions(+)
>>>>
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index 63c6ab4fdf08..ac20abb6a441 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -448,6 +448,27 @@ void free_zone_device_folio(struct folio *folio)
>>>>  	    pgmap->type != MEMORY_DEVICE_GENERIC)
>>>>  		folio->mapping = NULL;
>>>>
>>>> +	if (nr > 1) {
>>>> +		struct page *head = folio_page(folio, 0);
>>>> +
>>>> +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
>>>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>>>> +		folio->_nr_pages = 0;
>>>> +#endif
>>>> +		for (i = 1; i < nr; i++) {
>>>> +			struct folio *new_folio = (struct folio *)(head + i);
>>>> +
>>>> +			(head + i)->mapping = NULL;
>>>> +			(head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>> +
>>>> +			/* Overwrite compound_head with pgmap */
>>>> +			new_folio->pgmap = pgmap;
>>>> +		}
>>>> +
>>>> +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>> +		smp_wmb(); /* Changes must be visible before freeing folio */
>>>> +	}
>>>> +
>>>>  	switch (pgmap->type) {
>>>>  	case MEMORY_DEVICE_PRIVATE:
>>>>  	case MEMORY_DEVICE_COHERENT:
>>>>
>>>
>>> It looks good to me, but I am very likely missing details on device private
>>> pages. Like Balbir pointed out above, for tail pages, calling
>>> free_tail_page_prepare() might be better to get the same sanity checks as a
>>> normal large folio, although you will need to set ->pgmap after it.
>>>
>>> It is better to send it as a proper patch and get reviews from other
>>> MM folks.
>>>
>
> Yes, agreed. See above—I’ll work on a proper patch tomorrow and CC all
> the correct MM folks. Aiming to have something ready for the next
> release PR.
>
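On the free_tail_page_prepare() suggestion: as far as I can tell that helper is static in mm/page_alloc.c, so reusing it here would mean exporting it or open-coding a subset of its checks. Very roughly, the kind of per-tail check being asked for looks like this (helper name made up, and the real function verifies considerably more state):

/*
 * Rough, illustrative subset of the sanity checks free_tail_page_prepare()
 * applies to normal compound pages; name and scope are made up here.
 */
static void check_device_tail_page_free(struct page *page)
{
	/* Nothing should still hold a reference to a page being torn down. */
	VM_WARN_ON_ONCE(page_ref_count(page));
	/* A stale ->mapping must not leak into the page's next user. */
	VM_WARN_ON_ONCE(page->mapping);
}

If checks like these end up in the common helper, they would go just before the ->pgmap restore in the loop, as Zi Yan notes.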
Yes, please!

Thanks,
Balbir
