On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > > I suppose we could be getting say an order-9 folio that was
> > > previously used as two order-8 folios? And each of them had their
> > > _nr_pages in their head
> > 
> > Yes, this is a good example. At this point we have no idea what the
> > previous allocation(s) order(s) were - there could be multiple places in
> > the loop where _nr_pages is populated, thus we have to clear it
> > everywhere.
> 
> Why? The fact you have to use such a crazy expression to even access
> _nr_pages strongly says nothing will read it as _nr_pages.
> 
> Explain each thing:
> 
>               new_page->flags.f &= ~0xffUL;   /* Clear possible order, page head */
> 
> OK, the tail page flags need to be set right, and prep_compound_page()
> called later depends on them being zero.
> 
>               ((struct folio *)(new_page - 1))->_nr_pages = 0;
> 
> Can't see a reason, nothing reads _nr_pages from a random tail
> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> memcg_data, which is also not supposed to be read from a tail page?
> 
>               new_folio->mapping = NULL;
> 
> Pointless, prep_compound_page() -> prep_compound_tail() ->
> p->mapping = TAIL_MAPPING;
> 
>               new_folio->pgmap = pgmap;       /* Also clear compound head */
> 
> Pointless, compound_head is set in prep_compound_tail():
> set_compound_head(p, head);
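> 
> For reference, prep_compound_tail() is roughly the below (quoting from
> memory, so double check the tree), so the NULL ->mapping store and the
> "also clear compound head" part just get overwritten a moment later:
> 
>       static inline void prep_compound_tail(struct page *head, int tail_idx)
>       {
>               struct page *p = head + tail_idx;
> 
>               /* every tail page gets TAIL_MAPPING and a compound_head link */
>               p->mapping = TAIL_MAPPING;
>               set_compound_head(p, head);
>               set_page_private(p, 0);
>       }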
> 
>               new_folio->share = 0;   /* fsdax only, unused for device private */
> 
> Not sure, certainly share isn't read from a tail page..
> 
> > > > Why can't this use the normal helpers, like memmap_init_compound()?
> > > > 
> > > >  struct folio *new_folio = page
> > > > 
> > > >  /* First 4 tail pages are part of struct folio */
> > > >  for (i = 4; i < (1UL << order); i++) {
> > > >      prep_compound_tail(..)
> > > >  }
> > > > 
> > > >  prep_compound_head(page, order)
> > > >  new_folio->_nr_pages = 0
> > > > 
> > > > ??
> > 
> > I've beaten this to death with Alistair; the normal helpers do not work here.
> 
> What do you mean? It already calls prep_compound_page()! The issue
> seems to be that prep_compound_page() makes assumptions about what
> values are in flags already?
> 
> So how about moving that page flags mask logic into
> prep_compound_tail()? I think that would help Vlastimil's
> concern. That function is already touching most of the cache line so
> an extra word shouldn't make a performance difference.
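> 
> i.e. add the clearing at the top of that helper, something like
> (untested, the mask value is just copied from the driver snippet above):
> 
>               /* wipe stale order/head bits left by a previous user */
>               p->flags.f &= ~0xffUL;
> 
> ahead of the p->mapping = TAIL_MAPPING store.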
> 
> > An order zero allocation could have _nr_pages set in its page,
> > new_folio->_nr_pages is page + 1 memory.
> 
> An order zero allocation does not have _nr_pages because it is in
> page + 1 memory that doesn't exist.
> 
> An order zero allocation might have memcg_data in the same slot, does
> it need zeroing? If so why not add that to prep_compound_head() ?
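> 
> Something like (untested, and assuming memcg_data really can just be
> zeroed for a freshly prepared device folio):
> 
> #ifdef CONFIG_MEMCG
>       folio->memcg_data = 0;
> #endif
> 
> at the start of prep_compound_head(), rather than open coding it in the
> driver.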
> 
> Also, prep_compound_head() handles order 0 too:
> 
>       if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
>               atomic_set(&folio->_pincount, 0);
>               atomic_set(&folio->_entire_mapcount, -1);
>       }
>       if (order > 1)
>               INIT_LIST_HEAD(&folio->_deferred_list);
> 
> So some of the problem here looks to be not calling it:
> 
>       if (order)
>               prep_compound_page(page, order);
> 
> So, remove that if ? Also shouldn't it be moved above the
> set_page_count/lock_page ?
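> 
> i.e. in the caller, something like (untested, guessing at the
> surrounding lines from the snippets in this thread):
> 
>       prep_compound_page(page, order);        /* or keep the if (order)? */
>       set_page_count(page, 1);
>       lock_page(page);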
> 

I'm not going to address each comment individually; some might be valid,
others are not.

Ok, can I rework this in a follow-up? I will commit to that. Anything
we touch here is extremely sensitive to failures, and as far as I can
tell Intel provides the primary test coverage for any modification to
device pages.

The fact is that large device pages do not really work without this
patch or its prior revisions. I've spent a lot of time getting large
device pages stable, both here and in the initial series, and I'm
committing to help in follow-on series touching SVM-related things.

I'm going to miss my merge window for large device pages with this
(RB'd) patch blocked. If that happens, expect my commitment to helping
other vendors to drop; I'll probably just say: that doesn't work in my
CI, try again.

Or perhaps we just revert large device pages in 6.19 if we can't reach
consensus here, since we shouldn't ship a non-functional kernel.

Matt

> Jason
