On Mon, Jun 1, 2026 at 2:11 AM David Hildenbrand (Arm) <[email protected]> wrote:
>
> On 5/22/26 17:00, Nico Pache wrote:
>
> Finally time for the core piece :)
*music intensifies* :p
>
> > Enable khugepaged to collapse to mTHP orders. This patch implements the
> > main scanning logic using a bitmap to track occupied pages and a stack
> > structure that allows us to find optimal collapse sizes.
> >
> > Previous to this patch, PMD collapse had 3 main phases, a light weight
> > scanning phase (mmap_read_lock) that determines a potential PMD
> > collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> > phase (mmap_write_lock).
> >
> > To enabled mTHP collapse we make the following changes:
> >
> > During PMD scan phase, track occupied pages in a bitmap. When mTHP
> > orders are enabled, we remove the restriction of max_ptes_none during the
> > scan phase to avoid missing potential mTHP collapse candidates. Once we
> > have scanned the full PMD range and updated the bitmap to track occupied
> > pages, we use the bitmap to find the optimal mTHP size.
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack structure
> > is used instead of traditional recursion to manage the search. This also
> > prevents a traditional recursive approach when the kernel stack struct is
> > limited. The algorithm recursively splits the bitmap into smaller chunks to
> > find the highest order mTHPs that satisfy the collapse criteria. We start
> > by attempting the PMD order, then moved on the consecutively lower orders
> > (mTHP collapse). The stack maintains a pair of variables (offset, order),
> > indicating the number of PTEs from the start of the PMD, and the order of
> > the potential collapse candidate.
> >
> > The algorithm for consuming the bitmap works as such:
> > 1) push (0, HPAGE_PMD_ORDER) onto the stack
> > 2) pop the stack
> > 3) check if the number of set bits in that (offset,order) pair
> > statisfy the max_ptes_none threshold for that order
> > 4) if yes, attempt collapse
> > 5) if no (or collapse fails), push two new stack items representing
> > the left and right halves of the current bitmap range, at the
> > next lower order
> > 6) repeat at step (2) until stack is empty.
> >
> > Below is a diagram representing the algorithm and stack items:
> >
> > offset mid_offset
> > | |
> > | |
> > v v
> > ____________________________________
> > | PTE Page Table |
> > --------------------------------------
> > <-------><------->
> > order-1 order-1
>
>
> Reading this, it is unclear why exactly do we need the stack.
So I looked into your items below. It seems logical, and I think it
works the same way; however, your method seems slightly harder to
understand due to all the edge cases and more error-prone to future
changes (the stack holds implicit knowledge of the offset/order that
must now be tracked in the edge cases).
Given the stack is 24 bytes, I'm not sure if the extra complexity is
worth saving that small amount of memory. Although we would also be
getting rid of (3?) functions, so both approaches have pros and cons.
I will implement a patch comparing your solution against mine and send
it here, then we can decide which approach is better.
>
> Why can't you work with offset + cur_order?
>
> Initially,
>
> offset = 0;
> cur_order = HPAGE_PMD_ORDER;
>
> If collapse succeeded, advance to next range.
> If collapse failed, try next smaller order, keeping offset unchanged.
>
> if (failed && cur_order > KHUGEPAGED_MIN_MTHP_ORDER) {
> /* Try next smaller order. */
> cur_order = cur_order - 1;
> } else {
> /* Skip to next chunk. */
> offset += 1 << cur_order;
> cur_order = max_order_from_offset(offset);
> }
>
> Of course, handling disabled orders. max_order_from_offset() is rather trivial
> (natural buddy order, capped at HPAGE_PMD_ORDER).
>
> What's the benefit of the stack?
>
> >
> > mTHP collapses reject regions containing swapped out or shared pages.
> > This is because adding new entries can lead to new none pages, and these
> > may lead to constant promotion into a higher order mTHP. A similar
> > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > introducing at least 2x the number of pages, and on a future scan will
> > satisfy the promotion condition once again. This issue is prevented via
> > the collapse_max_ptes_none() function which imposes the max_ptes_none
> > restrictions above.
> >
> > We currently only support mTHP collapse for max_ptes_none values of 0
> > and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >
> > - max_ptes_none=0: Never introduce new empty pages during collapse
> > - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> > available mTHP order
> >
> > Any other max_ptes_none value will emit a warning and default mTHP
> > collapse to max_ptes_none=0. There should be no behavior change for PMD
> > collapse.
> >
> > Once we determine what mTHP sizes fits best in that PMD range a collapse
> > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
> >
> > We can also remove the check for is_khugepaged inside the PMD scan as
> > the collapse_max_ptes_none() function handles this logic now.
> >
> > Signed-off-by: Nico Pache <[email protected]>
> > ---
> > mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 172 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 64ceebc9d8a7..d3d7db8be26c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash,
> > MM_SLOTS_HASH_BITS);
> >
> > static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +#define KHUGEPAGED_MIN_MTHP_ORDER 2
> > +/*
> > + * mthp_collapse() does an iterative DFS over a binary tree, from
> > + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> > + * size needed for a DFS on a binary tree is height + 1, where
> > + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> > + *
> > + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> > + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>
> I was confused there for a second why you mention ilog2, when it's really "We
> cannot use HPAGE_PMD_ORDER.".
>
> Best to simplify to:
>
> "Note that we cannot use HPAGE_PMD_ORDER, because it is variable on some
> architectures".
Ok thank you i can clear that up.
>
> > + */
> > +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) -
> > KHUGEPAGED_MIN_MTHP_ORDER + 1)
> > +
> > +/*
> > + * Defines a range of PTE entries in a PTE page table which are being
> > + * considered for mTHP collapse.
> > + *
> > + * @offset: the offset of the first PTE entry in a PMD range.
> > + * @order: the order of the PTE entries being considered for collapse.
> > + */
> > +struct mthp_range {
> > + u16 offset;
> > + u8 order;
> > +};
> > +
> > struct collapse_control {
> > bool is_khugepaged;
> >
> > @@ -110,6 +134,12 @@ struct collapse_control {
> >
> > /* nodemask for allocation fallback */
> > nodemask_t alloc_nmask;
> > +
> > + /* Each bit represents a single occupied (!none/zero) page. */
> > + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>
> This should just be called something like "present_ptes"
yeah not a bad idea.
>
> > + /* A mask of the current range being considered for mTHP collapse. */
> > + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>
> This is really just a temporary bitmap used for collapse_mthp_count_present()
> only. Either rename it, or better, avoid it completely.
yeah when i first started this we didnt have bitmap_weight_from()
thanks for the pointer to that, I no longer need a temp bitmap.
>
> > };
> >
> > /**
> > @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct
> > mm_struct *mm, unsigned long s
> > return result;
> > }
> >
> > +static void collapse_mthp_stack_push(struct collapse_control *cc, int
> > *stack_size,
> > + u16 offset, u8 order)
> > +{
> > + const int size = *stack_size;
> > + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> > +
> > + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> > + stack->order = order;
> > + stack->offset = offset;
> > + (*stack_size)++;
> > +}
> > +
> > +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control
> > *cc,
> > + int *stack_size)
> > +{
> > + const int size = *stack_size;
> > +
> > + VM_WARN_ON_ONCE(size <= 0);
> > + (*stack_size)--;
> > + return cc->mthp_bitmap_stack[size - 1];
> > +}
> > +
> > +static unsigned int collapse_mthp_count_present(struct collapse_control
> > *cc,
> > + u16 offset, unsigned int
> > nr_ptes)
> > +{
> > + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> > + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> > + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask,
> > MAX_PTRS_PER_PTE);
>
> You really just want to count the number of set bits? You don't need a
> temporary
> bitmap for that.
>
> Assume you want to check an order-2 (4 bits), bitmap_weight_and() would check
> all bits ...
>
> I'd suggest starting simple here, and avoiding the temporary bitmap.
>
> Can we simply use bitmap_weight_from(cc->mthp_bitmap, offset, nr_ptes)?
Yes! Thank you :)
>
> > +}
> > +
> > +/*
> > + * mthp_collapse() consumes the bitmap that is generated during
> > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > + *
> > + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero)
> > page.
> > + * A stack structure cc->mthp_bitmap_stack is used to check different
> > regions
> > + * of the bitmap for collapse eligibility. The stack maintains a pair of
> > + * variables (offset, order), indicating the number of PTEs from the start
> > of
> > + * the PMD, and the order of the potential collapse candidate
> > respectively. We
> > + * start at the PMD order and check if it is eligible for collapse; if
> > not, we
> > + * add two entries to the stack at a lower order to represent the left and
> > right
> > + * halves of the PTE page table we are examining.
> > + *
> > + * offset mid_offset
> > + * | |
> > + * | |
> > + * v v
> > + * --------------------------------------
> > + * | cc->mthp_bitmap |
> > + * --------------------------------------
> > + * <-------><------->
> > + * order-1 order-1
> > + *
> > + * For each of these, we determine how many PTE entries are occupied in the
> > + * range of PTE entries we propose to collapse, then we compare this to a
> > + * threshold number of PTE entries which would need to be occupied for a
> > + * collapse to be permitted at that order (accounting for max_ptes_none).
> > + *
> > + * If a collapse is permitted, we attempt to collapse the PTE range into a
> > + * mTHP.
> > + */
> > +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long address, int referenced, int unmapped,
> > + struct collapse_control *cc, unsigned long enabled_orders)
> > +{
> > + unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > + int collapsed = 0, stack_size = 0;
> > + unsigned long collapse_address;
> > + struct mthp_range range;
> > + u16 offset;
> > + u8 order;
> > +
> > + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> > +
> > + while (stack_size) {
> > + range = collapse_mthp_stack_pop(cc, &stack_size);
> > + order = range.order;
> > + offset = range.offset;
> > + nr_ptes = 1UL << order;
> > +
> > + if (!test_bit(order, &enabled_orders))
> > + goto next_order;
> > +
> > + max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> > +
> > + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > + nr_ptes);
> > +
> > + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> > + int ret;
> > +
> > + collapse_address = address + offset * PAGE_SIZE;
> > + ret = collapse_huge_page(mm, collapse_address,
> > referenced,
> > + unmapped, cc, order);
> > + if (ret == SCAN_SUCCEED) {
> > + collapsed += nr_ptes;
> > + continue;
> > + }
> > + }
> > +
> > +next_order:
> > + if ((BIT(order) - 1) & enabled_orders) {
> > + const u8 next_order = order - 1;
> > + const u16 mid_offset = offset + (nr_ptes / 2);
> > +
> > + collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> > + next_order);
> > + collapse_mthp_stack_push(cc, &stack_size, offset,
> > + next_order);
> > + }
> > + }
> > + return collapsed;
> > +}
> > +
> > static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > struct vm_area_struct *vma, unsigned long start_addr,
> > bool *lock_dropped, struct collapse_control *cc)
> > {
> > - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma,
> > HPAGE_PMD_ORDER);
> > const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc,
> > HPAGE_PMD_ORDER);
> > const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc,
> > HPAGE_PMD_ORDER);
> > + unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma,
> > HPAGE_PMD_ORDER);
> > + enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED :
> > TVA_FORCED_COLLAPSE;
> > pmd_t *pmd;
> > - pte_t *pte, *_pte;
> > - int none_or_zero = 0, shared = 0, referenced = 0;
> > + pte_t *pte, *_pte, pteval;
> > + int i;
> > + int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> > enum scan_result result = SCAN_FAIL;
> > struct page *page = NULL;
> > struct folio *folio = NULL;
> > unsigned long addr;
> > + unsigned long enabled_orders;
> > spinlock_t *ptl;
> > int node = NUMA_NO_NODE, unmapped = 0;
> >
> > @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct
> > mm_struct *mm,
> > goto out;
> > }
> >
> > + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> > memset(cc->node_load, 0, sizeof(cc->node_load));
> > nodes_clear(cc->alloc_nmask);
> > +
> > + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags,
> > tva_flags);
> > +
> > + /*
> > + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > + * scan all pages to populate the bitmap for mTHP collapse.
> > + */
>
> You should note here, that we re-verify in mthp_collapse().
>
> But the question is, whether we should relocate the check completely into
> mthp_collapse(), instead of conditionally duplicating it.
>
> What speaks against always populating the bitmap and making the decision in
> mthp_collapse()?
>
> Sure, we might scan a page table a bit longer, but the code gets clearer ...
> and
> I am not sure if scanning some more page table entries is really that
> critical here.
Someone asked me to preserve the legacy behavior (PMD only). Although
rather trivial, if you set max_ptes_none=0 for example, we'd still
have to do 511 iterations for no reason if PMD collapse is the only
enabled order rather than bailing immediately.
I'm ok with dropping it, but I think its the correct approach (despite
the extra complexity). @Usama Arif brought up this point here
https://lore.kernel.org/all/[email protected]/
>
>
> > + if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> > + max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> > +
> > pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> > if (!pte) {
> > cc->progress++;
> > @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct
> > mm_struct *mm,
> > goto out;
> > }
> >
> > - for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > - _pte++, addr += PAGE_SIZE) {
> > + for (i = 0; i < HPAGE_PMD_NR; i++) {
> > + _pte = pte + i;
> > + addr = start_addr + i * PAGE_SIZE;
> > + pteval = ptep_get(_pte);
> > +
> > cc->progress++;
> >
> > - pte_t pteval = ptep_get(_pte);
> > if (pte_none_or_zero(pteval)) {
> > if (++none_or_zero > max_ptes_none) {
> > result = SCAN_EXCEED_NONE_PTE;
> > @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct
> > mm_struct *mm,
> > }
> > }
> >
> > + /* Set bit for occupied pages */
> > + __set_bit(i, cc->mthp_bitmap);
> > /*
> > * Record which node the original page is from and save this
> > * information to cc->node_load[].
> > @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct
> > mm_struct *mm,
> > if (result == SCAN_SUCCEED) {
> > /* collapse_huge_page expects the lock to be dropped before
> > calling */
> > mmap_read_unlock(mm);
> > - result = collapse_huge_page(mm, start_addr, referenced,
> > - unmapped, cc, HPAGE_PMD_ORDER);
> > - /* collapse_huge_page will return with the mmap_lock released
> > */
> > + nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> > + unmapped, cc, enabled_orders);
> > + /* mmap_lock was released above, set lock_dropped */
> > *lock_dropped = true;
> > + result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> As Lance says, this error handling likely needs some thought.
Yes I agree, I'm working on that now. I had a WIP patch for this some
time ago, but I never gave it the full thought it needed, and it fell
into the abyss.
Thanks :)
-- Nico
>
> --
> Cheers,
>
> David
>