On Tue, Jun 2, 2026 at 11:22 AM Nico Pache <[email protected]> wrote:
>
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity 
> > compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok heres what the diff looks like on top of my V19.
>
> you can access the tree here 
> https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier 
> review.
>
> So far I have no problem with this approach it appeared cleaner than i 
> thought. Did some light testing. Gonna throw it more through the ringer 
> tomorrow.

not sure why this didnt send with the proper encoding I guess my email
is still a little screwed up

>
>
> From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> From: Nico Pache <[email protected]>
> Date: Tue, 2 Jun 2026 10:26:18 -0600
> Subject: [PATCH] convert from stack to forward loop
>
> Signed-off-by: Nico Pache <[email protected]>
> ---
>  mm/khugepaged.c | 96 ++++++++-----------------------------------------
>  1 file changed, 15 insertions(+), 81 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 498eba009751..6de935e76ceb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, 
> MM_SLOTS_HASH_BITS);
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
>  #define KHUGEPAGED_MIN_MTHP_ORDER      2
> -/*
> - * mthp_collapse() does an iterative DFS over a binary tree, from
> - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> - * size needed for a DFS on a binary tree is height + 1, where
> - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> - *
> - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> - */
> -#define MTHP_STACK_SIZE        (ilog2(MAX_PTRS_PER_PTE) - 
> KHUGEPAGED_MIN_MTHP_ORDER + 1)
> -
> -/*
> - * Defines a range of PTE entries in a PTE page table which are being
> - * considered for mTHP collapse.
> - *
> - * @offset: the offset of the first PTE entry in a PMD range.
> - * @order: the order of the PTE entries being considered for collapse.
> - */
> -struct mthp_range {
> -       u16 offset;
> -       u8 order;
> -};
>
>  struct collapse_control {
>         bool is_khugepaged;
> @@ -137,7 +115,6 @@ struct collapse_control {
>
>         /* Each bit represents a single occupied (!none/zero) page. */
>         DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> -       struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>  };
>
>  /**
> @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct 
> mm_struct *mm, unsigned long s
>         return result;
>  }
>
> -static void collapse_mthp_stack_push(struct collapse_control *cc, int 
> *stack_size,
> -                                    u16 offset, u8 order)
> -{
> -       const int size = *stack_size;
> -       struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> -
> -       VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> -       stack->order = order;
> -       stack->offset = offset;
> -       (*stack_size)++;
> -}
> -
> -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> -                                                int *stack_size)
> -{
> -       const int size = *stack_size;
> -
> -       VM_WARN_ON_ONCE(size <= 0);
> -       (*stack_size)--;
> -       return cc->mthp_bitmap_stack[size - 1];
> -}
> -
>  /*
>   * mthp_collapse() consumes the bitmap that is generated during
>   * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>   *
>   * Each bit in cc->mthp_present_ptes represents a single occupied 
> (!none/zero)
> - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> - * of variables (offset, order), indicating the number of PTEs from the start
> - * of the PMD, and the order of the potential collapse candidate 
> respectively.
> - * We start at the PMD order and check if it is eligible for collapse; if 
> not,
> - * we add two entries to the stack at a lower order to represent the left and
> - * right halves of the PTE page table we are examining.
> - *
> - *                         offset       mid_offset
> - *                         |         |
> - *                         |         |
> - *                         v         v
> - *      --------------------------------------
> - *      |       cc->mthp_present_ptes         |
> - *      --------------------------------------
> - *                         <-------><------->
> - *                          order-1  order-1
> + * page. We start at the PMD order and check if it is eligible for collapse;
> + * if not, we check the left and right halves of the PTE page table we are
> + * examining at a lower order.
>   *
>   * For each of these, we determine how many PTE entries are occupied in the
>   * range of PTE entries we propose to collapse, then we compare this to a
> @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct 
> mm_struct *mm,
>  {
>         unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
>         enum scan_result last_result = SCAN_FAIL;
> -       int collapsed = 0, stack_size = 0;
> +       int collapsed = 0;
>         bool alloc_failed = false;
>         unsigned long collapse_address;
> -       struct mthp_range range;
> -       u16 offset;
> -       u8 order;
> +       unsigned int offset = 0;
> +       unsigned int order = HPAGE_PMD_ORDER;
>
> -       collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>
> -       while (stack_size) {
> -               range = collapse_mthp_stack_pop(cc, &stack_size);
> -               order = range.order;
> -               offset = range.offset;
> +       while (offset < HPAGE_PMD_NR) {
>                 nr_ptes = 1UL << order;
>
>                 if (!test_bit(order, &enabled_orders))
>                         goto next_order;
>
>                 max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> -
>                 nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, 
> offset,
>                                                       offset + nr_ptes);
>
> @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct 
> *mm,
>                                 collapsed += nr_ptes;
>                                 fallthrough;
>                         case SCAN_PTE_MAPPED_HUGEPAGE:
> -                               continue;
> +                               goto next_offset;
>                         /* Cases where lower orders might still succeed */
>                         case SCAN_ALLOC_HUGE_PAGE_FAIL:
>                                 alloc_failed = true;
> @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct 
> mm_struct *mm,
>                 }
>
>  next_order:
> -               if ((BIT(order) - 1) & enabled_orders) {
> -                       const u8 next_order = order - 1;
> -                       const u16 mid_offset = offset + (nr_ptes / 2);
> -
> -                       collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> -                                                next_order);
> -                       collapse_mthp_stack_push(cc, &stack_size, offset,
> -                                                next_order);
> +               if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> +                       (BIT(order) - 1) & enabled_orders) {
> +                       order = order - 1;
> +                       continue;
>                 }
> +next_offset:
> +               offset += nr_ptes;
> +               order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
>         }
>  done:
>         if (collapsed)
> --
> 2.54.0
>
>
>
> >
> > [...]
> >
> >>>> +     bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >>>>       memset(cc->node_load, 0, sizeof(cc->node_load));
> >>>>       nodes_clear(cc->alloc_nmask);
> >>>> +
> >>>> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, 
> >>>> tva_flags);
> >>>> +
> >>>> +     /*
> >>>> +      * If PMD is the only enabled order, enforce max_ptes_none, 
> >>>> otherwise
> >>>> +      * scan all pages to populate the bitmap for mTHP collapse.
> >>>> +      */
> >>>
> >>> You should note here, that we re-verify in mthp_collapse().
> >>>
> >>> But the question is, whether we should relocate the check completely into
> >>> mthp_collapse(), instead of conditionally duplicating it.
> >>>
> >>> What speaks against always populating the bitmap and making the decision 
> >>> in
> >>> mthp_collapse()?
> >>>
> >>> Sure, we might scan a page table a bit longer, but the code gets clearer 
> >>> ... and
> >>> I am not sure if scanning some more page table entries is really that 
> >>> critical here.
> >>
> >> Someone asked me to preserve the legacy behavior (PMD only). Although
> >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> >> have to do 511 iterations for no reason if PMD collapse is the only
> >> enabled order rather than bailing immediately.
> >>
> >> I'm ok with dropping it, but I think its the correct approach (despite
> >> the extra complexity). @Usama Arif brought up this point here
> >> https://lore.kernel.org/all/[email protected]/
> >
> > We talk about regressions, but I am not sure if we care about scanning speed
> > within a page table that much?
> >
> > After all, we locked it and already read some entries.
> >
> > Having the same check at two places to optimize for PMD order might right 
> > now
> > feel like a good optimization, but likely an irrelevant one in a near 
> > future?
> >
> > Anyhow, won't push back, as long as we document why we are special casing 
> > things
> > here.
> >
>


Reply via email to