On Tue, Jun 2, 2026 at 11:22 AM Nico Pache <[email protected]> wrote: > > > > On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote: > >>> > >>> Reading this, it is unclear why exactly do we need the stack. > >> > >> So I looked into your items below. It seems logical, and I think it > >> works the same way; however, your method seems slightly harder to > >> understand due to all the edge cases and more error-prone to future > >> changes (the stack holds implicit knowledge of the offset/order that > >> must now be tracked in the edge cases). > >> > >> Given the stack is 24 bytes, I'm not sure if the extra complexity is > >> worth saving that small amount of memory. Although we would also be > >> getting rid of (3?) functions, so both approaches have pros and cons. > > > > I consider a simple forward loop over the offset ... less complexity > > compared to > > a stack structure :) > > > >> > >> I will implement a patch comparing your solution against mine and send > >> it here, then we can decide which approach is better. > > > > Right, throw it over the fence and I'll see how to improve it further. > > Ok heres what the diff looks like on top of my V19. > > you can access the tree here > https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier > review. > > So far I have no problem with this approach it appeared cleaner than i > thought. Did some light testing. Gonna throw it more through the ringer > tomorrow.
not sure why this didnt send with the proper encoding I guess my email is still a little screwed up > > > From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001 > From: Nico Pache <[email protected]> > Date: Tue, 2 Jun 2026 10:26:18 -0600 > Subject: [PATCH] convert from stack to forward loop > > Signed-off-by: Nico Pache <[email protected]> > --- > mm/khugepaged.c | 96 ++++++++----------------------------------------- > 1 file changed, 15 insertions(+), 81 deletions(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 498eba009751..6de935e76ceb 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, > MM_SLOTS_HASH_BITS); > static struct kmem_cache *mm_slot_cache __ro_after_init; > > #define KHUGEPAGED_MIN_MTHP_ORDER 2 > -/* > - * mthp_collapse() does an iterative DFS over a binary tree, from > - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack > - * size needed for a DFS on a binary tree is height + 1, where > - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER. > - * > - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures > - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time. > - */ > -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - > KHUGEPAGED_MIN_MTHP_ORDER + 1) > - > -/* > - * Defines a range of PTE entries in a PTE page table which are being > - * considered for mTHP collapse. > - * > - * @offset: the offset of the first PTE entry in a PMD range. > - * @order: the order of the PTE entries being considered for collapse. > - */ > -struct mthp_range { > - u16 offset; > - u8 order; > -}; > > struct collapse_control { > bool is_khugepaged; > @@ -137,7 +115,6 @@ struct collapse_control { > > /* Each bit represents a single occupied (!none/zero) page. */ > DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE); > - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE]; > }; > > /** > @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct > mm_struct *mm, unsigned long s > return result; > } > > -static void collapse_mthp_stack_push(struct collapse_control *cc, int > *stack_size, > - u16 offset, u8 order) > -{ > - const int size = *stack_size; > - struct mthp_range *stack = &cc->mthp_bitmap_stack[size]; > - > - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE); > - stack->order = order; > - stack->offset = offset; > - (*stack_size)++; > -} > - > -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc, > - int *stack_size) > -{ > - const int size = *stack_size; > - > - VM_WARN_ON_ONCE(size <= 0); > - (*stack_size)--; > - return cc->mthp_bitmap_stack[size - 1]; > -} > - > /* > * mthp_collapse() consumes the bitmap that is generated during > * collapse_scan_pmd() to determine what regions and mTHP orders fit best. > * > * Each bit in cc->mthp_present_ptes represents a single occupied > (!none/zero) > - * page. A stack structure cc->mthp_bitmap_stack is used to check different > - * regions of the bitmap for collapse eligibility. The stack maintains a pair > - * of variables (offset, order), indicating the number of PTEs from the start > - * of the PMD, and the order of the potential collapse candidate > respectively. > - * We start at the PMD order and check if it is eligible for collapse; if > not, > - * we add two entries to the stack at a lower order to represent the left and > - * right halves of the PTE page table we are examining. > - * > - * offset mid_offset > - * | | > - * | | > - * v v > - * -------------------------------------- > - * | cc->mthp_present_ptes | > - * -------------------------------------- > - * <-------><-------> > - * order-1 order-1 > + * page. We start at the PMD order and check if it is eligible for collapse; > + * if not, we check the left and right halves of the PTE page table we are > + * examining at a lower order. > * > * For each of these, we determine how many PTE entries are occupied in the > * range of PTE entries we propose to collapse, then we compare this to a > @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct > mm_struct *mm, > { > unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none; > enum scan_result last_result = SCAN_FAIL; > - int collapsed = 0, stack_size = 0; > + int collapsed = 0; > bool alloc_failed = false; > unsigned long collapse_address; > - struct mthp_range range; > - u16 offset; > - u8 order; > + unsigned int offset = 0; > + unsigned int order = HPAGE_PMD_ORDER; > > - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER); > > - while (stack_size) { > - range = collapse_mthp_stack_pop(cc, &stack_size); > - order = range.order; > - offset = range.offset; > + while (offset < HPAGE_PMD_NR) { > nr_ptes = 1UL << order; > > if (!test_bit(order, &enabled_orders)) > goto next_order; > > max_ptes_none = collapse_max_ptes_none(cc, NULL, order); > - > nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, > offset, > offset + nr_ptes); > > @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct > *mm, > collapsed += nr_ptes; > fallthrough; > case SCAN_PTE_MAPPED_HUGEPAGE: > - continue; > + goto next_offset; > /* Cases where lower orders might still succeed */ > case SCAN_ALLOC_HUGE_PAGE_FAIL: > alloc_failed = true; > @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct > mm_struct *mm, > } > > next_order: > - if ((BIT(order) - 1) & enabled_orders) { > - const u8 next_order = order - 1; > - const u16 mid_offset = offset + (nr_ptes / 2); > - > - collapse_mthp_stack_push(cc, &stack_size, mid_offset, > - next_order); > - collapse_mthp_stack_push(cc, &stack_size, offset, > - next_order); > + if (order > KHUGEPAGED_MIN_MTHP_ORDER && > + (BIT(order) - 1) & enabled_orders) { > + order = order - 1; > + continue; > } > +next_offset: > + offset += nr_ptes; > + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER); > } > done: > if (collapsed) > -- > 2.54.0 > > > > > > > [...] > > > >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE); > >>>> memset(cc->node_load, 0, sizeof(cc->node_load)); > >>>> nodes_clear(cc->alloc_nmask); > >>>> + > >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, > >>>> tva_flags); > >>>> + > >>>> + /* > >>>> + * If PMD is the only enabled order, enforce max_ptes_none, > >>>> otherwise > >>>> + * scan all pages to populate the bitmap for mTHP collapse. > >>>> + */ > >>> > >>> You should note here, that we re-verify in mthp_collapse(). > >>> > >>> But the question is, whether we should relocate the check completely into > >>> mthp_collapse(), instead of conditionally duplicating it. > >>> > >>> What speaks against always populating the bitmap and making the decision > >>> in > >>> mthp_collapse()? > >>> > >>> Sure, we might scan a page table a bit longer, but the code gets clearer > >>> ... and > >>> I am not sure if scanning some more page table entries is really that > >>> critical here. > >> > >> Someone asked me to preserve the legacy behavior (PMD only). Although > >> rather trivial, if you set max_ptes_none=0 for example, we'd still > >> have to do 511 iterations for no reason if PMD collapse is the only > >> enabled order rather than bailing immediately. > >> > >> I'm ok with dropping it, but I think its the correct approach (despite > >> the extra complexity). @Usama Arif brought up this point here > >> https://lore.kernel.org/all/[email protected]/ > > > > We talk about regressions, but I am not sure if we care about scanning speed > > within a page table that much? > > > > After all, we locked it and already read some entries. > > > > Having the same check at two places to optimize for PMD order might right > > now > > feel like a good optimization, but likely an irrelevant one in a near > > future? > > > > Anyhow, won't push back, as long as we document why we are special casing > > things > > here. > > >
