On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <[email protected]> wrote:
>
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
>
> This series removes the pre-deposit and allocates the PTE page table
> lazily - only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
>
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range - partial munmap, partial
> mprotect, partial mremap and so on.
> Most of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, split failing requires an order-0 allocation for
> a page table to fail, which is extremely unlikely.
>
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern - every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
>
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.
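A quick sanity check of the "200MB per 100GB" figure quoted above. This is
only a sketch: the 4KB page table size comes from the cover letter, but the
2MB THP size is an assumption (x86-64 with 4KB base pages), and the names
THP_SIZE, PGTABLE_SIZE and deposit_overhead() are made up for illustration.

```c
#include <stdint.h>

/* Assumed sizes: 2 MiB PMD-sized THP, 4 KiB PTE page table. */
#define THP_SIZE      (2ULL << 20)
#define PGTABLE_SIZE  (4ULL << 10)

/*
 * Bytes held in pre-deposited PTE page tables when thp_bytes of memory
 * are mapped as THPs: one deposited page table per THP.
 */
static uint64_t deposit_overhead(uint64_t thp_bytes)
{
	uint64_t nr_thps = thp_bytes / THP_SIZE;

	return nr_thps * PGTABLE_SIZE;
}
```

With these sizes, 100GB of THP memory is 51200 THPs, and 51200 deposited
4KB page tables come to 200MB, which matches the number in the cover letter.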
Hi Usama,

Thanks for tackling this, it seems like an interesting problem.

I'm trying to get more into reviewing, so bear with me, I may have some
stupid comments or questions. Where I can really help out is with testing.
I will build this for all RH-supported architectures and run some automated
test suites and performance metrics. I'll report back if I spot anything.

Cheers!
-- Nico

>
> The series is structured as follows:
>
> Patches 1-2:   Error infrastructure - make split functions return int
>                and propagate errors from vma_adjust_trans_huge()
>                through __split_vma, vma_shrink, and commit_merge.
>
> Patches 3-12:  Handle split failure at every call site - copy_huge_pmd,
>                do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
>                change_pmd_range (mprotect), follow_pmd_mask (GUP),
>                walk_pmd_range (pagewalk), move_page_tables (mremap),
>                move_pages (userfaultfd), and device migration.
>                The code will become effective in Patch 14 when split
>                functions start returning -ENOMEM.
>
> Patch 13:      Add __must_check to __split_huge_pmd(), split_huge_pmd()
>                and split_huge_pmd_address() so the compiler warns on
>                unchecked return values.
>
> Patch 14:      The actual change - allocate PTE page tables lazily at
>                split time instead of pre-depositing at THP creation.
>                This is when split functions will actually start returning
>                -ENOMEM.
>
> Patch 15:      Remove the now-dead deposit/withdraw code on
>                non-powerpc architectures.
>
> Patch 16:      Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
>                split failures.
>
> Patches 17-21: Selftests covering partial munmap, mprotect, mlock,
>                mremap, and MADV_DONTNEED on THPs to exercise the
>                split paths.
>
> The error handling patches are placed before the lazy allocation patch so
> that every call site is already prepared to handle split failures before
> the failure mode is introduced. This makes each patch independently safe
> to apply and bisect through.
>
> The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
> enabled. The test results are below:
>
> TAP version 13
> 1..5
> # Starting 5 tests from 1 test cases.
> # RUN thp_pmd_split.partial_munmap ...
> # thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
> # thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_munmap
> ok 1 thp_pmd_split.partial_munmap
> # RUN thp_pmd_split.partial_mprotect ...
> # thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
> # thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_mprotect
> ok 2 thp_pmd_split.partial_mprotect
> # RUN thp_pmd_split.partial_mlock ...
> # thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
> # thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_mlock
> ok 3 thp_pmd_split.partial_mlock
> # RUN thp_pmd_split.partial_mremap ...
> # thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
> # thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_mremap
> ok 4 thp_pmd_split.partial_mremap
> # RUN thp_pmd_split.partial_madv_dontneed ...
> # thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
> # thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_madv_dontneed
> ok 5 thp_pmd_split.partial_madv_dontneed
> # PASSED: 5 / 5 tests passed.
> # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> The patches are based off of 957a3fab8811b455420128ea5f41c51fd23eb6c7 from
> mm-unstable as of 25 Feb (7.0.0-rc1).
>
>
> RFC v1 -> v2:
> https://lore.kernel.org/all/[email protected]/
> - Change counter name to THP_SPLIT_PMD_FAILED (David)
> - remove pgtable_trans_huge_{deposit/withdraw} when not needed and
>   make them arch specific (David)
> - make split functions return error code and have callers handle them
>   (David and Kiryl)
> - Add test cases for splitting
>
> Usama Arif (21):
>   mm: thp: make split_huge_pmd functions return int for error
>     propagation
>   mm: thp: propagate split failure from vma_adjust_trans_huge()
>   mm: thp: handle split failure in copy_huge_pmd()
>   mm: thp: handle split failure in do_huge_pmd_wp_page()
>   mm: thp: handle split failure in zap_pmd_range()
>   mm: thp: handle split failure in wp_huge_pmd()
>   mm: thp: retry on split failure in change_pmd_range()
>   mm: thp: handle split failure in follow_pmd_mask()
>   mm: handle walk_page_range() failure from THP split
>   mm: thp: handle split failure in mremap move_page_tables()
>   mm: thp: handle split failure in userfaultfd move_pages()
>   mm: thp: handle split failure in device migration
>   mm: huge_mm: Make sure all split_huge_pmd calls are checked
>   mm: thp: allocate PTE page tables lazily at split time
>   mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
>   mm: thp: add THP_SPLIT_PMD_FAILED counter
>   selftests/mm: add THP PMD split test infrastructure
>   selftests/mm: add partial_mprotect test for change_pmd_range
>   selftests/mm: add partial_mlock test
>   selftests/mm: add partial_mremap test for move_page_tables
>   selftests/mm: add madv_dontneed_partial test
>
>  arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
>  arch/s390/include/asm/pgtable.h               |   6 -
>  arch/s390/mm/pgtable.c                        |  41 ---
>  arch/sparc/include/asm/pgtable_64.h           |   6 -
>  arch/sparc/mm/tlb.c                           |  36 ---
>  include/linux/huge_mm.h                       |  51 +--
>  include/linux/pgtable.h                       |  16 +-
>  include/linux/vm_event_item.h                 |   1 +
>  mm/debug_vm_pgtable.c                         |   4 +-
>  mm/gup.c                                      |  10 +-
>  mm/huge_memory.c                              | 208 +++++++++----
>  mm/khugepaged.c                               |   7 +-
>  mm/memory.c                                   |  26 +-
>  mm/migrate_device.c                           |  33 +-
>  mm/mprotect.c                                 |  11 +-
>  mm/mremap.c                                   |   8 +-
>  mm/pagewalk.c                                 |   8 +-
>  mm/pgtable-generic.c                          |  32 --
>  mm/rmap.c                                     |  42 ++-
>  mm/userfaultfd.c                              |   8 +-
>  mm/vma.c                                      |  37 ++-
>  mm/vmstat.c                                   |   1 +
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
>  tools/testing/vma/include/stubs.h             |   9 +-
>  25 files changed, 645 insertions(+), 259 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c
>
> --
> 2.47.3
>
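
The flow described for patches 1-14 (split returns int, callers propagate
the error, the PTE table is allocated only at split time) can be modelled
roughly as below. This is a toy userspace sketch, not the series' code:
every toy_* name is hypothetical, and toy_fail_alloc is just a hook that
stands in for an order-0 allocation failure. The one real detail is that
the kernel's __must_check expands to the warn_unused_result attribute.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of one PMD entry; not a kernel structure. */
struct toy_pmd {
	bool huge;        /* true while mapped by a single huge entry */
	void *pte_table;  /* stays NULL until a split actually happens */
};

/* Hypothetical hook that simulates allocation failure. */
static bool toy_fail_alloc;

static void *toy_alloc_pte_table(void)
{
	return toy_fail_alloc ? NULL : calloc(512, sizeof(unsigned long));
}

/* Lazy split: the PTE table is allocated here, not at mapping time. */
__attribute__((warn_unused_result))
static int toy_split_huge_pmd(struct toy_pmd *pmd)
{
	void *table;

	if (!pmd->huge)
		return 0;		/* already split */

	table = toy_alloc_pte_table();
	if (!table)
		return -ENOMEM;		/* huge mapping is left intact */

	pmd->pte_table = table;
	pmd->huge = false;
	return 0;
}

/* A partial operation on a sub-PMD range propagates the split failure. */
static int toy_partial_unmap(struct toy_pmd *pmd)
{
	int err = toy_split_huge_pmd(pmd);

	if (err)
		return err;

	/* ... work on individual PTEs would go here ... */
	return 0;
}

/* Zapping the whole range never allocates; it frees a table only if a
 * split happened earlier. */
static void toy_zap(struct toy_pmd *pmd)
{
	free(pmd->pte_table);		/* free(NULL) is a no-op */
	pmd->pte_table = NULL;
	pmd->huge = false;
}
```

In this model a failed split leaves the entry huge and untouched, so the
caller can simply return the error, and a mapping that is only ever zapped
never allocates a page table at all.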
