On 19/08/25 7:11 pm, Nico Pache wrote:
The following series provides khugepaged with the capability to collapse anonymous memory regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the PMD scan is done, we do binary recursion on the bitmap to find the optimal mTHP sizes for the PMD range. The restriction on max_ptes_none is removed during the scan, to make sure we account for the whole PMD range. When no mTHP size is enabled, the legacy behavior of khugepaged is maintained. max_ptes_none will be scaled by the attempted collapse order to determine how full a mTHP must be to be eligible for the collapse to occur. If a mTHP collapse is attempted, but contains swapped out, or shared pages, we don't perform the collapse. It is now also possible to collapse to mTHPs without requiring the PMD THP size to be enabled.

With the default max_ptes_none=511, the code should keep most of its original behavior. When enabling multiple adjacent (m)THP sizes we need to set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and constantly promote mTHPs to the next available size. This is due to the fact that a collapse will introduce at least 2x the number of pages, and on a future scan will satisfy the promotion condition once again.

Patch 1: Refactor/rename hpage_collapse
Patch 2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3-5: Generalize khugepaged functions for arbitrary orders
Patch 6-8: The mTHP patches
Patch 9-10: Allow khugepaged to operate without PMD enabled
Patch 11-12: Tracing/stats
Patch 13: Documentation
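To check my own reading of the max_ptes_none scaling and the "creep" condition described above, here is a small userspace sketch. It is not code from the series: the helper names are hypothetical, HPAGE_PMD_ORDER is assumed to be 9 (4K base pages), and the right-shift scaling is my assumption about what "scaled by the attempted collapse order" means.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed: 4K base pages, so a PMD maps 512 PTEs. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

/*
 * Hypothetical helper: scale max_ptes_none to the attempted collapse
 * order. The shift is an assumption, not taken from the patches.
 */
static int scaled_max_ptes_none(int max_ptes_none, int order)
{
	assert(order >= 0 && order <= HPAGE_PMD_ORDER);
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/*
 * Hypothetical check for "creep": a fully populated order-(N-1) mTHP
 * sitting in an otherwise empty order-N range leaves half of the
 * order-N PTEs none. If that still passes the scaled limit, the next
 * scan promotes the range again.
 */
static bool collapse_creeps(int max_ptes_none, int order)
{
	int none_after_lower_collapse = (1 << order) / 2;

	return none_after_lower_collapse <=
	       scaled_max_ptes_none(max_ptes_none, order);
}
```

Under these assumptions, collapse_creeps(255, order) is false for every order from 1 to 9, while a limit of 256 or the default 511 makes it true, which would match the max_ptes_none<=255 requirement.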
For the next version, it would be really great if you could mention the lore links referencing the important ideas guiding the evolution of the algorithm, say, a policy decision we made. (I frequently do that, albeit I think I overdo it :)) I am asking because I am completely lost on the current discussion around the max_ptes_* scaling (I have been busy with other stuff).