On 05.09.25 13:48, Lorenzo Stoakes wrote:
On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote:
On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif...@gmail.com> wrote:
So I question the utility of max_ptes_none. If you can't tame page faults, then there is only limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some corner cases, but I am yet to learn why max_ptes_none=123 would make any sense.
For PMD-mapped THPs with the THP shrinker, this has changed. You can basically tame page faults: when you encounter memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR - 1 (i.e. 511 for x86) and will break down those hugepages and free up zero-filled memory.
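For readers who have not looked at the shrinker side, the check is roughly of the following shape (a simplified sketch along the lines of thp_underused() in mm/huge_memory.c, not the exact upstream code; locking, statistics and early-exit optimizations are omitted):

/*
 * Simplified sketch: a PMD-mapped THP is considered underused (and
 * thus worth splitting under memory pressure) once more than
 * max_ptes_none of its subpages are entirely zero-filled.  With the
 * tunable at HPAGE_PMD_NR - 1 (511 on x86) the check is effectively
 * disabled.
 */
static bool thp_underused_sketch(struct folio *folio)
{
	int num_zero_pages = 0;
	long i;

	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return false;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool zero = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);
		if (zero && ++num_zero_pages > khugepaged_max_ptes_none)
			return true;
	}
	return false;
}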
You are not really taming page faults, though; you are undoing what page faults might have messed up :)
I have seen in our prod workloads that memory usage and THP usage can spike (usually when the workload starts), but under memory pressure the memory usage is lower than with max_ptes_none = 511, while still keeping the benefits of THPs like lower TLB misses.
Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split.
Yes, both collapse and the shrinker split hinge on max_ptes_none to prevent one from thrashing the effect of the other.
I believe with mTHP support in khugepaged, the max_ptes_none value in the shrinker must also leverage the 'order' scaling to properly prevent thrashing.
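(For context, the 'order' scaling referred to here shifts the PMD-level tunable down to the smaller collapse order, so it describes the same fraction of a smaller mTHP. A rough illustration of the idea, not a quote of the series:)

/*
 * Rough illustration of the order scaling: max_ptes_none = 255 at
 * order 9 (PMD) becomes 127 at order 8, 63 at order 7, and so on.
 */
static unsigned int scaled_max_ptes_none(unsigned int order)
{
	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}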
No, please do not extend this 'scaling' stuff anywhere else; it's really horrid. We have to find an alternative to that; it's extremely confusing in what is already extremely confusing THP code.
As I said before, if we can't have a boolean we need another interface, which makes most sense as a ratio or, in practice, a percentage sysctl.
Speaking with David off-list, maybe the answer - if we must have this - is to add a new percentage interface and keep it in lock-step with the existing max_ptes_none interface. One updates the other, but internally we're just using the percentage value.
Yes, I'll try hacking something up and sending it as an RFC.
I've been testing a patch for this that I might include in the V11.
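A minimal sketch of what such a lock-step pair of knobs could look like (entirely hypothetical names, just to illustrate one knob updating the other while the percentage is what gets used internally):

/* Hypothetical lock-step knobs; names made up for illustration. */
static unsigned int max_ptes_none_percent;	/* 0..100 */

static void set_max_ptes_none_percent(unsigned int percent)
{
	max_ptes_none_percent = percent;
	/* keep the legacy max_ptes_none knob in sync */
	khugepaged_max_ptes_none = percent * (HPAGE_PMD_NR - 1) / 100;
}

static void set_max_ptes_none(unsigned int ptes)
{
	khugepaged_max_ptes_none = ptes;
	/* keep the percentage knob in sync (rounds down) */
	max_ptes_none_percent = ptes * 100 / (HPAGE_PMD_NR - 1);
}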
There are likely other ways to achieve that, keeping in mind that the THP shrinker will install zero pages and that max_ptes_none includes zero pages.
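(The connection being: in the collapse scan, PTEs mapping the shared zero page count against max_ptes_none just like empty PTEs do. A simplified sketch of that predicate, along the lines of the scan logic in mm/khugepaged.c:)

/*
 * Sketch: both empty PTEs and PTEs mapping the shared zero page are
 * counted as "none" against max_ptes_none during the collapse scan,
 * which is why the zero pages installed by the shrinker feed back
 * into khugepaged's decision.
 */
static bool pte_counts_as_none(pte_t pteval)
{
	return pte_none(pteval) ||
	       (pte_present(pteval) && is_zero_pfn(pte_pfn(pteval)));
}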
I do agree that the value of max_ptes_none is magical and that different workloads can react very differently to it. The relationship is definitely not linear: using max_ptes_none = 256 does not mean that the memory regression of THP=always vs. THP=madvise is halved.
To which value would you set it? Just 510? 0?
Sorry, I missed Usama's reply. Thanks Usama!
There are some very large workloads in the Meta fleet that I experimented with and found that having a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be a good compromise in terms of improving application metrics, having an acceptable amount of memory regression, and improving system-level metrics (lower TLB misses, lower page faults). I am sure there was a better value out there for these workloads, but it is not possible to experiment with every value.
(->Usama) It's a pity that such workloads exist. But then the percentage solution should work.
Good. So if there is no strong case for > 255, that's already valuable for mTHP.
--
Cheers
David / dhildenb