On 05.09.25 13:48, Lorenzo Stoakes wrote:
On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote:
On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif...@gmail.com> wrote:
So I question the utility of max_ptes_none. If you can't tame page faults, then there is only limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some corner cases, but I am yet to learn why max_ptes_none=123 would make any sense.
For PMD-mapped THPs with the THP shrinker, this has changed. You can basically tame page faults: when you encounter memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR - 1 (i.e. 511 for x86) and will break down those hugepages and free up zero-filled memory.
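For readers who have not looked at the shrinker side, the check is roughly of the following shape (a simplified sketch along the lines of thp_underused() in mm/huge_memory.c, not the exact upstream code; locking, statistics and early-exit optimizations are omitted):

/*
 * Simplified sketch: a PMD-mapped THP is considered underused (and
 * thus worth splitting under memory pressure) once more than
 * max_ptes_none of its subpages are entirely zero-filled.  With the
 * tunable at HPAGE_PMD_NR - 1 (511 on x86) the check is effectively
 * disabled.
 */
static bool thp_underused_sketch(struct folio *folio)
{
	int num_zero_pages = 0;
	long i;

	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return false;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool zero = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);
		if (zero && ++num_zero_pages > khugepaged_max_ptes_none)
			return true;
	}
	return false;
}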
You are not really taming page faults, though; you are undoing what page faults might have messed up :)
I have seen in our prod workloads that memory usage and THP usage can spike (usually when the workload starts), but under memory pressure the memory usage is lower than with max_ptes_none = 511, while still keeping the benefits of THPs like lower TLB misses.
Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split.
Yes, both collapse and the shrinker split hinge on max_ptes_none to prevent one from thrashing the effect of the other.
I believe with mTHP support in khugepaged, the max_ptes_none value in the shrinker must also leverage the 'order' scaling to properly prevent thrashing.
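(For context, the 'order' scaling referred to here shifts the PMD-level tunable down to the smaller collapse order, so it describes the same fraction of a smaller mTHP. A rough illustration of the idea, not a quote of the series:)

/*
 * Rough illustration of the order scaling: max_ptes_none = 255 at
 * order 9 (PMD) becomes 127 at order 8, 63 at order 7, and so on.
 */
static unsigned int scaled_max_ptes_none(unsigned int order)
{
	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}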
No, please do not extend this 'scaling' stuff anywhere else; it's really horrid. We have to find an alternative to that; it's extremely confusing in what is already extremely confusing THP code.
As I said before, if we can't have a boolean we need another interface, which makes most sense as a ratio or, in practice, a percentage sysctl.
Speaking with David off-list, maybe the answer - if we must have this - is to add a new percentage interface and keep it in lock-step with the existing max_ptes_none interface. One updates the other, but internally we're just using the percentage value.
Yes, I'll try hacking something up and sending it as an RFC.
I've been testing a patch for this that I might include in the V11.
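A minimal sketch of what such a lock-step pair of knobs could look like (entirely hypothetical names, just to illustrate one knob updating the other while the percentage is what gets used internally):

/* Hypothetical lock-step knobs; names made up for illustration. */
static unsigned int max_ptes_none_percent;	/* 0..100 */

static void set_max_ptes_none_percent(unsigned int percent)
{
	max_ptes_none_percent = percent;
	/* keep the legacy max_ptes_none knob in sync */
	khugepaged_max_ptes_none = percent * (HPAGE_PMD_NR - 1) / 100;
}

static void set_max_ptes_none(unsigned int ptes)
{
	khugepaged_max_ptes_none = ptes;
	/* keep the percentage knob in sync (rounds down) */
	max_ptes_none_percent = ptes * 100 / (HPAGE_PMD_NR - 1);
}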
There are likely other ways to achieve that, keeping in mind that the THP shrinker will install zero pages and that max_ptes_none includes zero pages.
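(The connection being: in the collapse scan, PTEs mapping the shared zero page count against max_ptes_none just like empty PTEs do. A simplified sketch of that predicate, along the lines of the scan logic in mm/khugepaged.c:)

/*
 * Sketch: both empty PTEs and PTEs mapping the shared zero page are
 * counted as "none" against max_ptes_none during the collapse scan,
 * which is why the zero pages installed by the shrinker feed back
 * into khugepaged's decision.
 */
static bool pte_counts_as_none(pte_t pteval)
{
	return pte_none(pteval) ||
	       (pte_present(pteval) && is_zero_pfn(pte_pfn(pteval)));
}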
I do agree that the value of max_ptes_none is magical and that different workloads can react very differently to it. The relationship is definitely not linear: using max_ptes_none = 256 does not mean that the memory regression of THP=always vs. THP=madvise is halved.
To which value would you set it? Just 510? 0?
Sorry, I missed Usama's reply. Thanks Usama!
There are some very large workloads in the Meta fleet that I experimented with and found that having a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be a good compromise in terms of improving application metrics, having an acceptable amount of memory regression, and improving system-level metrics (lower TLB misses, lower page faults). I am sure there was a better value out there for these workloads, but it is not possible to experiment with every value.
(->Usama) It's a pity that such workloads exist. But then the percentage solution should work.
Good. So if there is no strong case for > 255, that's already valuable for mTHP.
--
Cheers
David / dhildenb