On 05/09/2025 12:55, David Hildenbrand wrote:
> On 05.09.25 13:48, Lorenzo Stoakes wrote:
>> On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote:
>>> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif...@gmail.com> wrote:
>>>>>>> So I question the utility of max_ptes_none. If you can't tame page
>>>>>>> faults, then there is only
>>>>>>> limited sense in taming khugepaged. I think there is value in setting
>>>>>>> max_ptes_none=0 for some
>>>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make
>>>>>>> any sense.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> For PMD-mapped THPs with the THP shrinker, this has changed. You can
>>>>>> basically tame page faults, as when you encounter memory pressure,
>>>>>> the shrinker kicks in if the value is less than HPAGE_PMD_NR - 1
>>>>>> (i.e. 511 for x86), and will break down those hugepages and free up
>>>>>> zero-filled memory.
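
For reference, my mental model of the shrinker side is roughly the below
(a sketch from memory, not the exact upstream code, and the helper name is
made up): a PMD-mapped THP only qualifies for a split if the number of
zero-filled subpages exceeds max_ptes_none, so the default of
HPAGE_PMD_NR - 1 effectively disables the split.

/* Sketch only, not the actual mm/huge_memory.c code. */
static bool thp_underused_sketch(struct folio *folio, int max_ptes_none)
{
	int i, nr_zero = 0;

	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return false;	/* shrinker split effectively disabled */

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool zero_filled = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);
		if (zero_filled && ++nr_zero > max_ptes_none)
			return true;
	}
	return false;
}
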
>>>>>
>>>>> You are not really taming page faults, though, you are undoing what page
>>>>> faults might have messed up :)
>>>>>
>>>>>> I have seen in our prod workloads that
>>>>>> the memory usage and THP usage can spike (usually when the workload
>>>>>> starts), but with memory pressure,
>>>>>> the memory usage is lower compared to with max_ptes_none = 511, while
>>>>>> still keeping the benefits
>>>>>> of THPs like lower TLB misses.
>>>>>
>>>>> Thanks for raising that: I think the current behavior is in place such
>>>>> that you don't bounce back-and-forth between khugepaged collapse and
>>>>> shrinker-split.
>>>>>
>>>>
>>>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent
>>>> one of these things thrashing the effect of the other.
>>> I believe with mTHP support in khugepaged, the max_ptes_none value in
>>> the shrinker must also leverage the 'order' scaling to properly
>>> prevent thrashing.
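
FWIW, the kind of per-order scaling being discussed could look roughly like
the below (purely illustrative, the name is made up and this is not the
actual patch): the PMD-order cutoff is shifted down to the number of PTEs
covered by a given mTHP order, so collapse and shrinker-split keep using a
consistent threshold.

/* Sketch only: scale the PMD-order cutoff down to a smaller mTHP order. */
static int max_ptes_none_for_order(int max_ptes_none, int order)
{
	/* e.g. 511 at PMD order (9) becomes 63 at order 6 and 15 at order 4 */
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
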
>>
>> No, please do not extend this 'scaling' stuff anywhere else, it's really
>> horrid.
>>
>> We have to find an alternative to that, it's extremely confusing in what is
>> already extremely confusing THP code.
>>
>> As I said before, if we can't have a boolean, we need another interface, which
>> makes most sense as a ratio or, in practice, a percentage sysctl.
>>
>> Speaking with David off-list, maybe the answer - if we must have this - is to
>> add a new percentage interface and have this in lock-step with the existing
>> max_ptes_none interface. One updates the other, but internally we're just
>> using the percentage value.
>
> Yes, I'll try hacking something up and sending it as an RFC.
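
Something along these lines, maybe (purely illustrative, the names are
invented here and the rounding needs more thought): the percentage is what
gets stored and used internally, and writes to either knob keep the two in
sync.

/* Sketch only: keep a percentage knob in lock-step with max_ptes_none. */
static unsigned int khugepaged_none_percent;	/* 0..100, used internally */

static void set_max_ptes_none(unsigned int max_ptes_none)
{
	/* Note: this round-trip is lossy, e.g. 51 -> 9% -> 45. */
	khugepaged_none_percent = max_ptes_none * 100 / (HPAGE_PMD_NR - 1);
}

static unsigned int get_max_ptes_none(void)
{
	return khugepaged_none_percent * (HPAGE_PMD_NR - 1) / 100;
}
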
>
>>
>>> I've been testing a patch for this that I might include in the V11.
>>>>
>>>>> There are likely other ways to achieve that, keeping in mind that
>>>>> the THP shrinker will install zero pages and that max_ptes_none includes
>>>>> zero pages.
>>>>>
>>>>>>
>>>>>> I do agree that the value of max_ptes_none is magical and different
>>>>>> workloads can react very differently
>>>>>> to it. The relationship is definitely not linear, i.e. if I use
>>>>>> max_ptes_none = 256, it does not mean
>>>>>> that the memory regression of using THP=always vs THP=madvise is halved.
>>>>>
>>>>> To which value would you set it? Just 510? 0?
>
> Sorry, I missed Usama's reply. Thanks Usama!
>
>>>>>
>>>>
>>>> There are some very large workloads in the Meta fleet that I experimented
>>>> with and found that having a small value works out. I experimented with 0,
>>>> 51 (10%) and 256 (50%). 51 was found to be an optimal compromise in terms
>>>> of improved application metrics, an acceptable amount of memory regression
>>>> and improved system-level metrics (lower TLB misses, lower page faults).
>>>> I am sure there was a better value out there for these workloads, but it
>>>> was not possible to experiment with every value.
>>
>> (->Usama) It's a pity that such workloads exist. But then the percentage
>> solution should work.
>
> Good. So if there is no strong case for > 255, that's already valuable for
> mTHP.
>
tbh the default value of 511 is horrible. I have thought about sending a patch
upstream for some time to change the default to 0, but it might mean that people
who upgrade their kernel suddenly see their memory not getting hugified, which
could be confusing for them?
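
IIRC the default comes from khugepaged_init() in mm/khugepaged.c (quoting from
memory, not the exact code), so the change itself would be a one-liner:

	/* current default: collapse even if only one PTE is populated */
	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;

	/* vs. a conservative default: only collapse fully populated ranges */
	khugepaged_max_ptes_none = 0;
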