deferred tlb shootdown during page reclamation

Anshuman Khandual Thu, 27 Oct 2022 19:15:34 -0700

On 10/28/22 03:25, Barry Song wrote:
> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
> <punit.agra...@bytedance.com> wrote:
>>
>> [ Apologies for chiming in late in the conversation ]
>>
>> Anshuman Khandual <anshuman.khand...@arm.com> writes:
>>
>>> On 9/28/22 05:53, Barry Song wrote:
>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyic...@huawei.com> wrote:
>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>> [...]
>>>>>>
>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is 
>>>>>>> cheap */
>>>>>>> +    if (num_online_cpus() <= 4)
>>>>>> It would be great to have some more inputs from others, whether 4 (which 
>>>>>> should
>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something 
>>>>>> similar)
>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>
>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>> with 5,6,7
>>>> cores.
>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>> this patch.
>>>>
>>>> so it seems safe to have
>>>> if (num_online_cpus()  < 8)
>>>>
>>>>> Do you prefer this macro to be static or make it configurable through 
>>>>> kconfig then
>>>>> different platforms can make choice based on their own situations? It 
>>>>> maybe hard to
>>>>> test on all the arm64 platforms.
>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>> disable it according
>>>> to their hardware and products. Similar example: rodata=on or off.
>>> No, sounds bit excessive. Kernel command line options should not be added
>>> for every possible run time switch options.
>>>
>>>> Hi Anshuman, Will,  Catalin, Andrew,
>>>> what do you think about this approach?
>>>>
>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on 
>>>> arm64:
>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf...@linux.alibaba.com/
>>>>
>>>> I do believe we need it based on the expensive cost of tlb shootdown in 
>>>> arm64
>>>> even by hardware broadcast.
>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 
>>> selectively
>>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>> When running the test program in the commit in a VM, I saw benefits from
>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> up.
>>
>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
>> there any overhead? I am wondering what are the downsides of enabling
>> the config by default.
> As we are deferring tlb flush, but sometimes while we are modifying the vma
> which are deferred, we need to do a sync by flush_tlb_batched_pending() in
> mprotect() , madvise() to make sure they can see the flushed result. if nobody
> is doing mprotect(), madvise() etc in the deferred period, the overhead is 
> zero.

Right, it is difficult to justify this overhead for smaller systems,
which for sure would not benefit from this batched TLB framework.
Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

Reply via email to