Nicholas Piggin <npig...@gmail.com> writes:
> Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
>> With shared mapping, even though we are unmapping a large range, the
>> kernel will force a TLB flush with ptl lock held to avoid the race
>> mentioned in commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into
>> tlb flushing and memory freeing parts")
>>
>> This results in the kernel issuing a high number of TLB flushes even
>> for a large range. This can be improved by making sure the kernel
>> switch to pid based flush if the kernel is unmapping a 2M range.
>
> It would be good to have a bit more description here.
>
> In any patch that changes a heuristic like this, I would like to see
> some justification or reasoning that could be refuted or used as a
> supporting argument if we ever wanted to change the heuristic later.
> Ideally with some of the obvious downsides listed as well.
>
> This "improves" things here, but what if it hurt things elsewhere, how
> would we come in later and decide to change it back?
>
> THP flushes for example, I think now they'll do PID flushes (if they
> have to be broadcast, which they will tend to be when khugepaged does
> them). So now that might increase jitter for THP and cause it to be a
> loss for more workloads.
>
> So where do you notice this? What's the benefit?
Ack. Needs some numbers and supporting evidence.

>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>   * invalidating a full PID, so it has a far lower threshold to change from
>>   * individual page flushes to full-pid flushes.
>>   */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>  
>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>  	if (fullmm)
>>  		flush_pid = true;
>>  	else if (type == FLUSH_TYPE_GLOBAL)
>> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>
> Arguably >= is nicer than > here, but this shouldn't be in the same
> patch as the value change.
>
>>  	else
>>  		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> And it should change everything to be consistent. Although I'm not sure
> it's worth changing even though I highly doubt any administrator would
> be tweaking this.

This made me look at how an administrator tweaks these thresholds, and
AFAICS the answer is "recompile the kernel"?

It looks like x86 has a debugfs file for tlb_single_page_flush_ceiling,
but we don't. I guess we meant to copy that but never did?

So at the moment both thresholds could just be #defines.

Making them tweakable at runtime would be nice, it would give us an
escape hatch if we ever hit a workload in production that wants a
different value. A rough, untested sketch of what that might look like
is below.

cheers
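
Something along these lines, loosely modelled on the x86 file. It's a
minimal sketch, not a tested patch: it assumes debugfs_create_ulong()
is suitable here and that the powerpc debugfs root (called
arch_debugfs_dir below, the actual name may differ) is available; the
file names are only illustrative:

	#include <linux/debugfs.h>
	#include <linux/init.h>

	static int __init radix_tlb_debugfs_init(void)
	{
		/* Threshold for broadcast (global) flushes. */
		debugfs_create_ulong("tlb_single_page_flush_ceiling", 0600,
				     arch_debugfs_dir,
				     &tlb_single_page_flush_ceiling);

		/* Threshold for local (tlbiel) flushes. */
		debugfs_create_ulong("tlb_local_single_page_flush_ceiling", 0600,
				     arch_debugfs_dir,
				     &tlb_local_single_page_flush_ceiling);

		return 0;
	}
	late_initcall(radix_tlb_debugfs_init);

That would let the thresholds be adjusted at runtime, e.g. by writing
a new value into the debugfs file, rather than recompiling the kernel.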