On Tue, 28 Aug 2018 15:46:38 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> 
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.  
> 
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.

That's assuming quite large unmaps. But a lot of the time they are
going to go to a full PID flush.

> 
> But something like:
> 
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
>       if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
>               tlbie_pte(addr);
>       if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
>               tlbie_pmd(addr);
>       if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
>               tlbie_pud(addr);
> }
> 
> void tlb_flush_range(struct mmu_gather *tlb)
> {
>       unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
>       unsigned long addr;
> 
>       for (addr = tlb->start; addr < tlb->end; addr += stride)
>               tlb_flush_one(tlb, addr);
> 
>       ptesync();
> }
> 
> Should work I think. You'll only issue multiple TLBIEs on the
> boundaries, not every stride.

Yeah, we already do basically that today in the flush_tlb_range path,
just without the precise test for which page sizes were cleared:

                if (hflush) {
                        hstart = (start + PMD_SIZE - 1) & PMD_MASK;
                        hend = end & PMD_MASK;
                        if (hstart == hend)
                                hflush = false;
                }

                if (gflush) {
                        gstart = (start + PUD_SIZE - 1) & PUD_MASK;
                        gend = end & PUD_MASK;
                        if (gstart == gend)
                                gflush = false;
                }

                asm volatile("ptesync": : :"memory");
                if (local) {
                        __tlbiel_va_range(start, end, pid, page_size,
                                                mmu_virtual_psize);
                        if (hflush)
                                __tlbiel_va_range(hstart, hend, pid,
                                                PMD_SIZE, MMU_PAGE_2M);
                        if (gflush)
                                __tlbiel_va_range(gstart, gend, pid,
                                                PUD_SIZE, MMU_PAGE_1G);
                        asm volatile("ptesync": : :"memory");
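
The rounding in the hflush/gflush blocks above can be tried out in
userspace. This is only a sketch: the 2M PMD size is an assumed value,
inner_pmd_range() is a hypothetical helper name, and it uses a stricter
"<" test where the excerpt above only checks hstart == hend.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: PMD-sized pages are 2M. */
#define PMD_SHIFT 21
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/*
 * Hypothetical helper: compute the PMD-aligned sub-range that lies
 * fully inside [start, end). Returns true only if at least one whole
 * PMD-sized page fits. Note: the kernel excerpt tests for equality
 * only; the "<" comparison here is this sketch's own choice.
 */
static bool inner_pmd_range(unsigned long start, unsigned long end,
                            unsigned long *hstart, unsigned long *hend)
{
        assert(start <= end);

        *hstart = (start + PMD_SIZE - 1) & PMD_MASK;   /* round start up */
        *hend = end & PMD_MASK;                        /* round end down */

        return *hstart < *hend;
}
```

For example, [0x1ff000, 0x601000) yields the inner range
[0x200000, 0x600000), while [0x1000, 0x2000) contains no whole 2M page,
so the huge-page flush can be skipped.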

Thing is I think it's the smallish range cases you want to optimize
for. And for those we'll probably do something even smarter (like keep
a bitmap of pages to flush) because we really want to keep tlbies off
the bus whereas that's less important for x86.
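
To make the bitmap idea concrete, here is a rough userspace sketch. None
of this exists in the kernel: struct flush_bitmap, the fb_*() names, the
4K page size and the 64-page window are all assumptions made up for
illustration. The point is only that a small unmap can record exactly
which pages were touched and later invalidate just those.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12    /* assumed: 4K base pages */
#define WINDOW_PAGES 64    /* assumed: one 64-bit word of tracking */

/* Hypothetical tracking state for a small unmap. */
struct flush_bitmap {
        unsigned long base;  /* page-aligned start of the tracked window */
        uint64_t bits;       /* bit n set: page base + (n << PAGE_SHIFT) */
        bool overflow;       /* an address fell outside the window */
};

static void fb_init(struct flush_bitmap *fb, unsigned long base)
{
        fb->base = base & ~((1UL << PAGE_SHIFT) - 1);
        fb->bits = 0;
        fb->overflow = false;
}

/* Record that the page containing addr needs invalidating. */
static void fb_mark(struct flush_bitmap *fb, unsigned long addr)
{
        unsigned long idx;

        if (addr < fb->base) {
                fb->overflow = true;
                return;
        }
        idx = (addr - fb->base) >> PAGE_SHIFT;
        if (idx >= WINDOW_PAGES) {
                fb->overflow = true;
                return;
        }
        fb->bits |= (uint64_t)1 << idx;
}

/*
 * Write the address of each marked page to out[] in ascending order and
 * return how many there are. Returns -1 if tracking overflowed, in which
 * case the caller would fall back to a wider flush.
 */
static int fb_collect(const struct flush_bitmap *fb,
                      unsigned long out[WINDOW_PAGES])
{
        int n = 0;

        if (fb->overflow)
                return -1;
        for (unsigned int idx = 0; idx < WINDOW_PAGES; idx++) {
                if (fb->bits & ((uint64_t)1 << idx))
                        out[n++] = fb->base + ((unsigned long)idx << PAGE_SHIFT);
        }
        return n;
}
```

Marking the same page twice costs nothing extra, and a sparse unmap
inside the window produces one invalidation per touched page rather
than one per stride across the whole range.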

Still not really seeing a reason not to implement a struct
arch_mmu_gather. A little bit of data contained to the arch is nothing
compared with the multitude of hooks and divergence of code.
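
As a sketch of what that could look like (userspace stand-ins only: the
field layout, the arch_gather_*() helpers and struct mmu_gather_sketch
are hypothetical, not an existing kernel interface), the arch-private
part might track a start-end range per page size:

```c
#include <limits.h>   /* ULONG_MAX */
#include <stdbool.h>

/* Hypothetical page-size classes an arch might track separately. */
enum psize { PSIZE_BASE, PSIZE_PMD, PSIZE_PUD, PSIZE_COUNT };

struct range {
        unsigned long start, end;
};

/* Hypothetical arch-private data: one start-end range per page size. */
struct arch_mmu_gather {
        struct range r[PSIZE_COUNT];
};

/* Simplified stand-in for the generic struct mmu_gather. */
struct mmu_gather_sketch {
        unsigned long start, end;
        struct arch_mmu_gather arch;  /* opaque to generic code */
};

/* Reset every per-size range to "empty" (start > end). */
static void arch_gather_init(struct arch_mmu_gather *a)
{
        for (int i = 0; i < PSIZE_COUNT; i++) {
                a->r[i].start = ULONG_MAX;
                a->r[i].end = 0;
        }
}

/* Grow the range for page size ps to cover [addr, addr + size). */
static void arch_gather_track(struct arch_mmu_gather *a, enum psize ps,
                              unsigned long addr, unsigned long size)
{
        if (addr < a->r[ps].start)
                a->r[ps].start = addr;
        if (addr + size > a->r[ps].end)
                a->r[ps].end = addr + size;
}

/* True if anything of page size ps has been recorded. */
static bool arch_gather_pending(const struct arch_mmu_gather *a,
                                enum psize ps)
{
        return a->r[ps].start < a->r[ps].end;
}
```

Generic code would only carry the embedded struct around; what it means
and how it turns into flushes stays entirely within the arch.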

Thanks,
Nick
