Re: [RFC PATCH 1/3] Revert "mm: always flush VMA ranges affected by zap_page_range"
at 12:16 AM, Nicholas Piggin wrote: > This reverts commit 4647706ebeee6e50f7b9f922b095f4ec94d581c3. > > Patch 99baac21e4585 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss > problem") provides a superset of the TLB flush coverage of this > commit, and even includes in the changelog "this patch supersedes > 'mm: Always flush VMA ranges affected by zap_page_range v2'". > > Reverting this avoids double flushing the TLB range, and the less > efficient flush_tlb_range() call (the mmu_gather API is more precise > about what ranges it invalidates). > > Signed-off-by: Nicholas Piggin > --- > mm/memory.c | 14 +- > 1 file changed, 1 insertion(+), 13 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 7206a634270b..9d472e00fc2d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1603,20 +1603,8 @@ void zap_page_range(struct vm_area_struct *vma, > unsigned long start, > tlb_gather_mmu(&tlb, mm, start, end); > update_hiwater_rss(mm); > mmu_notifier_invalidate_range_start(mm, start, end); > - for ( ; vma && vma->vm_start < end; vma = vma->vm_next) { > + for ( ; vma && vma->vm_start < end; vma = vma->vm_next) > unmap_single_vma(&tlb, vma, start, end, NULL); > - > - /* > - * zap_page_range does not specify whether mmap_sem should be > - * held for read or write. That allows parallel zap_page_range > - * operations to unmap a PTE and defer a flush meaning that > - * this call observes pte_none and fails to flush the TLB. > - * Rather than adding a complex API, ensure that no stale > - * TLB entries exist when this call returns. > - */ > - flush_tlb_range(vma, start, end); > - } > - > mmu_notifier_invalidate_range_end(mm, start, end); > tlb_finish_mmu(&tlb, start, end); > } Yes, this was in my “to check when I have time” todo list, especially since the flush was from start to end, not even vma->vm_start to vma->vm_end. The revert seems correct. Reviewed-by: Nadav Amit
Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
> On Jun 1, 2023, at 1:50 PM, Edgecombe, Rick P > wrote: > > On Thu, 2023-06-01 at 14:38 -0400, Kent Overstreet wrote: >> On Thu, Jun 01, 2023 at 06:13:44PM +, Edgecombe, Rick P wrote: text_poke() _does_ create a separate RW mapping. >>> >>> Sorry, I meant a separate RW allocation. >> >> Ah yes, that makes sense >> >> >>> The thing that sucks about text_poke() is that it always does a full TLB flush, and AFAICT that's not remotely needed. What it really wants to be doing is conceptually just kmap_local() mempcy() kunmap_loca() flush_icache(); ...except that kmap_local() won't actually create a new mapping on non-highmem architectures, so text_poke() open codes it. >>> >>> Text poke creates only a local CPU RW mapping. It's more secure >>> because >>> other threads can't write to it. >> >> *nod*, same as kmap_local > > It's only used and flushed locally, but it is accessible to all CPU's, > right? > >> >>> It also only needs to flush the local core when it's done since >>> it's >>> not using a shared MM. >> >> Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs >> could be a bit clearer. >> >> What is it (if anything) you don't like about text_poke() then? It >> looks >> like it's doing broadly similar things to kmap_local(), so should be >> in the same ballpark from a performance POV? > > The way text_poke() is used here, it is creating a new writable alias > and flushing it for *each* write to the module (like for each write of > an individual relocation, etc). I was just thinking it might warrant > some batching or something. I am not advocating to do so, but if you want to have many efficient writes, perhaps you can just disable CR0.WP. Just saying that if you are about to write all over the memory, text_poke() does not provide too much security for the poking thread.
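For illustration only, the batching Rick alludes to could look roughly like the sketch below: stage the relocation fixups in an ordinary buffer and poke each region once, instead of creating and flushing a temporary writable alias per relocation. struct reloc and apply_one_reloc() are invented for the example; text_poke_copy() is the only existing x86 interface referenced, and this is not what the patch set actually does.

static int apply_relocs_batched(void *text, size_t len,
				const struct reloc *relocs, int nr)
{
	u8 *staging;
	int i;

	/* Stage the fixups in plain memory; no special mapping needed yet. */
	staging = kvmalloc(len, GFP_KERNEL);
	if (!staging)
		return -ENOMEM;

	memcpy(staging, text, len);
	for (i = 0; i < nr; i++)
		apply_one_reloc(staging, &relocs[i]);	/* ordinary writes */

	/* One pass over the region instead of one text_poke() per fixup. */
	text_poke_copy(text, staging, len);

	kvfree(staging);
	return 0;
}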
Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
> On Jun 5, 2023, at 9:10 AM, Edgecombe, Rick P > wrote: > > On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote: >> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote: >>> On Thu, 1 Jun 2023 16:54:36 -0700 >>> Nadav Amit wrote: >>> >>>>> The way text_poke() is used here, it is creating a new writable >>>>> alias >>>>> and flushing it for *each* write to the module (like for each >>>>> write of >>>>> an individual relocation, etc). I was just thinking it might >>>>> warrant >>>>> some batching or something. >> >>>> I am not advocating to do so, but if you want to have many >>>> efficient >>>> writes, perhaps you can just disable CR0.WP. Just saying that if >>>> you >>>> are about to write all over the memory, text_poke() does not >>>> provide >>>> too much security for the poking thread. >> >> Heh, this is definitely and easier hack to implement :) > > I don't know the details, but previously there was some strong dislike > of CR0.WP toggling. And now there is also the problem of CET. Setting > CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT). > I guess you might get away with toggling them both in some controlled > situation, but it might be a lot easier to hack up then to be made > fully acceptable. It does sound much more efficient though. Thanks for highlighting this issue. I understand the limitations of CR0.WP. There is also always the concerns that without CET or other control flow integrity mechanism, someone would abuse (using ROP/JOP) functions that clear CR0.WP…
Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski wrote: > > But jit_text_alloc() can't do this, because the order of operations doesn't > match. With jit_text_alloc(), the executable mapping shows up before the > text is populated, so there is no atomic change from not-there to > populated-and-executable. Which means that there is an opportunity for CPUs, > speculatively or otherwise, to start filling various caches with intermediate > states of the text, which means that various architectures (even x86!) may > need serialization. > > For eBPF- and module- like use cases, where JITting/code gen is quite > coarse-grained, perhaps something vaguely like: > > jit_text_alloc() -> returns a handle and an executable virtual address, but > does *not* map it there > jit_text_write() -> write to that handle > jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I > think) Andy, would you mind explaining why you think a sync is not needed? I mean I have a “feeling” that perhaps TSO can guarantee something based on the order of write and page-table update. Is that the argument? In this regard, one thing that I clearly do not understand is why *today* it is ok for users of bpf_arch_text_copy() not to call text_poke_sync(). Am I missing something?
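For reference, a caller-side sketch of the three-step interface Andy describes above. Every type and signature here is hypothetical, since the proposal only names the steps, not an exact API.

struct jit_text;	/* opaque handle -- hypothetical */

int install_jit_image(const void *image, size_t len, void **exec_addr)
{
	struct jit_text *jt;
	void *vaddr;

	/* Reserve an executable VA; nothing is mapped there yet. */
	jt = jit_text_alloc(len, &vaddr);
	if (!jt)
		return -ENOMEM;

	/* Populate through the handle; the executable alias stays unmapped. */
	jit_text_write(jt, 0, image, len);

	/*
	 * Atomic transition from not-there to populated-and-executable,
	 * plus whatever serialization the architecture needs (the open
	 * question above is whether x86 needs any at this point).
	 */
	jit_text_map(jt);

	*exec_addr = vaddr;
	return 0;
}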
Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs
> > On Jun 20, 2023, at 7:46 AM, Yair Podemsky wrote: > > @@ -1525,7 +1525,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, > struct vm_area_struct *v > addr + HPAGE_PMD_SIZE); > mmu_notifier_invalidate_range_start(&range); > pmd = pmdp_collapse_flush(vma, addr, pmdp); > - tlb_remove_table_sync_one(); > + tlb_remove_table_sync_one(mm); Can’t pmdp_collapse_flush() have one additional argument “freed_tables” that it would propagate, for instance on x86 to flush_tlb_mm_range() ? Then you would not need tlb_remove_table_sync_one() to issue an additional IPI, no? It just seems that you might still have 2 IPIs in many cases instead of one, and unless I am missing something, I don’t see why.
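To make the suggestion above concrete, it could look something like the sketch below. This is purely illustrative: the extra "freed_tables" parameter and the function name do not exist today; the x86 flush_tlb_mm_range() signature is the only real interface referenced.

/* Hypothetical variant of pmdp_collapse_flush() taking "freed_tables". */
pmd_t pmdp_collapse_flush_freed(struct vm_area_struct *vma, unsigned long address,
				pmd_t *pmdp, bool freed_tables)
{
	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);

	/*
	 * Propagating freed_tables lets the collapse-time flush IPIs double
	 * as the synchronization that tlb_remove_table_sync_one() would
	 * otherwise add with a second round of IPIs.
	 */
	flush_tlb_mm_range(vma->vm_mm, address, address + HPAGE_PMD_SIZE,
			   PAGE_SHIFT, freed_tables);
	return pmd;
}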
Re: [PATCH v3 1/4] Make place for common balloon code
On Aug 22, 2022, at 4:37 AM, Alexander Atanasov wrote: > mm/balloon_compaction.c -> mm/balloon.c > File already contains code that is common along balloon > drivers so rename it to reflect its contents. > > include/linux/balloon_compaction.h -> include/linux/balloon.h > Remove it from files which do not actually use it. > Drop externs from function delcarations. > > Signed-off-by: Alexander Atanasov Makes so much sense. Acked-by: Nadav Amit
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
> On Sep 14, 2022, at 11:42 PM, Barry Song <21cn...@gmail.com> wrote: > >> >> The very idea behind TLB deferral is the opportunity it (might) provide >> to accumulate address ranges and cpu masks so that individual TLB flush >> can be replaced with a more cost effective range based TLB flush. Hence >> I guess unless address range or cpumask based cost effective TLB flush >> is available, deferral does not improve the unmap performance as much. > > > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack > from all cpus in the system, tlbi is not scalable. The point here is that we > avoid waiting for each individual TLBi. Alternatively, they are batched. If > you read the benchmark in the commit log, you can find the great decline > in the cost to swap out a page. Just a minor correction: arch_tlbbatch_flush() does not collect ranges. On x86 it only accumulates a CPU mask.
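For context on that correction: the x86 batch structure indeed records only which CPUs may hold stale entries, with no address range at all (paraphrased from arch/x86/include/asm/tlbbatch.h):

struct arch_tlbflush_unmap_batch {
	/*
	 * Each bit set is a CPU that potentially has a TLB entry for one of
	 * the PFNs being flushed; no start/end range is tracked.
	 */
	struct cpumask cpumask;
};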
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On Sep 20, 2022, at 11:53 PM, Anshuman Khandual wrote: > ⚠ External Email > > On 8/22/22 13:51, Yicong Yang wrote: >> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch >> *batch, >> + struct mm_struct *mm, >> + unsigned long uaddr) >> +{ >> + __flush_tlb_page_nosync(mm, uaddr); >> +} >> + >> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch >> *batch) >> +{ >> + dsb(ish); >> +} > > Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping > TLB invalidation requests on a given mm and try to generate a range based TLB > invalidation such as flush_tlb_range(). > > struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous > ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can > later be flushed in subsequent arch_tlbbatch_flush() ? > > OR > > It might not be worth the effort and complexity, in comparison to performance > improvement, TLB range flush brings in ? So here are my 2 cents, based on my experience with Intel-x86. It is likely different on arm64, but perhaps it can provide you some insight into what parameters you should measure and consider. In general there is a tradeoff between full TLB flushes and entry-specific ones. Flushing specific entries takes more time than flushing the entire TLB, but saves TLB refills. Dave Hansen made some calculations in the past and came up with 33 as a magic cutoff number, i.e., if you need to flush more than 33 entries, just flush the entire TLB. I am not sure that this exact number is very meaningful, since one might argue that it should’ve taken PTI into account (which might require twice as many TLB invalidations). Anyhow, back to arch_tlbbatch_add_mm(). It may be possible to track ranges, but the question is whether you would actually succeed in forming continuous ranges that are eventually (on x86) smaller than the full TLB flush cutoff (=33). Questionable (perhaps better with MGLRU?). Then, you should remember that tracking should be very efficient, since even a few cache misses might have a greater cost than what you save by selective flushing. Finally, on x86 you would need to invoke the smp/IPI layer multiple times to send different cores the relevant range they need to flush. IOW: it is somewhat complicated to implement efficiently, and on x86 (and probably other IPI-based TLB shootdown systems) it does not have a clear performance benefit (IMHO).
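As a minimal sketch of that tradeoff (the cutoff constant and the helper name are illustrative; on x86 the real knob is tlb_single_page_flush_ceiling inside flush_tlb_mm_range()):

static void flush_range_or_all(struct mm_struct *mm,
			       unsigned long start, unsigned long end)
{
	unsigned long nr = (end - start) >> PAGE_SHIFT;

	if (nr > 33)
		/* Too many entries: eat the refill cost, flush everything. */
		flush_tlb_mm(mm);
	else
		/* Few entries: selective flushes are worth the extra time. */
		flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
}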
Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
On 1/18/23 10:00 AM, Nicholas Piggin wrote: Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, which is what existing code does, so the patch is effectively a no-op. Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is. Signed-off-by: Nicholas Piggin --- Documentation/mm/active_mm.rst | 6 ++ arch/Kconfig | 17 + include/linux/sched/mm.h | 18 +++--- kernel/sched/core.c| 22 ++ kernel/sched/sched.h | 4 +++- 5 files changed, 59 insertions(+), 8 deletions(-) diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst index 6f8269c284ed..2b0d08332400 100644 --- a/Documentation/mm/active_mm.rst +++ b/Documentation/mm/active_mm.rst @@ -4,6 +4,12 @@ Active MM = +Note, the mm_count refcount may no longer include the "lazy" users +(running tasks with ->active_mm == mm && ->mm == NULL) on kernels +with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy +references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb() +helpers which abstracts this config option. + :: List: linux-kernel diff --git a/arch/Kconfig b/arch/Kconfig index 12e3ddabac9d..b07d36f08fea 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM irqs disabled over activate_mm. Architectures that do IPI based TLB shootdowns should enable this. +# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references. +# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching +# to/from kernel threads when the same mm is running on a lot of CPUs (a large +# multi-threaded application), by reducing contention on the mm refcount. +# +# This can be disabled if the architecture ensures no CPUs are using an mm as a +# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm +# or its kernel page tables). This could be arranged by arch_exit_mmap(), or +# final exit(2) TLB flush, for example. +# +# To implement this, an arch *must*: +# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the +# lazy reference of a kthread's ->active_mm (non-arch code has been converted +# already). +config MMU_LAZY_TLB_REFCOUNT + def_bool y + config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 5376caf6fcf3..68bbe8d90c2e 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm) /* Helpers for lazy TLB mm refcounting */ static inline void mmgrab_lazy_tlb(struct mm_struct *mm) { - mmgrab(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) + mmgrab(mm); } static inline void mmdrop_lazy_tlb(struct mm_struct *mm) { - mmdrop(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) { + mmdrop(mm); + } else { + /* +* mmdrop_lazy_tlb must provide a full memory barrier, see the +* membarrier comment finish_task_switch which relies on this. +*/ + smp_mb(); + } } Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various locations in which smp_mb() was not required, this comment might be confusing. IOW, for the cases in most cases where mmdrop_lazy_tlb() replaced mmdrop(), this comment was irrelevant, and therefore it now becomes confusing. I am not sure the include the smp_mb() here instead of "open-coding" it helps. 
static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm) { - mmdrop_sched(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) + mmdrop_sched(mm); + else + smp_mb(); // see above } Wrong style of comment.
Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
On 1/23/23 9:35 AM, Nadav Amit wrote: + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) { + mmdrop(mm); + } else { + /* + * mmdrop_lazy_tlb must provide a full memory barrier, see the + * membarrier comment finish_task_switch which relies on this. + */ + smp_mb(); + } } Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various locations in which smp_mb() was not required, this comment might be confusing. IOW, for the cases in most cases where mmdrop_lazy_tlb() replaced mmdrop(), this comment was irrelevant, and therefore it now becomes confusing. I am not sure the include the smp_mb() here instead of "open-coding" it helps. I think that I now understand why you do need the smp_mb() here, so ignore my comment.
Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On 1/19/23 6:22 AM, Nicholas Piggin wrote: On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote: On Jan 18, 2023, at 12:00 AM, Nicholas Piggin wrote: +static void do_shoot_lazy_tlb(void *arg) +{ + struct mm_struct *mm = arg; + + if (current->active_mm == mm) { + WARN_ON_ONCE(current->mm); + current->active_mm = &init_mm; + switch_mm(mm, &init_mm, current); + } +} I might be out of touch - doesn’t a flush already take place when we free the page-tables, at least on common cases on x86? IIUC exit_mmap() would free page-tables, and whenever page-tables are freed, on x86, we do shootdown regardless to whether the target CPU TLB state marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and everything should be fine, no? [ I understand you care about powerpc, just wondering on the effect on x86 ] Now I come to think of it, Rik had done this for x86 a while back. https://lore.kernel.org/all/20180728215357.3249-10-r...@surriel.com/ I didn't know about it when I wrote this, so I never dug into why it didn't get merged. It might have missed the final __mmdrop races but I'm not not sure, x86 lazy tlb mode is too complicated to know at a glance. I would check with him though. My point was that naturally (i.e., as done today), when exit_mmap() is done, you release the page tables (not just the pages). On x86 it means that you also send shootdown IPI to all the *lazy* CPUs to perform a flush, so they would exit the lazy mode. [ this should be true for 99% of the cases, excluding cases where there were not page-tables, for instance ] So the patch of Rik, I think, does not help in the common cases, although it may perhaps make implicit actions more explicit in the code.
Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text
> On 11 Apr 2024, at 19:05, Mike Rapoport wrote: > > @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const > struct load_info *info) > add_kallsyms(mod, info); > > /* Arch-specific module finalizing. */ > - return module_finalize(info->hdr, info->sechdrs, mod); > + ret = module_finalize(info->hdr, info->sechdrs, mod); > + if (ret) > + return ret; > + > + for_each_mod_mem_type(type) { > + struct module_memory *mem = &mod->mem[type]; > + > + if (mem->is_rox) { > + if (!execmem_update_copy(mem->base, mem->rw_copy, > + mem->size)) > + return -ENOMEM; > + > + vfree(mem->rw_copy); > + mem->rw_copy = NULL; > + } > + } > + > + return 0; > } I might be missing something, but it seems a bit racy. IIUC, module_finalize() calls alternatives_smp_module_add(). At this point, since you don’t hold the text_mutex, some might do text_poke(), e.g., by enabling/disabling static-key, and the update would be overwritten. No?
Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text
> On 18 Apr 2024, at 13:20, Mike Rapoport wrote: > > On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote: >> >> >> >> I might be missing something, but it seems a bit racy. >> >> IIUC, module_finalize() calls alternatives_smp_module_add(). At this >> point, since you don’t hold the text_mutex, some might do text_poke(), >> e.g., by enabling/disabling static-key, and the update would be >> overwritten. No? > > Right :( > Even worse, for UP case alternatives_smp_unlock() will "patch" still empty > area. > > So I'm thinking about calling alternatives_smp_module_add() from an > additional callback after the execmem_update_copy(). > > Does it make sense to you? Going over the code again - I might have just been wrong: I confused the alternatives and the jump-label mechanisms (as they do share a lot of code and characteristics). The jump-labels are updated when prepare_coming_module() is called, which happens after post_relocation() [which means they would be updated using text_poke() “inefficiently” but should be safe]. The “alternatives” appear only to use text_poke() (in contrast to text_poke_early()) from a few very specific flows, e.g., common_cpu_up() -> alternatives_enable_smp(). Do those flows pose a problem after boot? Anyhow, sorry for the noise.
Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
On Nov 2, 2022, at 12:12 PM, David Hildenbrand wrote: > !! External Email > > commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a > NUMA hinting fault") added remembering write permissions using ordinary > pte_write() for PROT_NONE mapped pages to avoid write faults when > remapping the page !PROT_NONE on NUMA hinting faults. > [ snip ] Here’s a very shallow reviewed with some minor points... > --- > include/linux/mm.h | 2 ++ > mm/huge_memory.c | 28 +--- > mm/ksm.c | 9 - > mm/memory.c| 19 --- > mm/mprotect.c | 7 ++- > 5 files changed, 41 insertions(+), 24 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 25ff9a14a777..a0deeece5e87 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct > vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ >MM_CP_UFFD_WP_RESOLVE) > > +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > +pte_t pte); It might not be customary, but how about marking it as __pure? > extern unsigned long change_protection(struct mmu_gather *tlb, > struct vm_area_struct *vma, unsigned long start, > unsigned long end, pgprot_t newprot, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 2ad68e91896a..45abd27d75a0 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) >unsigned long haddr = vmf->address & HPAGE_PMD_MASK; >int page_nid = NUMA_NO_NODE; >int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK); > - bool migrated = false; > - bool was_writable = pmd_savedwrite(oldpmd); > + bool try_change_writable, migrated = false; >int flags = 0; > >vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); > @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) >goto out; >} > > + /* See mprotect_fixup(). */ > + if (vma->vm_flags & VM_SHARED) > + try_change_writable = vma_wants_writenotify(vma, > vma->vm_page_prot); > + else > + try_change_writable = !!(vma->vm_flags & VM_WRITE); Do you find it better to copy the code instead of extracting it to a separate function? > + >pmd = pmd_modify(oldpmd, vma->vm_page_prot); >page = vm_normal_page_pmd(vma, haddr, pmd); >if (!page) >goto out_map; > >/* See similar comment in do_numa_page for explanation */ > - if (!was_writable) > + if (try_change_writable && !pmd_write(pmd) && > +can_change_pmd_writable(vma, vmf->address, pmd)) > + pmd = pmd_mkwrite(pmd); > + if (!pmd_write(pmd)) >flags |= TNF_NO_GROUP; > >page_nid = page_to_nid(page); > @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) >/* Restore the PMD */ >pmd = pmd_modify(oldpmd, vma->vm_page_prot); >pmd = pmd_mkyoung(pmd); > - if (was_writable) > + > + /* Similar to mprotect() protection updates, avoid write faults. */ > + if (try_change_writable && !pmd_write(pmd) && > +can_change_pmd_writable(vma, vmf->address, pmd)) Why do I have a deja-vu? :) There must be a way to avoid the redundant code and specifically the call to can_change_pmd_writable(), no? 
>pmd = pmd_mkwrite(pmd); > + >set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); >update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); >spin_unlock(vmf->ptl); > @@ -1764,11 +1776,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct > vm_area_struct *vma, >struct mm_struct *mm = vma->vm_mm; >spinlock_t *ptl; >pmd_t oldpmd, entry; > - bool preserve_write; > - int ret; >bool prot_numa = cp_flags & MM_CP_PROT_NUMA; >bool uffd_wp = cp_flags & MM_CP_UFFD_WP; >bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; > + int ret = 1; > >tlb_change_page_size(tlb, HPAGE_PMD_SIZE); > > @@ -1779,9 +1790,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct > vm_area_struct *vma, >if (!ptl) >return 0; > > - preserve_write = prot_numa && pmd_write(*pmd); > - ret = 1; > - > #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION >if (is_swap_pmd(*pmd)) { >swp_entry_t entry = pmd_to_swp_entry(*pmd); > @@ -1861,8 +1869,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct > vm_area_struct *vma, >oldpmd = pmdp_invalidate_ad(vma, addr, pmd); > >entry = pmd_modify(oldpmd, newprot); > - if (preserve_write) > - entry = pmd_mk_savedwrite(entry); >if (uff
Re: [PATCH v6 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
On Nov 14, 2022, at 7:14 PM, Yicong Yang wrote: > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h > index 8a497d902c16..5bd78ae55cd4 100644 > --- a/arch/x86/include/asm/tlbflush.h > +++ b/arch/x86/include/asm/tlbflush.h > @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm) > } > > static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch > *batch, > - struct mm_struct *mm) > + struct mm_struct *mm, > + unsigned long uaddr) Logic-wise it looks fine. I notice the “v6", and it should not be blocking, but I would note that the name "arch_tlbbatch_add_mm()” does not make much sense once the function also takes an address. It could’ve been something like arch_set_tlb_ubc_flush_pending() but that’s too long. I’m not very good with naming, but the current name is not great.
Re: [PATCH v6 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
On Nov 15, 2022, at 5:50 PM, Yicong Yang wrote: > !! External Email > > On 2022/11/16 7:38, Nadav Amit wrote: >> On Nov 14, 2022, at 7:14 PM, Yicong Yang wrote: >> >>> diff --git a/arch/x86/include/asm/tlbflush.h >>> b/arch/x86/include/asm/tlbflush.h >>> index 8a497d902c16..5bd78ae55cd4 100644 >>> --- a/arch/x86/include/asm/tlbflush.h >>> +++ b/arch/x86/include/asm/tlbflush.h >>> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm) >>> } >>> >>> static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch >>> *batch, >>> -struct mm_struct *mm) >>> +struct mm_struct *mm, >>> +unsigned long uaddr) >> >> Logic-wise it looks fine. I notice the “v6", and it should not be blocking, >> but I would note that the name "arch_tlbbatch_add_mm()” does not make much >> sense once the function also takes an address. > > ok the add_mm should still apply to x86 since the address is not used, but > not for arm64. > >> It could’ve been something like arch_set_tlb_ubc_flush_pending() but that’s >> too long. I’m not very good with naming, but the current name is not great. > > What about arch_tlbbatch_add_pending()? Considering the x86 is pending the > flush operation > while arm64 is pending the sychronization operation, > arch_tlbbatch_add_pending() should > make sense to both. Sounds reasonable. Thanks.
Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin wrote: > > +static void do_shoot_lazy_tlb(void *arg) > +{ > + struct mm_struct *mm = arg; > + > + if (current->active_mm == mm) { > + WARN_ON_ONCE(current->mm); > + current->active_mm = &init_mm; > + switch_mm(mm, &init_mm, current); > + } > +} I might be out of touch - doesn’t a flush already take place when we free the page-tables, at least on common cases on x86? IIUC exit_mmap() would free page-tables, and whenever page-tables are freed, on x86, we do shootdown regardless to whether the target CPU TLB state marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and everything should be fine, no? [ I understand you care about powerpc, just wondering on the effect on x86 ]
Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code
I am not very familiar with membarrier, but here are my 2 cents while trying to answer your questions. > On Dec 3, 2020, at 9:26 PM, Andy Lutomirski wrote: > @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct > mm_struct *next, >* from one thread in a process to another thread in the same >* process. No TLB flush required. >*/ > + > + // XXX: why is this okay wrt membarrier? > if (!was_lazy) > return; I am confused. On one hand, it seems that membarrier_private_expedited() would issue an IPI to that core, as it would find that this core’s cpu_rq(cpu)->curr->mm is the same as the one that the membarrier applies to. But… (see below) > @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct > mm_struct *next, > smp_mb(); > next_tlb_gen = atomic64_read(&next->context.tlb_gen); > if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == > - next_tlb_gen) > + next_tlb_gen) { > + /* > + * We're reactivating an mm, and membarrier might > + * need to serialize. Tell membarrier. > + */ > + > + // XXX: I can't understand the logic in > + // membarrier_mm_sync_core_before_usermode(). What's > + // the mm check for? > + membarrier_mm_sync_core_before_usermode(next); On the other hand the reason for this mm check that you mention contradicts my previous understanding as the git log says: commit 2840cf02fae627860156737e83326df354ee4ec6 Author: Mathieu Desnoyers Date: Thu Sep 19 13:37:01 2019 -0400 sched/membarrier: Call sync_core only before usermode for same mm When the prev and next task's mm change, switch_mm() provides the core serializing guarantees before returning to usermode. The only case where an explicit core serialization is needed is when the scheduler keeps the same mm for prev and next. > /* >* When switching through a kernel thread, the loop in >* membarrier_{private,global}_expedited() may have observed that >* kernel thread and not issued an IPI. It is therefore possible to >* schedule between user->kernel->user threads without passing though >* switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > + * rq->curr, before returning to userspace, and mmdrop() provides > + * this barrier. >* > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * XXX: I don't think mmdrop() actually does this. There's no > + * smp_mb__before/after_atomic() in there. I presume that since x86 is the only one that needs membarrier_mm_sync_core_before_usermode(), nobody noticed the missing smp_mb__before/after_atomic(). These are anyhow a compiler barrier in x86, and such a barrier would take place before the return to userspace.
[RFC 00/20] TLB batching consolidation and enhancements
From: Nadav Amit There are currently (at least?) 5 different TLB batching schemes in the kernel: 1. Using mmu_gather (e.g., zap_page_range()). 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the ongoing deferred TLB flush and flushing the entire range eventually (e.g., change_protection_range()). 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). 4. Batching per-table flushes (move_ptes()). 5. By setting a flag on that a deferred TLB flush operation takes place, flushing when (try_to_unmap_one() on x86). It seems that (1)-(4) can be consolidated. In addition, it seems that (5) is racy. It also seems there can be many redundant TLB flushes, and potentially TLB-shootdown storms, for instance during batched reclamation (using try_to_unmap_one()) if at the same time mmu_gather defers TLB flushes. More aggressive TLB batching may be possible, but this patch-set does not add such batching. The proposed changes would enable such batching in a later time. Admittedly, I do not understand how things are not broken today, which frightens me to make further batching before getting things in order. For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes for each page-table (but not in greater granularity). Can't ClearPageDirty() be called before the flush, causing writes after ClearPageDirty() and before the flush to be lost? This patch-set therefore performs the following changes: 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather instead of {inc|dec}_tlb_flush_pending(). 2. Avoid TLB flushes if PTE permission is not demoted. 3. Cleans up mmu_gather to be less arch-dependant. 4. Uses mm's generations to track in finer granularity, either per-VMA or per page-table, whether a pending mmu_gather operation is outstanding. This should allow to avoid some TLB flushes when KSM or memory reclamation takes place while another operation such as munmap() or mprotect() is running. 5. Changes try_to_unmap_one() flushing scheme, as the current seems broken to track in a bitmap which CPUs have outstanding TLB flushes instead of having a flag. Further optimizations are possible, such as changing move_ptes() to use mmu_gather. The patches were very very lightly tested. I am looking forward for your feedback regarding the overall approaches, and whether to split them into multiple patch-sets. 
Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Dave Hansen Cc: linux-c...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: Mel Gorman Cc: Nick Piggin Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Cc: x...@kernel.org Cc: Yu Zhao Nadav Amit (20): mm/tlb: fix fullmm semantics mm/mprotect: use mmu_gather mm/mprotect: do not flush on permission promotion mm/mapping_dirty_helpers: use mmu_gather mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h fs/task_mmu: use mmu_gather interface of clear-soft-dirty mm: move x86 tlb_gen to generic code mm: store completed TLB generation mm: create pte/pmd_tlb_flush_pending() mm: add pte_to_page() mm/tlb: remove arch-specific tlb_start/end_vma() mm/tlb: save the VMA that is flushed during tlb_start_vma() mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() mm: move inc/dec_tlb_flush_pending() to mmu_gather.c mm: detect deferred TLB flushes in vma granularity mm/tlb: per-page table generation tracking mm/tlb: updated completed deferred TLB flush conditionally mm: make mm_cpumask() volatile lib/cpumask: introduce cpumask_atomic_or() mm/rmap: avoid potential races arch/arm/include/asm/bitops.h | 4 +- arch/arm/include/asm/pgtable.h| 4 +- arch/arm64/include/asm/pgtable.h | 4 +- arch/csky/Kconfig | 1 + arch/csky/include/asm/tlb.h | 12 -- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/tlb.h| 2 - arch/s390/Kconfig | 1 + arch/s390/include/asm/tlb.h | 3 - arch/sparc/Kconfig| 1 + arch/sparc/include/asm/pgtable_64.h | 9 +- arch/sparc/include/asm/tlb_64.h | 2 - arch/sparc/mm/init_64.c | 2 +- arch/x86/Kconfig | 3 + arch/x86/hyperv/mmu.c | 2 +- arch/x86/include/asm/mmu.h| 10 - arch/x86/include/asm/mmu_context.h| 1 - arch/x86/include/asm/paravirt_types.h | 2 +- arch/x86/include/asm/pgtable.h| 24 +-- arch/x86/include/asm/tlb.h| 21 +- arch/x86/include/asm/tlbbatch.h | 15 -- arch/x86/include/asm/tlbflush.h | 61 -- arch/x86/mm/tlb.c | 52 +++-- arch/x86/xen/mmu_pv.c | 2 +- drivers/firmware/efi/efi.c| 1 + fs/proc/task_mmu.c| 29 ++- include/asm-generic/bitops/find.h | 8 +- include/asm-generic/tlb.h
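For readers less familiar with batching scheme (2) from the cover letter above, it boils down to the following pattern (heavily simplified sketch; locking, error handling and the actual PTE walk are omitted, and the function name is invented):

static void change_prot_batched(struct vm_area_struct *vma,
				unsigned long start, unsigned long end,
				pgprot_t newprot)
{
	struct mm_struct *mm = vma->vm_mm;

	/* Tell concurrent users (e.g., KSM, reclaim) a flush is deferred. */
	inc_tlb_flush_pending(mm);

	/* ... walk the page tables and rewrite PTEs with newprot ... */

	/* One ranged flush for everything that was changed above. */
	flush_tlb_range(vma, start, end);
	dec_tlb_flush_pending(mm);
}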
[RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
From: Nadav Amit Architecture-specific tlb_start_vma() and tlb_end_vma() seem unnecessary. They are currently used for: 1. Avoid per-VMA TLB flushes. This can be determined by introducing a new config option. 2. Avoid saving information on the vma that is being flushed. Saving this information, even for architectures that do not need it, is cheap and we will need it for per-VMA deferred TLB flushing. 3. Avoid calling flush_cache_range(). Remove the architecture specific tlb_start_vma() and tlb_end_vma() in the following manner, corresponding to the previous requirements: 1. Introduce a new config option - ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING - to allow architectures to define whether they want aggressive TLB flush batching (instead of flushing mappings of each VMA separately). 2. Save information on the vma regardless of architecture. Saving this information should have negligible overhead, and they will be needed for fine granularity TLB flushes. 3. flush_cache_range() is anyhow not defined for the architectures that implement tlb_start/end_vma(). No functional change intended. Signed-off-by: Nadav Amit Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Dave Hansen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Cc: Yu Zhao Cc: Nick Piggin Cc: linux-c...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: x...@kernel.org --- arch/csky/Kconfig | 1 + arch/csky/include/asm/tlb.h | 12 arch/powerpc/Kconfig| 1 + arch/powerpc/include/asm/tlb.h | 2 -- arch/s390/Kconfig | 1 + arch/s390/include/asm/tlb.h | 3 --- arch/sparc/Kconfig | 1 + arch/sparc/include/asm/tlb_64.h | 2 -- arch/x86/Kconfig| 1 + arch/x86/include/asm/tlb.h | 3 --- include/asm-generic/tlb.h | 15 +-- init/Kconfig| 8 12 files changed, 18 insertions(+), 32 deletions(-) diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 89dd2fcf38fa..924ff5721240 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -8,6 +8,7 @@ config CSKY select ARCH_HAS_SYNC_DMA_FOR_DEVICE select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_QUEUED_RWLOCKS if NR_CPUS>2 + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_FRAME_POINTERS if !CPU_CK610 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT select COMMON_CLK diff --git a/arch/csky/include/asm/tlb.h b/arch/csky/include/asm/tlb.h index fdff9b8d70c8..8130a5f09a6b 100644 --- a/arch/csky/include/asm/tlb.h +++ b/arch/csky/include/asm/tlb.h @@ -6,18 +6,6 @@ #include -#define tlb_start_vma(tlb, vma) \ - do { \ - if (!(tlb)->fullmm) \ - flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \ - } while (0) - -#define tlb_end_vma(tlb, vma) \ - do { \ - if (!(tlb)->fullmm) \ - flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \ - } while (0) - #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm) #include diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 107bb4319e0e..d9761b6f192a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -151,6 +151,7 @@ config PPC select ARCH_USE_CMPXCHG_LOCKREF if PPC64 select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_IPC_PARSE_VERSION select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select ARCH_WANT_LD_ORPHAN_WARN diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index 160422a439aa..880b7daf904e 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -19,8 +19,6 @@ #include -#define tlb_start_vma(tlb, vma)do { } while (0) -#define 
tlb_end_vma(tlb, vma) do { } while (0) #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry #define tlb_flush tlb_flush diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c72874f09741..5b3dc5ca9873 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -113,6 +113,7 @@ config S390 select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_WANTS_DYNAMIC_TASK_STRUCT + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_IPC_PARSE_VERSION select BUILDTIME_TABLE_SORT diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 954fa8ca6cbd..03f31d59f97c 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -27,9 +27,6 @@ static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
Re: [RFC 00/20] TLB batching consolidation and enhancements
> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski wrote: > > On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit wrote: >> From: Nadav Amit >> >> There are currently (at least?) 5 different TLB batching schemes in the >> kernel: >> >> 1. Using mmu_gather (e.g., zap_page_range()). >> >> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the >> ongoing deferred TLB flush and flushing the entire range eventually >> (e.g., change_protection_range()). >> >> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). >> >> 4. Batching per-table flushes (move_ptes()). >> >> 5. By setting a flag on that a deferred TLB flush operation takes place, >> flushing when (try_to_unmap_one() on x86). > > Are you referring to the arch_tlbbatch_add_mm/flush mechanism? Yes.
Re: [RFC 00/20] TLB batching consolidation and enhancements
> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin wrote: > > Excerpts from Nadav Amit's message of January 31, 2021 10:11 am: >> From: Nadav Amit >> >> There are currently (at least?) 5 different TLB batching schemes in the >> kernel: >> >> 1. Using mmu_gather (e.g., zap_page_range()). >> >> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the >> ongoing deferred TLB flush and flushing the entire range eventually >> (e.g., change_protection_range()). >> >> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). >> >> 4. Batching per-table flushes (move_ptes()). >> >> 5. By setting a flag on that a deferred TLB flush operation takes place, >> flushing when (try_to_unmap_one() on x86). >> >> It seems that (1)-(4) can be consolidated. In addition, it seems that >> (5) is racy. It also seems there can be many redundant TLB flushes, and >> potentially TLB-shootdown storms, for instance during batched >> reclamation (using try_to_unmap_one()) if at the same time mmu_gather >> defers TLB flushes. >> >> More aggressive TLB batching may be possible, but this patch-set does >> not add such batching. The proposed changes would enable such batching >> in a later time. >> >> Admittedly, I do not understand how things are not broken today, which >> frightens me to make further batching before getting things in order. >> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes >> for each page-table (but not in greater granularity). Can't >> ClearPageDirty() be called before the flush, causing writes after >> ClearPageDirty() and before the flush to be lost? > > Because it's holding the page table lock which stops page_mkclean from > cleaning the page. Or am I misunderstanding the question? Thanks. I understood this part. Looking again at the code, I now understand my confusion: I forgot that the reverse mapping is removed after the PTE is zapped. Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(), by performing set_page_dirty() for the batched pages when needed in tlb_finish_mmu() [ i.e., by marking for each batched page whether set_page_dirty() should be issued for that page while collecting them ]. > I'll go through the patches a bit more closely when they all come > through. Sparc and powerpc of course need the arch lazy mode to get > per-page/pte information for operations that are not freeing pages, > which is what mmu gather is designed for. IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE where no previous PTE was set, right? > I wouldn't mind using a similar API so it's less of a black box when > reading generic code, but it might not quite fit the mmu gather API > exactly (most of these paths don't want a full mmu_gather on stack). I see your point. It may be possible to create two mmu_gather structs: a small one that only holds the flush information and another that also holds the pages. >> This patch-set therefore performs the following changes: >> >> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather >> instead of {inc|dec}_tlb_flush_pending(). >> >> 2. Avoid TLB flushes if PTE permission is not demoted. >> >> 3. Cleans up mmu_gather to be less arch-dependant. >> >> 4. Uses mm's generations to track in finer granularity, either per-VMA >> or per page-table, whether a pending mmu_gather operation is >> outstanding. This should allow to avoid some TLB flushes when KSM or >> memory reclamation takes place while another operation such as >> munmap() or mprotect() is running. 
>> >> 5. Changes try_to_unmap_one() flushing scheme, as the current seems >> broken to track in a bitmap which CPUs have outstanding TLB flushes >> instead of having a flag. > > Putting fixes first, and cleanups and independent patches (like #2) next > would help with getting stuff merged and backported. I tried to do it mostly this way. There are some theoretical races which I did not manage (or try hard enough) to create, so I did not include in the “fixes” section. I will restructure the patch-set according to the feedback. Thanks, Nadav
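To illustrate the "two mmu_gather structs" idea mentioned earlier in this reply: a small flush-only gather that can live cheaply on the stack, embedded in the full page-batching one. The field layout below is invented for illustration and does not match what the kernel defines.

struct mmu_flush_gather {		/* cheap: only what tlb_flush() needs */
	struct mm_struct	*mm;
	unsigned long		start;
	unsigned long		end;
	unsigned int		freed_tables : 1;
	unsigned int		cleared_ptes : 1;
};

struct mmu_gather {			/* full: also batches pages to free */
	struct mmu_flush_gather	flush;
	/* ... page batches, as in today's mmu_gather ... */
};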
Re: [RFC 00/20] TLB batching consolidation and enhancements
> On Jan 30, 2021, at 11:57 PM, Nadav Amit wrote: > >> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin wrote: >> >> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am: >>> From: Nadav Amit >>> >>> There are currently (at least?) 5 different TLB batching schemes in the >>> kernel: >>> >>> 1. Using mmu_gather (e.g., zap_page_range()). >>> >>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the >>> ongoing deferred TLB flush and flushing the entire range eventually >>> (e.g., change_protection_range()). >>> >>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). >>> >>> 4. Batching per-table flushes (move_ptes()). >>> >>> 5. By setting a flag on that a deferred TLB flush operation takes place, >>> flushing when (try_to_unmap_one() on x86). >>> >>> It seems that (1)-(4) can be consolidated. In addition, it seems that >>> (5) is racy. It also seems there can be many redundant TLB flushes, and >>> potentially TLB-shootdown storms, for instance during batched >>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather >>> defers TLB flushes. >>> >>> More aggressive TLB batching may be possible, but this patch-set does >>> not add such batching. The proposed changes would enable such batching >>> in a later time. >>> >>> Admittedly, I do not understand how things are not broken today, which >>> frightens me to make further batching before getting things in order. >>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes >>> for each page-table (but not in greater granularity). Can't >>> ClearPageDirty() be called before the flush, causing writes after >>> ClearPageDirty() and before the flush to be lost? >> >> Because it's holding the page table lock which stops page_mkclean from >> cleaning the page. Or am I misunderstanding the question? > > Thanks. I understood this part. Looking again at the code, I now understand > my confusion: I forgot that the reverse mapping is removed after the PTE is > zapped. > > Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(), > by performing set_page_dirty() for the batched pages when needed in > tlb_finish_mmu() [ i.e., by marking for each batched page whether > set_page_dirty() should be issued for that page while collecting them ]. Correcting myself (I hope): no we cannot do so, since the buffers might be remove from the page at that point.
Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
> On Feb 1, 2021, at 10:41 PM, Nicholas Piggin wrote: > > Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm: >> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does. >> How about: >> >> CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH > > Yes please, have to have descriptive names. Point taken. I will fix it. > > I didn't quite see why this was much of an improvement though. Maybe > follow up patches take advantage of it? I didn't see how they all fit > together. They do, but I realized as I said in other emails that I have a serious bug in the deferred invalidation scheme. Having said that, I think there is an advantage of having an explicit config option instead of relying on whether tlb_end_vma is defined. For instance, Arm does not define tlb_end_vma, and consequently it flushes the TLB after each VMA. I suspect it is not intentional.
Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
> On Feb 2, 2021, at 1:31 AM, Peter Zijlstra wrote: > > On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote: >> Arm does not define tlb_end_vma, and consequently it flushes the TLB after >> each VMA. I suspect it is not intentional. > > ARM is one of those that look at the VM_EXEC bit to explicitly flush > ITLB IIRC, so it has to. Hmm… I don’t think Arm is doing that. At least arm64 does not use the default tlb_flush(), and it does not seem to consider VM_EXEC (at least in this path): static inline void tlb_flush(struct mmu_gather *tlb) { struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0); bool last_level = !tlb->freed_tables; unsigned long stride = tlb_get_unmap_size(tlb); int tlb_level = tlb_get_level(tlb); /* * If we're tearing down the address space then we only care about * invalidating the walk-cache, since the ASID allocator won't * reallocate our ASID without invalidating the entire TLB. */ if (tlb->mm_exiting) { if (!last_level) flush_tlb_mm(tlb->mm); return; } __flush_tlb_range(&vma, tlb->start, tlb->end, stride, last_level, tlb_level); }
Re: [PATCH 16/16] mm: pass get_user_pages_fast iterator arguments in a structure
> On Jun 11, 2019, at 5:52 PM, Nicholas Piggin wrote: > > Christoph Hellwig's on June 12, 2019 12:41 am: >> Instead of passing a set of always repeated arguments down the >> get_user_pages_fast iterators, create a struct gup_args to hold them and >> pass that by reference. This leads to an over 100 byte .text size >> reduction for x86-64. > > What does this do for performance? I've found this pattern can be > bad for store aliasing detection. Note that sometimes such an optimization can also have an adverse effect due to the stack protector code that gcc emits when you use such structs. Matthew Wilcox encountered such a case: https://patchwork.kernel.org/patch/10702741/
Re: [Pv-drivers] [PATCH 29/36] cpuidle, xenpv: Make more PARAVIRT_XXL noinstr clean
Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
On Aug 17, 2022, at 12:17 AM, Huang, Ying wrote: > Alistair Popple writes: > >> Peter Xu writes: >> >>> On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote: Peter Xu writes: > On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote: >>> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>bool anon_exclusive; >>>pte_t swp_pte; >>> >>> + flush_cache_page(vma, addr, pte_pfn(*ptep)); >>> + pte = ptep_clear_flush(vma, addr, ptep); >> >> Although I think it's possible to batch the TLB flushing just before >> unlocking PTL. The current code looks correct. > > If we're with unconditionally ptep_clear_flush(), does it mean we should > probably drop the "unmapped" and the last flush_tlb_range() already since > they'll be redundant? This patch does that, unless I missed something? >>> >>> Yes it does. Somehow I didn't read into the real v2 patch, sorry! >>> > If that'll need to be dropped, it looks indeed better to still keep the > batch to me but just move it earlier (before unlock iiuc then it'll be > safe), then we can keep using ptep_get_and_clear() afaiu but keep "pte" > updated. I think we would also need to check should_defer_flush(). Looking at try_to_unmap_one() there is this comment: if (should_defer_flush(mm, flags) && !anon_exclusive) { /* * We clear the PTE but do not flush so potentially * a remote CPU could still be writing to the folio. * If the entry was previously clean then the * architecture must guarantee that a clear->dirty * transition on a cached TLB entry is written through * and traps if the PTE is unmapped. */ And as I understand it we'd need the same guarantee here. Given try_to_migrate_one() doesn't do batched TLB flushes either I'd rather keep the code as consistent as possible between migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at introducing TLB flushing for both in some future patch series. >>> >>> should_defer_flush() is TTU-specific code? >> >> I'm not sure, but I think we need the same guarantee here as mentioned >> in the comment otherwise we wouldn't see a subsequent CPU write that >> could dirty the PTE after we have cleared it but before the TLB flush. >> >> My assumption was should_defer_flush() would ensure we have that >> guarantee from the architecture, but maybe there are alternate/better >> ways of enforcing that? >>> IIUC the caller sets TTU_BATCH_FLUSH showing that tlb can be omitted since >>> the caller will be responsible for doing it. In migrate_vma_collect_pmd() >>> iiuc we don't need that hint because it'll be flushed within the same >>> function but just only after the loop of modifying the ptes. Also it'll be >>> with the pgtable lock held. >> >> Right, but the pgtable lock doesn't protect against HW PTE changes such >> as setting the dirty bit so we need to ensure the HW does the right >> thing here and I don't know if all HW does. > > This sounds sensible. But I take a look at zap_pte_range(), and find > that it appears that the implementation requires the PTE dirty bit to be > write-through. Do I miss something? > > Hi, Nadav, Can you help? Sorry for joining the discussion late. I read most ofthis thread and I hope I understand what you ask me. So at the risk of rehashing or repeating what you already know - here are my 2 cents. Feel free to ask me again if I did not understand your questions: 1. ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH is currently x86 specific. There is a recent patch that want to use it for arm64 as well [1]. 
The assumption that Alistair cited from the code (regarding should_defer_flush()) might not be applicable to certain architectures (although most likely it is). I tried to encapsulate the logic on whether an unflushed RO entry can become dirty in an arch specific function [2]. 2. Having said all of that, using the logic of “flush if there are pending TLB flushes for this mm” as done by UNMAP_TLB_FLUSH makes sense IMHO (although I would have considered doing it in finer granularity of VMA/page-table as I proposed before and got a somewhat lukewarm response [3]). 3. There is no question that flushing after dropping the ptl is wrong. But reading the thread, I think that you only focus on whether a dirty indication might get lost. The problem, I think, is bigger, as it might also cause correctness problems after concurrently removing mappings. What happens if we get for a clean PTE something like: CPU0