On Mon, Oct 9, 2017 at 10:36 AM, Borislav Petkov <b...@alien8.de> wrote:
> On Mon, Oct 09, 2017 at 07:02:31PM +0200, Borislav Petkov wrote:
>> From 29a6426c20f25b8df767356a04727cb113e8e784 Mon Sep 17 00:00:00 2001
>> From: Andy Lutomirski <l...@kernel.org>
>> Date: Mon, 9 Oct 2017 09:50:49 -0700
>> Subject: [PATCH] x86/mm: Flush more aggressively in lazy TLB mode
>>
>> Since commit 94b1b03b519b, x86's lazy TLB mode has been all the way
>
> Write it out:
>
> Since commit
>
> 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Will do.

>
> x86's lazy...
>
>> lazy: when running a kernel thread (including the idle thread), the
>> kernel keeps using the last user mm's page tables without attempting
>> to maintain user TLB coherence at all.  From a pure semantic
>> perspective, this is fine -- kernel threads won't attempt to access
>> user pages, so having stale TLB entries doesn't matter.
>>
>> Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
>> we also allow any paging-structure caches that may exist on the CPU
>> to become incoherent.  This means that we can have a
>> paging-structure cache entry that references a freed page table, and
>> the CPU is within its rights to do a speculative page walk starting
>> at the freed page table.
>>
>> I can imagine this causing two different problems:
>>
>> - A speculative page walk starting from a bogus page table could read
>>   IO addresses.  I haven't seen any reports of this causing problems.
>>
>> - A speculative page walk that involves a bogus page table can install
>>   garbage in the TLB.  Such garbage would always be at a user VA, but
>>   some AMD CPUs have logic that triggers a machine check when it notices
>>   these bogus entries.  I've seen a couple reports of this.
>
> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB-DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

Makes sense.  It's even possible that the address is entirely bogus and
doesn't refer to anything at all.

>
>> Reinstate TLB coherence in lazy mode.  With this patch applied, we
>> do it in one of two ways.  If we have PCID, we simply switch back to
>> init_mm's page tables when we enter a kernel thread -- this seems to
>> be quite cheap except for the cost of serializing the CPU.  If we
>> don't have PCID, then we set a flag and switch to init_mm the first
>> time we would otherwise need to flush the TLB.
>>
>> /sys/kernel/debug/x86/tlb_use_lazy_mode can be changed to override
>> the default mode for benchmarking.
>
> So this should be tlb_full_lazy_mode, no?
>
> Because we are lazy before but not "all the way" as you say above...

The choices are somewhat lazy and not lazy at all.
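To spell out what the two choices mean in code, the kernel-thread entry
path boils down to roughly this -- a simplified sketch, not the literal
hunk (the real enter_lazy_tlb() in the tlb.c part of the patch has a bit
more to it):

void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
        /* Already running on init_mm's page tables: nothing to do. */
        if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
                return;

        if (static_branch_unlikely(&tlb_use_lazy_mode)) {
                /*
                 * Somewhat lazy: keep the old mm loaded and just remember
                 * that we're lazy, so the first time we would otherwise
                 * need to flush we switch to init_mm instead.
                 */
                this_cpu_write(cpu_tlbstate.is_lazy, true);
        } else {
                /* Not lazy at all: drop the user page tables right away. */
                switch_mm(NULL, &init_mm, NULL);
        }
}

The key's default is meant to track the PCID logic described in the
changelog; the debugfs file only overrides it for benchmarking.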
>
> And frankly, I'm not really crazy about this benchmarking aspect. Why
> would you have this in every kernel? I mean, it could be a patch on top
> for people to measure on boxes but it is not really worth it. I mean,
> the PCID fun only showed some small improvement in some microbenchmark
> and nothing earth-shattering so having it everywhere to get some meh
> numbers is not really worth the trouble.
>
> Besides, this TLB code is already complex as hell.

The degree of simplification I would get by removing it is basically
nil.  The debugfs code itself goes away, and a static_branch_unlikely()
turns into a static_cpu_has(), and that's it.
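To be concrete, the hot-path difference is roughly this (illustrative
only -- the knob-less variant isn't written, so the exact condition is
a guess, presumably keyed off PCID as in the changelog):

        /* what this patch does: */
        if (static_branch_unlikely(&tlb_use_lazy_mode))
                this_cpu_write(cpu_tlbstate.is_lazy, true);

        /* what it would become without the knob: */
        if (!static_cpu_has(X86_FEATURE_PCID))
                this_cpu_write(cpu_tlbstate.is_lazy, true);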
The real reason I added it is because Chris Mason volunteered to
benchmark it, and I'll send it to him once it survives a bit of review.

>
> IOW, you could simplify this by removing the static key and adding it
> with a patch on top for people who wanna play with this. For the majority
> of the systems and use cases it doesn't really matter, IMO.
>
>> In theory, we could optimize this better by only flushing the TLB in
>> lazy CPUs when a page table is freed.  Doing that would require
>> auditing the mm code to make sure that all page table freeing goes
>> through tlb_remove_page() as well as reworking some data structures
>> to implement the improved flush logic.
>>
>> Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
>> Reported-by: Markus Trippelsdorf <mar...@trippelsdorf.de>
>> Reported-by: Adam Borowski <kilob...@angband.pl>
>> Cc: Brian Gerst <brge...@gmail.com>
>> Cc: Borislav Petkov <b...@alien8.de>
>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>> Cc: "H. Peter Anvin" <h...@zytor.com>
>> Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
>> Cc: Adam Borowski <kilob...@angband.pl>
>> Cc: Andy Lutomirski <l...@kernel.org>
>> Cc: Borislav Petkov <b...@alien8.de>
>> Cc: Borislav Petkov <b...@suse.de>
>> Cc: Brian Gerst <brge...@gmail.com>
>> Cc: Daniel Borkmann <dan...@iogearbox.net>
>> Cc: Eric Biggers <ebigg...@google.com>
>> Cc: Ingo Molnar <mi...@redhat.com>
>> Cc: Kees Cook <keesc...@chromium.org>
>> Cc: Markus Trippelsdorf <mar...@trippelsdorf.de>
>> Cc: Nadav Amit <nadav.a...@gmail.com>
>> Cc: Rik van Riel <r...@redhat.com>
>> Cc: Roman Kagan <rka...@virtuozzo.com>
>> Cc: Thomas Gleixner <t...@linutronix.de>
>> Cc: lkml <linux-kernel@vger.kernel.org>
>> Cc: x86-ml <x...@kernel.org>
>> Link: http://lkml.kernel.org/r/8eccc9240041ea7cb94624cab8d07e2a6e911ba7.1507567665.git.l...@kernel.org
>> Signed-off-by: Borislav Petkov <b...@suse.de>
>> ---
>>  arch/x86/include/asm/mmu_context.h |   8 +-
>>  arch/x86/include/asm/tlbflush.h    |  24 ++++++
>>  arch/x86/mm/tlb.c                  | 153 +++++++++++++++++++++++++++----------
>>  3 files changed, 136 insertions(+), 49 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index c120b5db178a..3c856a15b98e 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
>>          DEBUG_LOCKS_WARN_ON(preemptible());
>>  }
>>
>> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> -{
>> -        int cpu = smp_processor_id();
>> -
>> -        if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> -                cpumask_clear_cpu(cpu, mm_cpumask(mm));
>> -}
>> +void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>>
>>  static inline int init_new_context(struct task_struct *tsk,
>>                                     struct mm_struct *mm)
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 4893abf7f74f..d362161d3291 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -83,6 +83,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>>  #endif
>>
>>  /*
>> + * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
>> + * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
>> + * it's false, then we immediately switch CR3 when entering a kernel thread.
>> + */
>> +DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
>> +
>> +/*
>>   * 6 because 6 should be plenty and struct tlb_state will fit in
>>   * two cache lines.
>>   */
>> @@ -105,6 +112,23 @@ struct tlb_state {
>>          u16 next_asid;
>>
>>          /*
>> +         * We can be in one of several states:
>> +         *
>> +         * - Actively using an mm.  Our CPU's bit will be set in
>> +         *   mm_cpumask(loaded_mm) and is_lazy == false;
>> +         *
>> +         * - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
>> +         *   will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>
> And this is the old lazy mode, right?
>
> I think we should state what lazy we mean ...

This is non-lazy.  It's roughly what our state was in old kernels when
we went lazy and then called leave_mm().

>
>> +         * - Lazily using a real mm.  loaded_mm != &init_mm, our bit
>> +         *   is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> +         *   We're heuristically guessing that the CR3 load we
>> +         *   skipped more than makes up for the overhead added by
>> +         *   lazy mode.
>> +         */
>> +        bool is_lazy;
>
>
>> +
>> +        /*
>>           * Access to this CR4 shadow and to H/W CR4 is protected by
>>           * disabling interrupts when modifying either one.
>>           */
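(Side note, since the interesting flush-side hunk is trimmed from the
quote: the point of is_lazy is that the common flush path can bail out
to init_mm instead of trying to keep a lazy CPU coherent.  Roughly --
a sketch, not the literal hunk:

        if (this_cpu_read(cpu_tlbstate.is_lazy)) {
                /*
                 * We haven't kept this CPU's TLB coherent, and its
                 * paging-structure caches may reference freed tables.
                 * Don't flush piecemeal; switch to init_mm, which drops
                 * the stale user mm entirely.
                 */
                switch_mm_irqs_off(NULL, &init_mm, NULL);
                return;
        }

That's the "switch to init_mm the first time we would otherwise need to
flush" part of the changelog.)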
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 49d9778376d7..658bf0090565 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -30,6 +30,8 @@
>>
>>  atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>>
>> +DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
>> +
>>  static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>>                              u16 *new_asid, bool *need_flush)
>>  {
>> @@ -80,7 +82,7 @@ void leave_mm(int cpu)
>>                  return;
>>
>>          /* Warn if we're not lazy. */
>> -        WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
>> +        WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
>>
>>          switch_mm(NULL, &init_mm, NULL);
>>  }
>> @@ -142,45 +144,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>>                  __flush_tlb_all();
>>          }
>>  #endif
>> +        this_cpu_write(cpu_tlbstate.is_lazy, false);
>>
>>          if (real_prev == next) {
>>                  VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
>>                            next->context.ctx_id);
>>
>> -                if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
>> -                        /*
>> -                         * There's nothing to do: we weren't lazy, and we
>> -                         * aren't changing our mm.  We don't need to flush
>> -                         * anything, nor do we need to update CR3, CR4, or
>> -                         * LDTR.
>> -                         */
>> -                        return;
>> -                }
>> -
>> -                /* Resume remote flushes and then read tlb_gen. */
>> -                cpumask_set_cpu(cpu, mm_cpumask(next));
>> -                next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>> -
>> -                if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
>> -                    next_tlb_gen) {
>> -                        /*
>> -                         * Ideally, we'd have a flush_tlb() variant that
>> -                         * takes the known CR3 value as input.  This would
>> -                         * be faster on Xen PV and on hypothetical CPUs
>> -                         * on which INVPCID is fast.
>> -                         */
>> -                        this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
>> -                                       next_tlb_gen);
>> -                        write_cr3(build_cr3(next, prev_asid));
>> -                        trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
>> -                                        TLB_FLUSH_ALL);
>> -                }
>> -
>>                  /*
>> -                 * We just exited lazy mode, which means that CR4 and/or LDTR
>> -                 * may be stale.  (Changes to the required CR4 and LDTR states
>> -                 * are not reflected in tlb_gen.)
>> +                 * We don't currently support having a real mm loaded without
>> +                 * our cpu set in mm_cpumask().  We have all the bookkeeping
>
> s/cpu/CPU/
>
>> +                 * in place to figure out whether we would need to flush
>> +                 * if our cpu were cleared in mm_cpumask(), but we don't
>
> ditto.
>
>> +                 * currently use it.
>>                   */
>> +                if (WARN_ON_ONCE(real_prev != &init_mm &&
>> +                                 !cpumask_test_cpu(cpu, mm_cpumask(next))))
>> +                        cpumask_set_cpu(cpu, mm_cpumask(next));
>> +
>> +                return;
>>          } else {
>>                  u16 new_asid;
>>                  bool need_flush;
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.