On Thu, Jun 11, 2020 at 4:53 PM Paul E. McKenney <paul...@kernel.org> wrote: > > RCU needs to detect when one if its interrupt handlers interrupted an idle > state, where an idle state is either the idle loop itself or nohz_full > userspace execution. When a CPU has been interrupted from one of these > idle states, RCU can report a quiescent state, helping the current grace > period to come to an end. > > However, there are optimized kernel-entry paths where handlers that > would normally be executed in interrupt context are instead executed > directly from the base process context, but with interrupts disabled. > When such a directly-executed handler is followed by softirq processing > (which re-enables interrupts), it is possible that one of RCU's interrupt > handlers will interrupt this softirq processing.
Why is the softirq processing special? ISTM the basic sequence of events is: idle, RCU not watching, IRQs on because we're using a lousy idle mechanism that requires IRQs on. IRQ hits. The ones you're describing are the DEFINE_IDTENTRY_SYSVEC_SIMPLE ones, but I'm not sure why it would matter. IRQ does idtentry_enter_cond_rcu(). idtentry_entry_cond_rcu() sees __rcu_is_watching() return true and does not do rcu_irq_enter(). softirq processing turns on IRQs, a new IRQ hits, and kaboom. But this is a bit of a strange explanation. The SYSVEC_SIMPLE paths don't seem to do irq_exit_rcu(), so there's no softirq processing unless I missed something. But more importantly, what are we doing making it through idtentry_enter_cond_rcu() with rcu not watching at the end? If we hit the idle loop, surely __rcu_is_watching() should have returned false, right? So I added a little warning to detect the case where your patch actually changes something (I don't believe for a second that this warning should be committed -- it's just to get the stack trace): diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index f0b657097a2a..77330b96d8e5 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -592,6 +592,8 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) return true; } + WARN_ON_ONCE(system_state == SYSTEM_RUNNING && is_idle_task(current)); + /* * If RCU is watching then RCU only wants to check * whether it needs to restart the tick in NOHZ And I get this: [ 1.054927] ------------[ cut here ]------------ [ 1.054928] WARNING: CPU: 1 PID: 0 at arch/x86/entry/common.c:595 idtentry_enter_cond_rcu+0x86/0xa0 [ 1.054929] Modules linked in: [ 1.054930] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.7.0+ #145 [ 1.054931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 1.054931] RIP: 0010:idtentry_enter_cond_rcu+0x86/0xa0 [ 1.054933] Code: 7c 24 08 e8 3c 3e 00 00 e8 57 a5 6e ff e8 d2 49 00 00 84 c0 74 19 90 31 c0 5b c3 65 48 8b 04 25 c0 7e 01 00 f6 40 24 02 74 99 <0f> 0b 90 eb 94 0f 0b 90 eb e2 0f 0b 90 eb 9d 66 66 2e 0f 1f 84 00 [ 1.054933] RSP: 0000:ffffc9000006bc10 EFLAGS: 00010002 [ 1.054934] RAX: ffff88807d529800 RBX: ffffc9000006bc38 RCX: ffffffff81c01327 [ 1.054935] RDX: 0000000000000000 RSI: ffffffff81c00f49 RDI: ffffc9000006bc38 [ 1.054935] RBP: ffffc9000006bc38 R08: 0000000000000000 R09: 0000000000000000 [ 1.054936] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1.054936] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1.054937] FS: 0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 [ 1.054937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.054938] CR2: 00000000ffffffff CR3: 0000000002620001 CR4: 0000000000360ee0 [ 1.054938] Call Trace: [ 1.054939] sysvec_call_function_single+0xa/0xd0 [ 1.054939] asm_sysvec_call_function_single+0x31/0x40 [ 1.054940] RIP: 0010:__do_softirq+0xaa/0x437 [ 1.054941] Code: 24 2c 0a 00 00 00 48 89 44 24 10 48 89 44 24 20 44 89 34 24 65 66 c7 05 22 b3 22 7e 00 00 e8 ad dc 40 ff fb 66 0f 1f 44 00 00 <b8> ff ff ff ff 48 c7 c6 00 51 60 82 0f bc 04 24 83 c0 01 49 89 f7 [ 1.054941] RSP: 0000:ffffc9000006bd18 EFLAGS: 00000202 [ 1.054942] RAX: 0000000000001a6e RBX: ffff88807d529800 RCX: 0000000000000000 [ 1.054943] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81e000a3 [ 1.054944] RBP: ffffc9000006bdc8 R08: 0000000000000000 R09: 0000000000000000 [ 1.054944] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 1.054945] R13: 0000000000000000 R14: 0000000000000082 R15: 0000000000000000 [ 1.054945] ? __do_softirq+0xa3/0x437 [ 1.054945] ? __do_softirq+0xaa/0x437 [ 1.054946] ? __do_softirq+0xa3/0x437 [ 1.054946] ? update_ts_time_stats+0x4c/0x70 [ 1.054947] irq_exit_rcu+0xaf/0xc0 [ 1.054947] sysvec_apic_timer_interrupt+0x51/0xd0 [ 1.054948] asm_sysvec_apic_timer_interrupt+0x31/0x40 [ 1.054948] RIP: 0010:native_safe_halt+0xe/0x10 [ 1.054949] Code: 8b 00 a8 08 74 80 eb c2 cc cc cc cc e9 07 00 00 00 0f 00 2d 46 bf 4e 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 36 bf 4e 00 fb f4 <c3> cc 41 54 55 53 48 83 ec 10 e8 43 b1 66 ff 65 8b 2d bc 07 4e 7e [ 1.054950] RSP: 0000:ffffc9000006bea0 EFLAGS: 00000206 [ 1.054951] RAX: 0000000000001a69 RBX: 0000000000000001 RCX: 0000000000000000 [ 1.054951] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81b30789 [ 1.054952] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [ 1.054952] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88807d529800 [ 1.054953] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88807d529800 [ 1.054953] ? default_idle+0x19/0x160 [ 1.054954] ? native_safe_halt+0xe/0x10 [ 1.054954] default_idle+0x1e/0x160 [ 1.054955] do_idle+0x1f5/0x270 [ 1.054955] ? _raw_spin_unlock_irqrestore+0x3a/0x50 [ 1.054956] ? trace_hardirqs_on+0x20/0xf0 [ 1.054956] cpu_startup_entry+0x14/0x20 [ 1.054957] start_secondary+0x147/0x170 [ 1.054957] secondary_startup_64+0xa4/0xb0 This is saying we were idle and __rcu_is_watching() correctly returned false. We got sysvec_apic_timer_interrupt(), which did rcu_irq_enter() and then turned on IRQs for softirq processing. Then we got sysvec_call_function_single() inside that, and sysvec_call_function_single() noticed that RCU was already watching and did *not* call rcu_irq_enter(). And, if sysvec_call_function_single() does rcu_is_cpu_rrupt_from_idle(), it will return true. So the issue is that RCU thinks that, since it > +static __always_inline bool rcu_needs_irq_enter(void) > +{ > + return !IS_ENABLED(CONFIG_TINY_RCU) && > + (context_tracking_enabled_cpu(smp_processor_id()) || > is_idle_task(current)); > +} x86 shouldn't need this context tracking check -- we won't even call this if we came from user mode, and we make sure we never run with IRQs on when we're in CONTEXT_USER. I think my preference would be to fix this with something more like tglx's patch but with an explanation: diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index f0b657097a2a..93fd9d6fe033 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -571,7 +571,7 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) return false; } - if (!__rcu_is_watching()) { + if (!__rcu_is_watching() || is_idle_task(current)) { /* * If RCU is not watching then the same careful * sequence vs. lockdep and tracing is required @@ -579,6 +579,18 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) * * This only happens for IRQs that hit the idle * loop, i.e. if idle is not using MWAIT. + * + * We also do this in nested IRQs in the idie task + * (e.g. we get an IRQ, we run softirqs with IRQs on after + * processing the IRQ, and we get another IRQ inside the + * softirq.) This is because RCU wants to be able to tell + * whether it's running in an IRQ context directly inside + * idle (and hence interrupted a quiescent state) or + * interrupted kernel code that was coincidentally in the + * idle task (and hence the kernel was not quiescent). RCU + * does this by counting rcu_irq_enter() levels. This won't + * prevent scheduling in an otherwise schedulable context + * because the idle task can never schedule. */ WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF)); lockdep_hardirqs_off(CALLER_ADDR0); Alternatively, perhaps rcu_is_cpu_rrupt_from_idle(void) could do something like: if (softirq_count()) return false; if we believe that the only way this could happen was due to a softirq.