Excerpts from Nicholas Piggin's message of January 20, 2021 1:09 pm:
> Excerpts from Athira Rajeev's message of January 19, 2021 8:24 pm:
>>
>> [ 883.900762] watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
>> [swapper/0:0]
>> [ 883.901381] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE
>> 5.11.0-rc3+ #34
>> --
>> [ 883.901999] NIP [c0000000000168d0] replay_soft_interrupts+0x70/0x2f0
>> [ 883.902032] LR [c00000000003b2b8]
>> interrupt_exit_kernel_prepare+0x1e8/0x240
>> [ 883.902063] Call Trace:
>> [ 883.902085] [c000000001c96f50] [c00000000003b2b8]
>> interrupt_exit_kernel_prepare+0x1e8/0x240 (unreliable)
>> [ 883.902139] [c000000001c96fb0] [c00000000000fd88]
>> interrupt_return+0x158/0x200
>> [ 883.902185] --- interrupt: ea0 at __rb_reserve_next+0xc0/0x5b0
>> [ 883.902224] NIP: c0000000002d8980 LR: c0000000002d897c CTR:
>> c0000000001aad90
>> [ 883.902262] REGS: c000000001c97020 TRAP: 0ea0 Tainted: G OE
>> (5.11.0-rc3+)
>> [ 883.902301] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR:
>> 28000484 XER: 20040000
>> [ 883.902387] CFAR: c00000000000fe00 IRQMASK: 0
>> --
>> [ 883.902757] NIP [c0000000002d8980] __rb_reserve_next+0xc0/0x5b0
>> [ 883.902786] LR [c0000000002d897c] __rb_reserve_next+0xbc/0x5b0
>> [ 883.902824] --- interrupt: ea0
>> [ 883.902848] [c000000001c97360] [c0000000002d8fcc]
>> ring_buffer_lock_reserve+0x15c/0x580
>> [ 883.902894] [c000000001c973f0] [c0000000002e82fc]
>> trace_function+0x4c/0x1c0
>> [ 883.902930] [c000000001c97440] [c0000000002f6f50]
>> function_trace_call+0x140/0x190
>> [ 883.902976] [c000000001c97470] [c00000000007d6f8] ftrace_call+0x4/0x44
>> [ 883.903021] [c000000001c97660] [c000000000dcf70c] __do_softirq+0x15c/0x3d4
>> [ 883.903066] [c000000001c97750] [c00000000015fc68] irq_exit+0x198/0x1b0
>> [ 883.903102] [c000000001c97780] [c000000000dc1790]
>> timer_interrupt+0x170/0x3b0
>> [ 883.903148] [c000000001c977e0] [c000000000016994]
>> replay_soft_interrupts+0x134/0x2f0
>> [ 883.903193] [c000000001c979d0] [c00000000003b2b8]
>> interrupt_exit_kernel_prepare+0x1e8/0x240
>> [ 883.903240] [c000000001c97a30] [c00000000000fd88]
>> interrupt_return+0x158/0x200
>> [ 883.903276] --- interrupt: ea0 at arch_local_irq_restore+0x70/0xc0
>
> You got a 0xea0 interrupt in the ftrace code. I wonder where it is
> looping. Do you see more soft lockup messages?

We should probably fix this recursion too. I was vaguely aware of it and
thought it might have existed with the old interrupt exit and replay
code as well and was pretty well bounded, but I'm not entirely sure it's
okay.

And now that I've thought about it a bit harder, I think there is
actually a simple way to fix it -

[PATCH] powerpc/64: prevent replayed interrupt handlers from running
 softirqs

Running softirqs enables interrupts, which can then end up recursing
into the irq soft-mask code we're trying to adjust, including replaying
interrupts itself, which may not be bounded.

This abridged trace shows how this can occur:

  NIP replay_soft_interrupts
  LR  interrupt_exit_kernel_prepare
  Call Trace:
    interrupt_exit_kernel_prepare (unreliable)
    interrupt_return
  --- interrupt: ea0 at __rb_reserve_next
  NIP __rb_reserve_next
  LR __rb_reserve_next
  Call Trace:
    ring_buffer_lock_reserve
    trace_function
    function_trace_call
    ftrace_call
    __do_softirq
    irq_exit
    timer_interrupt
    replay_soft_interrupts
    interrupt_exit_kernel_prepare
    interrupt_return
  --- interrupt: ea0 at arch_local_irq_restore

Fix this by disabling bhs (softirqs) around the interrupt replay.

Signed-off-by: Nicholas Piggin <npig...@gmail.com>
---
 arch/powerpc/kernel/irq.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 681abb7c0507..bb0d4fc8df89 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -189,6 +189,18 @@ void replay_soft_interrupts(void)
 	unsigned char happened = local_paca->irq_happened;
 	struct pt_regs regs;
 
+	/*
+	 * Prevent softirqs from being run when an interrupt handler returns
+	 * and calls irq_exit(), because softirq processing enables interrupts.
+	 * If an interrupt is taken, it may then call replay_soft_interrupts
+	 * on its way out, which gets messy and recursive.
+	 *
+	 * softirqs created by replayed interrupts will be run at the end of
+	 * this function when bhs are enabled (if they were enabled in our
+	 * caller).
+	 */
+	local_bh_disable();
+
 	ppc_save_regs(&regs);
 	regs.softe = IRQS_ENABLED;
 
@@ -264,6 +276,8 @@ void replay_soft_interrupts(void)
 		trace_hardirqs_off();
 		goto again;
 	}
+
+	local_bh_enable();
 }
 
 notrace void arch_local_irq_restore(unsigned long mask)
-- 
2.23.0
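
For readers less familiar with bh disabling, the deferral that the patch comment relies on can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct bh_model` and every `model_*` function are hypothetical names invented here, not kernel APIs, and the model ignores real interrupt delivery entirely. It shows only the counting behaviour: softirq work raised while the disable count is nonzero is held back and runs when the outermost enable brings the count to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct bh_model {
	int disable_depth;	/* nesting count of "bh disabled" sections */
	bool pending;		/* softirq work raised but not yet run */
	int runs;		/* number of times softirq work has run */
};

/* Stand-in for softirq processing: clear the pending flag, count the run. */
static void model_run_softirq(struct bh_model *m)
{
	m->pending = false;
	m->runs++;
}

static void model_bh_disable(struct bh_model *m)
{
	m->disable_depth++;
}

/* Only the outermost enable runs work that was deferred meanwhile. */
static void model_bh_enable(struct bh_model *m)
{
	assert(m->disable_depth > 0);
	if (--m->disable_depth == 0 && m->pending)
		model_run_softirq(m);
}

/*
 * Stand-in for a handler's exit path: the handler raised a softirq,
 * which runs immediately only if bhs are not disabled.
 */
static void model_irq_exit(struct bh_model *m)
{
	m->pending = true;
	if (m->disable_depth == 0)
		model_run_softirq(m);
}

/*
 * Replay nr handlers inside a bh-disabled section. Returns how many
 * softirq runs happened while the handlers were being replayed, which
 * is 0 because the section defers them.
 */
static int model_replay(struct bh_model *m, int nr)
{
	int before = m->runs;
	int during;

	model_bh_disable(m);
	for (int i = 0; i < nr; i++)
		model_irq_exit(m);
	during = m->runs - before;
	model_bh_enable(m);

	return during;
}
```

In this model, calling `model_replay()` on a zero-initialised `struct bh_model` runs the deferred work exactly once, at the closing enable. If the caller has already called `model_bh_disable()`, nothing runs until the caller's own `model_bh_enable()`, which mirrors the "if they were enabled in our caller" note in the patch comment.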