Nathan Lynch <nath...@linux.ibm.com> writes:
> When smp_send_safe_nmi_ipi() indicates that the target CPU has
> responded to the IPI, skip the remote paca inspection
> fallback. Otherwise both the sending and target CPUs attempt the
> backtrace, usually creating a misleading ("didn't respond to backtrace
> IPI" is wrong) and interleaved mess:
Thanks for fixing my bugs for me :)

> [ 1658.929157][ C7] rcu: Stack dump where RCU GP kthread last ran:
> [ 1658.929223][ C7] Sending NMI from CPU 7 to CPUs 1:
> [ 1658.929303][ C1] NMI backtrace for cpu 1
> [ 1658.929303][ C7] CPU 1 didn't respond to backtrace IPI, inspecting paca.
> [ 1658.929362][ C1] CPU: 1 PID: 325 Comm: kworker/1:1H Tainted: G W E 5.13.0-rc2+ #46
> [ 1658.929405][ C7] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 325 (kworker/1:1H)
> [ 1658.929465][ C1] Workqueue: events_highpri test_work_fn [test_lockup]
> [ 1658.929549][ C7] Back trace of paca->saved_r1 (0xc0000000057fb400) (possibly stale):
> [ 1658.929592][ C1] NIP: c00000000002cf50 LR: c008000000820178 CTR: c00000000002cfa0
>
> Verified using the test_lockup module, e.g.
>
> $ echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> $ insmod test_lockup.ko time_secs=1 iterations=10 state=R lock_rcu \
>       touch_softlockup all_cpus
>
> Fixes: 5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
> Signed-off-by: Nathan Lynch <nath...@linux.ibm.com>
> ---
>  arch/powerpc/kernel/stacktrace.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
> index 1deb1bf331dd..e0ccc5a46d7e 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -174,11 +174,14 @@ static void raise_backtrace_ipi(cpumask_t *mask)
>  {
>  	unsigned int cpu;
>
> +	if (cpumask_test_cpu(smp_processor_id(), mask)) {
> +		handle_backtrace_ipi(NULL);
> +		cpumask_clear_cpu(smp_processor_id(), mask);
> +	}
> +
>  	for_each_cpu(cpu, mask) {
> -		if (cpu == smp_processor_id())
> -			handle_backtrace_ipi(NULL);
> -		else
> -			smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
> +		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
> +			cpumask_clear_cpu(cpu, mask);

I think there's still a race here, but instead of causing us to emit a
spurious "didn't respond" trace, it could lead to us failing to emit a
proper trace when we should.

It's hard to follow this code, but mask above is backtrace_mask from
lib/nmi_backtrace.c, because of:

  void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
  				     bool exclude_self,
  				     void (*raise)(cpumask_t *mask))
  {
  	int i, this_cpu = get_cpu();

  	cpumask_copy(to_cpumask(backtrace_mask), mask);
  	                        ^^^^^^^^^^^^^^
  	...
  	if (!cpumask_empty(to_cpumask(backtrace_mask))) {
  		pr_info("Sending NMI from CPU %d to CPUs %*pbl:\n",
  			this_cpu, nr_cpumask_bits, to_cpumask(backtrace_mask));
  		raise(to_cpumask(backtrace_mask));
  		                 ^^^^^^^^^^^^^^

And raise there is raise_backtrace_ipi() (the function we're patching).

On the receiving CPU we end up executing:

  int smp_handle_nmi_ipi(struct pt_regs *regs)
  {
  	...
  	nmi_ipi_lock_start(&flags);
  	if (cpumask_test_cpu(me, &nmi_ipi_pending_mask)) {
  		cpumask_clear_cpu(me, &nmi_ipi_pending_mask);
  		fn = READ_ONCE(nmi_ipi_function);
  		...
  	}
  	nmi_ipi_unlock_end(&flags);

  	if (fn)
  		fn(regs);

The key detail is that we drop the nmi lock before calling fn, which
means the calling CPU can return to raise_backtrace_ipi() before fn is
called.

In our case fn is handle_backtrace_ipi(), which just calls
nmi_cpu_backtrace(), which does:

  bool nmi_cpu_backtrace(struct pt_regs *regs)
  {
  	int cpu = smp_processor_id();

  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
  		...
  		pr_warn("NMI backtrace for cpu %d\n", cpu);
  		if (regs)
  			show_regs(regs);
  		else
  			dump_stack();

ie. if the CPU has already been cleared from backtrace_mask it doesn't
emit a stack trace.
So we could end up with the following interleaving:

  CPU0                                          CPU1
  ====                                          ====
  if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, ...
                                                // smp_handle_nmi_ipi()
                                                fn = READ_ONCE(nmi_ipi_function);
                                                ...
                                                nmi_ipi_unlock_end(&flags);
  // in smp_send_safe_nmi_ipi()
  nmi_ipi_lock();
  while (!cpumask_empty(&nmi_ipi_pending_mask)) {
  ...
  nmi_ipi_unlock_end(&flags);
  return ret;
  cpumask_clear_cpu(cpu, mask);
                                                fn(regs) // -> nmi_cpu_backtrace()
                                                if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {

But like I said, it's not easy to follow, so maybe I missed something
along the way.

To solve it I think we want to avoid clearing a CPU from the mask
unless we know that the IPI failed for that CPU. That way there's no
risk of suppressing a trace from a CPU that successfully handles the
IPI, and we know we've waited 5 seconds for CPUs that fail to handle
the IPI.

I don't think we want to allocate a whole new cpumask to track which
CPUs have failed to respond, but I don't think we need to. We can just
handle them synchronously. Something like below.

cheers


diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..980e87f7ae7a 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -172,17 +172,19 @@ static void handle_backtrace_ipi(struct pt_regs *regs)
 
 static void raise_backtrace_ipi(cpumask_t *mask)
 {
+	struct paca_struct *p;
 	unsigned int cpu;
 
 	for_each_cpu(cpu, mask) {
-		if (cpu == smp_processor_id())
+		if (cpu == smp_processor_id()) {
 			handle_backtrace_ipi(NULL);
-		else
-			smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
-	}
+			continue;
+		}
 
-	for_each_cpu(cpu, mask) {
-		struct paca_struct *p = paca_ptrs[cpu];
+		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+			continue;
+
+		p = paca_ptrs[cpu];
 
 		cpumask_clear_cpu(cpu, mask);
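
For clarity, here is a sketch of roughly how raise_backtrace_ipi() would
read with that change applied, end to end. It's only an illustration:
the hunk above is truncated before the paca-inspection body, so that
part is assumed unchanged from the existing code (the pr_warn()s and
saved_r1 back trace visible in the log quoted earlier) and is
abbreviated to a comment here.

  /* Sketch only -- paca inspection body elided, see note above. */
  static void raise_backtrace_ipi(cpumask_t *mask)
  {
  	struct paca_struct *p;
  	unsigned int cpu;

  	for_each_cpu(cpu, mask) {
  		/* Backtrace the current CPU directly, no IPI needed. */
  		if (cpu == smp_processor_id()) {
  			handle_backtrace_ipi(NULL);
  			continue;
  		}

  		/*
  		 * A non-zero return means the target ran
  		 * handle_backtrace_ipi() within the 5s timeout. We never
  		 * clear a responding CPU from the mask, so we can't race
  		 * with its nmi_cpu_backtrace() and suppress its trace.
  		 */
  		if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
  			continue;

  		/*
  		 * The IPI timed out: fall back to inspecting the target's
  		 * paca from here, synchronously, before moving on to the
  		 * next CPU.
  		 */
  		p = paca_ptrs[cpu];

  		cpumask_clear_cpu(cpu, mask);

  		pr_warn("CPU %d didn't respond to backtrace IPI, inspecting paca.\n", cpu);
  		/* ... existing code: dump p->irq_soft_mask, p->in_mce,
  		 * p->in_nmi and p->__current, then back trace
  		 * p->saved_r1 ... */
  	}
  }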