Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic

Muchun Song Tue, 09 Feb 2021 04:22:15 -0800

On Tue, Feb 9, 2021 at 5:19 PM Petr Mladek <pmla...@suse.com> wrote:
>
> On Sat 2021-02-06 13:41:24, Muchun Song wrote:
> > We found a deadlock bug on our server when the kernel panic. It can be
> > described in the following diagram.
> >
> > CPU0:                                         CPU1:
> > panic                                         rcu_dump_cpu_stacks
> >   kdump_nmi_shootdown_cpus                      
> > nmi_trigger_cpumask_backtrace
> >     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
> >                                                     __printk_safe_flush
> >                                                       
> > raw_spin_lock_irqsave(&read_lock)
> >     // send NMI to other processors
> >     apic_send_IPI_allbutself(NMI_VECTOR)
> >                                                         // NMI interrupt, 
> > dead loop
> >                                                         crash_nmi_callback
> >   printk_safe_flush_on_panic
> >     printk_safe_flush
> >       __printk_safe_flush
> >         // deadlock
> >         raw_spin_lock_irqsave(&read_lock)
> >
> > The register_nmi_handler() can be called in the __crash_kexec() or the
> > crash_smp_send_stop() on the x86-64. Because CPU1 is interrupted by the
> > NMI with holding the read_lock and crash_nmi_callback() never returns,
> > CPU0 can deadlock when printk_safe_flush_on_panic() is called.
> >
> > When we hold the read_lock and then interrupted by the NMI, if the NMI
> > handler call nmi_panic(), it is also can lead to deadlock.
> >
> > In order to fix it, we make read_lock global and rename it to
> > safe_read_lock. And we handle safe_read_lock the same way in
> > printk_safe_flush_on_panic() as we handle logbuf_lock there.
>
> What about the following commit message? It uses imperative language
> and explains that the patch just prevents the deadlock. It removes
> some details. The diagram is better than many words.
>
> <commit message>
> printk_safe_flush_on_panic() caused the following deadlock on our server:
>
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       
> raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, 
> dead loop
>                                                         crash_nmi_callback
>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)
>
> DEADLOCK: read_lock is taken on CPU1 and will never get released.
>
> It happens when panic() stops a CPU by NMI while it has been in
> the middle of printk_safe_flush().
>
> Handle the lock the same way as logbuf_lock. The printk_safe buffers
> are flushed only when both locks can be safely taken.
>
> Note: It would actually be safe to re-init the locks when all CPUs were
>       stopped by NMI. But it would require passing this information
>       from arch-specific code. It is not worth the complexity.
>       Especially because logbuf_lock and printk_safe buffers have been
>       obsoleted by the lockless ring buffer.
> </commit message>


Many thanks. It is clear.

>
> > Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic")
> > Signed-off-by: Muchun Song <songmuc...@bytedance.com>
>
> With an updated commit message:
>
> Reviewed-by: Petr Mladek <pmla...@suse.com>

Thanks. I will update the commit message.

>
> Best Regards,
> Petr

Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic

Reply via email to