On Fri, Dec 07, 2012 at 09:41:13PM +0000, Seiji Aguchi wrote: > [Issue] > > If one cpu ,which is taking a psinfo->buf_lock, > receive NMI from a panicked cpu via smp_send_stop(), > the panicked cpu hangs up in pstore_dump() called by > kmsg_dump(KMSG_DUMP_PANIC) > because the psinfo->buf_lock is taken again in it.
Hi Seiji, I am trying to understand this scenario. You said it is theoretical. Looking through smp_send_stop, it only sends an NMI if the IRQ did not work after a second. So the scenario you are looking at is: cpuA grabs psinfo->buf_lock cpuB panics and calls smp_send_stop smp_send_stop sends IRQ to cpuA after 1 second, cpuB gives up on cpuA and sends an NMI instead cpuA is now in an NMI handler while still holding buf_lock cpuB is deadlocked Now my first reaction would be, if that is the scenario, why couldn't cpuA release the lock within one second. Because if cpuA is stuck talking with firmware, then your patch to force the unlock is probably going to trip over the same problems. (those problems include dealing with resetting a firmware state machine) IOW, your patch is just one of many minefields you will have to cross in order to be successful. Without any testing to show that you have cleared all those minefields, I am leaning against your patch for now. I would rather have a complete patchset that fixes all the problems in this path. > > To avoid the deadlock, an easy solution is moving kmsg_dump above > smp_send_stop() in panic path. > > But, it is not safe to kick pstore while multiple cpus are running in panic > case, > because they may touch corrupted data/variables and unnecessary failures may > happen. > In that case, we can't guarantee that a panicked cpu can log messages reliably > because it may have harmful effects due to the failures. > > [Solution] > > This patch skips taking a psinfo->buf_lock when just one cpu is online > because stopped cpus turn to offline via smp_send_stop() > in some architectures like x86, powerpc or arm64. > > It may be a hack but solves my concern deadlocking in x86 architecture. If you did have to go this route, why not take the 'reason' variable and pass it to some function 'in_panic_path(reason)' that tells you if you are panicing or not. Then you can change the 'if in_nmi()' to 'if in_nmi() || in_panic'. The panic path already disables irqs for you, not sure you need to do it again. Unless you have some other path you are trying to protect in mind. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/