On Mon, Feb 23, 2015 at 09:12:29AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. This disablement is done via NMI, and the crashing
> CPU wait for the completions by spinning at most for 1 second.
> However, there is a race window if this NMI handling doesn't complete within
> the 1 second on some CPU, which cause the fragile situation where only a
> portion of online CPUs are responsive to MCE interrupt. If MCE happens during
> this race window, MCE synchronization always timeouts and results in kernel
> panic. So the user-visible effect of this bug is kdump failure.
>
> Note that this race window did exist when current MCE handler was implemented
> around 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has
> reached a timeout") made it more visible by changing the default behavior of
> the synchronization timeout from "ignore" to "panic".

Let me guess: you could raise the tolerance level to 3 temporarily from
native_machine_crash_shutdown() and not touch the #MC handler at all, right?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
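
To make the suggestion concrete, here is a rough userspace model of the idea, not kernel code and not a patch. The struct and both function names are hypothetical stand-ins. The sketch assumes that a synchronization timeout in the #MC handler panics only at tolerance levels 0 and 1, so raising the level to 3 on the crash path would turn the timeout back into an "ignore" without changing the handler itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the MCE configuration. Only the tolerance
 * level matters for this sketch (0..3, where 3 means "never panic").
 */
struct mce_model_cfg {
	int tolerant;
};

/*
 * Models the decision taken when #MC synchronization times out.
 * Returns true if the handler would panic. Assumption: the panic is
 * taken only at tolerance levels 0 and 1.
 */
static bool sync_timeout_panics(const struct mce_model_cfg *cfg)
{
	assert(cfg != NULL);
	return cfg->tolerant <= 1;
}

/*
 * Models what the crash shutdown path would do before shooting down
 * the other CPUs: raise the tolerance level to 3 and hand back the
 * previous value in case the caller wants to restore it.
 */
static int crash_raise_tolerance(struct mce_model_cfg *cfg)
{
	int old;

	assert(cfg != NULL);
	old = cfg->tolerant;
	cfg->tolerant = 3;
	return old;
}
```

With the tolerance level at 1, sync_timeout_panics() returns true in this model; after crash_raise_tolerance() it returns false, which is the behavior the crashing CPU would want while the other CPUs are being stopped.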