On Mon, Nov 17, 2014 at 4:22 PM, Luck, Tony <tony.l...@intel.com> wrote:
>> It could also be interesting to tweak mce_panic to not actually panic
>> the machine but to try to return and stop the test instead. Then real
>> debugging could be possible :)
>
> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
> have to do a full power cycle.

How is it even possible that I did that with a few lines of asm? Could
this be a hardware bug? Is there some condition that causes #MC
delivery to wedge hard enough that even INIT/RESET stops working? Or
possibly some CPU got stuck in SMM -- I have no idea what warm reset
does these days.

My initial attempts to test machine_check in KVM using IPIs are having
some issues, probably because I'm not acking the interrupt. I can do
it once, but then it stops working.

Here's the patch to improve the timeout messages, but given the degree
of wedgedness, I can guess what it'll say:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=e5cbd9d141bde651ecb20f0b65ad13bcef2468d0

--Andy

>
> -Tony

--
Andy Lutomirski
AMA Capital Management, LLC