On 2/2/16, Jeffrey Merkey <jeffmer...@gmail.com> wrote:
>> Because when you catch a bug in the hard lockup detector the system
>> just sits there hard hung and you are not able to get into a debugger
>> console since the system has crashed and the watchdog code has already
>> killed off the other processors and locked up all the NMI interrupt
>> handlers, thereby preventing any debugger at all from functioning
>> other than a hardware ICE, so it's a hell of a lot easier just to
>> trigger a break when you detect the first instance of a hard lockup
>> before the system is completely hosed.
>>
>
> So this is why Ingo and tglx's suggestion doesn't work.  Unless you
> can set a breakpoint in the detector code, once the lockup occurs
> about 50% of the time (when the IF flag is not set and interrupts are
> disabled), you can't get into a debugger because the system is hosed.
>
> The way the current hard lockup detector works is a lot like the death
> star self-destruct system for linux -- it detects one, tries to IPI
> the other processors to dump their stacks, then somewhere down in the
> OS all of it locks up -- once in a while I can get it to panic.  A
> great bug to test your detector with is the one in timekeeper.c tglx
> and I worked on.  Good luck getting into any debugger when it fires
> off.  I like the fact this code does not call panic and is somewhat
> dynamic allowing recovery of the system, but it takes a healthy system
> with a single bug, burns it to the ground, locks up all the
> processors, and prevents the debugger from being entered unless a
> breakpoint has been set.
>
> Perhaps this helps you understand.
>
> Jeff
>

And we could just call notify_die() here instead and pass a faux debugger
exception.  That actually is clean and would work too.  Any handlers out
there will behave as though it's an int3 instruction.

Hmmm.  That's an easy patch and I could test it quickly.

Jeff
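
For illustration only, here is a rough user-space model of the notify_die() idea, not the patch itself. Every model_* name below is made up for this sketch; in the kernel the real pieces would be notify_die(), DIE_INT3 and struct pt_regs, and as far as I recall the x86 int3 path passes X86_TRAP_BP and SIGTRAP along with them. The model walks handlers in registration order, whereas the kernel's die chain is ordered by notifier priority.

```c
/* User-space model only: none of the model_* names exist in the kernel. */
#include <assert.h>
#include <stddef.h>

#define MODEL_NOTIFY_DONE 0   /* handler ignored the event */
#define MODEL_NOTIFY_STOP 1   /* handler claimed the event */
#define MODEL_DIE_INT3    1   /* stand-in for the kernel's DIE_INT3 */
#define MODEL_TRAP_BP     3   /* x86 breakpoint vector number */
#define MODEL_MAX_HANDLERS 8

/* Stand-in for struct pt_regs; only the instruction pointer is modelled. */
struct model_regs {
    unsigned long ip;
};

/* Stand-in for the argument bundle that die handlers receive. */
struct model_die_args {
    struct model_regs *regs;
    const char *str;
    int trapnr;
};

typedef int (*model_handler_t)(int event, struct model_die_args *args);

static model_handler_t model_handlers[MODEL_MAX_HANDLERS];
static size_t model_n_handlers;

/* Append a handler to the chain (registration order, no priorities). */
static void model_register_die_notifier(model_handler_t handler)
{
    assert(model_n_handlers < MODEL_MAX_HANDLERS);
    model_handlers[model_n_handlers++] = handler;
}

/* Walk the chain and stop at the first handler that claims the event. */
static int model_notify_die(int event, const char *str,
                            struct model_regs *regs, int trapnr)
{
    struct model_die_args args = { regs, str, trapnr };
    size_t i;

    for (i = 0; i < model_n_handlers; i++)
        if (model_handlers[i](event, &args) == MODEL_NOTIFY_STOP)
            return MODEL_NOTIFY_STOP;
    return MODEL_NOTIFY_DONE;
}

static int model_debugger_entries;     /* how often the "debugger" was entered */
static unsigned long model_last_ip;    /* where the faux breakpoint was reported */

/* A debugger-like handler: it only reacts to breakpoint-style events. */
static int model_debugger_handler(int event, struct model_die_args *args)
{
    if (event != MODEL_DIE_INT3 || args->trapnr != MODEL_TRAP_BP)
        return MODEL_NOTIFY_DONE;
    model_debugger_entries++;
    model_last_ip = args->regs->ip;
    return MODEL_NOTIFY_STOP;
}

/*
 * Called at the first detected hard lockup: report it as a faux int3 so
 * any registered debugger handler treats it like a breakpoint.
 * Returns 1 if a handler took the event, 0 otherwise.
 */
static int model_hard_lockup_detected(struct model_regs *regs)
{
    return model_notify_die(MODEL_DIE_INT3, "int3", regs,
                            MODEL_TRAP_BP) == MODEL_NOTIFY_STOP;
}
```

The point of the sketch is that the detector never has to know which debugger is loaded. It raises the event once, before the other processors are touched, and falls back to its normal reporting only when nothing on the chain claims it.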