> Because when you catch a bug in the hard lockup detector the system
> just sits there hard hung and you are not able to get into a debugger
> console since the system has crashed and the watchdog code has already
> killed off the other processors and locked up all the NMI interrupt
> handlers, thereby preventing any debugger at all from functioning
> other than a hardware ice, so it's a hell of a lot easier just to
> trigger a break when you detect the first instance of a hard lockup
> before the system is completely hosed.
>
So this is why Ingo and tglx's suggestion doesn't work. Unless you can
set a breakpoint in the detector code, about 50% of the time once the
lockup occurs (when the IF flag is not set and interrupts are disabled)
you can't get into a debugger because the system is hosed.

The way the current hard lockup detector works is a lot like the Death
Star self-destruct system for Linux: it detects a lockup, tries to IPI
the other processors to dump their stacks, then somewhere down in the
OS all of it locks up. Once in a while I can get it to panic.

A great bug to test your detector with is the one in timekeeper.c that
tglx and I worked on. Good luck getting into any debugger when it
fires off.

I like the fact that this code does not call panic and is somewhat
dynamic, allowing recovery of the system, but it takes a healthy
system with a single bug, burns it to the ground, locks up all the
processors, and prevents the debugger from being entered unless a
breakpoint has been set.

Perhaps this helps you understand.

Jeff