On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
> > Hmm, I am confused here.  So you are saying because we are in the nmi
> > handler you can not break into the system?  The nmi handler prints some
> > stuff to the screen, pokes the other cpus to print stuff to the screen and
> > then returns to a normal operation.  Unless you are saying the act of
> > sending NMI IPIs never completes (because a cpu is blocking IPI
> > interrupts),
> > so the cpu hangs in nmi context and the debugger never has a chance to
> > 'break' in and see what is going on?
> >
> > Cheers,
> > Don
> >
>
> Yes. the nmi handlers never complete for the bug I worked on with
> tglx, probably because an nmi handler is calling timekeeper.c
> somewhere.  Some of these lockup bugs may be calling code from the nmi
> handlers that cause the lockup condition in the first place in some
> cases, so it will never reach a call to panic.  Looking over this code
> it's damn hard to find a good way to do this that works across all the
> arches without adding another macro to bug.h (BREAK_ON maybe), so I
> just used one that's already there.  I'll go back and rethink this
> some more.  It could just be as simple as calling panic from the first
> detection -- that works.

So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
'hardlockup_panic', you should be able to achieve what you want, no?

But you mentioned you wanted to recover?  Hence avoiding the panic?

Cheers,
Don