On 12/16/15, Jeff Merkey <linux....@gmail.com> wrote:
> On 12/14/15, Jeff Merkey <linux....@gmail.com> wrote:
>> The current touch_nmi_watchdog() function in /kernel/watchdog.c does
>> not always catch all cases when a processor is spinning in the nmi
>> handler inside either KGDB, KDB, or MDB, in particular, the case where
>> a processor is being held by a debugger inside an int1 handler.
>>
>> The hrtimer_interrupts_saved count can still end up matching the
>> hrtime value in some cases, resulting in the hard lockup detector
>> tagging processors inside a debugger and executing a panic.
>>
>> The patch below corrects this problem.  I did not add this to
>> the touch_nmi_function directly because of possible effects on
>> timing issues since the function is widely used by drivers and
>> modules.
>>
>> I have tested this patch and it fixes the problem for kernel debuggers
>> stopping errant hard lockup events when processors are spinning inside
>> the debugger.
>>
>> Signed-off-by: Jeff Merkey <linux....@gmail.com>
>> ---
>>  kernel/watchdog.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 18f34cf..b682aab 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -283,6 +283,13 @@ static bool is_hardlockup(void)
>>  	__this_cpu_write(hrtimer_interrupts_saved, hrint);
>>  	return false;
>>  }
>> +
>> +void touch_hardlockup_watchdog(void)
>> +{
>> +	__this_cpu_write(hrtimer_interrupts_saved, 0);
>> +}
>> +EXPORT_SYMBOL_GPL(touch_hardlockup_watchdog);
>> +
>>  #endif
>>
>>  static int is_softlockup(unsigned long touch_ts)
>> --
>> 1.8.3.1
>>
>
> I got to the bottom of it.  It's related to the hardware I am using.
> One of the processors is faulting and hanging due to an existing bug
> in the hw_breakpoint handler not setting the resume flag (I have
> previously reported it and submitted a patch).  This breaks your code,
> but there's nothing you can do about it.
>
> There is a severe bug in hw_breakpoint.c that causes int1 recursion
> and this whole "lazy debug register switching" nonsense does not work
> properly.  I am probably the first person to actually test this code
> path robustly.  I applied the patch that fixes this bug in
> hw_breakpoint.c and the problem with your code firing off and ignoring
> the touch flag went away.
>
> Jeff
>

Wow, I figured it out.  What's really needed here is the ability to
touch all the processors from just one processor.  That's what's
missing.  This per-processor nonsense doesn't fly here.  A debugger
needs to be able to turn the hard lockup detector off completely when
needed (it needs an on/off switch).  I'll submit a patch for that.

Until fixes start showing up in the tree, I'll maintain the working
version in my debugger patch so people can get a working, stable,
debuggable kernel and not a broken one.

Jeff
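
The two missing pieces described above (touching every processor from
one processor, and an on/off switch) can be modelled in userspace
roughly as follows.  This is only a sketch, not kernel code: per-CPU
variables become plain arrays, the body of is_hardlockup() is filled in
from the patch context as an assumption, and
touch_all_hardlockup_watchdogs(), hardlockup_detector_set_enabled() and
the hardlockup_detector_enabled flag are hypothetical names that are
not in the tree.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* arbitrary size for the model */

/* Per-CPU state, modelled as plain arrays indexed by CPU number. */
static unsigned long hrtimer_interrupts[NR_CPUS];
static unsigned long hrtimer_interrupts_saved[NR_CPUS];

/* Hypothetical global on/off switch for the detector. */
static bool hardlockup_detector_enabled = true;

/* Models the watchdog hrtimer firing on one CPU. */
static void watchdog_timer_tick(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hrtimer_interrupts[cpu]++;
}

/* Models the check made from the NMI: no timer progress means lockup. */
static bool is_hardlockup(int cpu)
{
	unsigned long hrint;

	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!hardlockup_detector_enabled)
		return false;	/* switched off, e.g. by a debugger */

	hrint = hrtimer_interrupts[cpu];
	if (hrtimer_interrupts_saved[cpu] == hrint)
		return true;

	hrtimer_interrupts_saved[cpu] = hrint;
	return false;
}

/* Per-CPU touch, as in the quoted patch: reset the saved count. */
static void touch_hardlockup_watchdog(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hrtimer_interrupts_saved[cpu] = 0;
}

/* Hypothetical: touch every CPU from whichever CPU calls this. */
static void touch_all_hardlockup_watchdogs(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		touch_hardlockup_watchdog(cpu);
}

/* Hypothetical on/off switch; re-enabling touches all CPUs first. */
static void hardlockup_detector_set_enabled(bool on)
{
	if (on)
		touch_all_hardlockup_watchdogs();
	hardlockup_detector_enabled = on;
}
```

In this model a touch only covers the next check: if the timer count
has still not advanced by the check after that, the lockup is reported
again.  A debugger holding processors for a long time would therefore
have to keep calling touch_all_hardlockup_watchdogs(), or call
hardlockup_detector_set_enabled(false) on entry and
hardlockup_detector_set_enabled(true) on exit.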