On Tue, May 30, 2017 at 10:22:08AM -0700, Andi Kleen wrote:
> > > You would only need a single one per system however, not one per CPU.
> > > RCU already tracks all the CPUs, all we need is a single NMI watchdog
> > > that makes sure RCU itself does not get stuck.
> > >
> > > So we just have to find a single watchdog somewhere that can trigger
> > > NMI.
> >
> > But then you have to IPI broadcast the NMI, which is less than ideal.
>
> Only when the watchdog times out to print the backtraces.

The current NMI watchdog has a per-cpu state. So that means either doing
for_all_cpu() loops or IPI broadcasts from the NMI tickle. Neither is
something you really want.

> > RCU doesn't have that problem because the quiescent state is a global
> > thing. CPU progress, which is what the NMI watchdog tests, is very much
> > per logical CPU though.
>
> RCU already has a CPU stall detector. It should work (and usually
> triggers before the NMI watchdog in my experience unless the
> whole system is dead)

It only goes and looks at CPU state once it detects the global QS is
stalled, I think. But I've not had much luck with the RCU one -- although
I think it's been improved since I last had a hard problem.
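
To make the per-cpu state point concrete, here is a rough userspace
sketch, not kernel code. NR_CPUS, STALL_THRESHOLD, watchdog_touch() and
watchdog_check_all() are names made up for illustration, and the tick
values are arbitrary. It only shows why a single system-wide checker ends
up walking every CPU's slot when progress is tracked per CPU:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4            /* hypothetical CPU count for this sketch */
#define STALL_THRESHOLD 10   /* hypothetical timeout, in ticks */

/* Per-CPU watchdog state: the tick at which each CPU last made progress. */
static uint64_t last_touch[NR_CPUS];

/* Called on behalf of a CPU when it makes progress (the "tickle"). */
static void watchdog_touch(unsigned int cpu, uint64_t now)
{
	assert(cpu < NR_CPUS);
	last_touch[cpu] = now;
}

/*
 * Single system-wide check. There is no global summary of progress,
 * so the checker has to loop over every CPU's slot.
 * Returns a bitmask of CPUs whose last touch is older than the threshold.
 */
static unsigned int watchdog_check_all(uint64_t now)
{
	unsigned int stalled = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (now - last_touch[cpu] > STALL_THRESHOLD)
			stalled |= 1u << cpu;
	}
	return stalled;
}
```

In this toy model the loop is cheap; the concern above is about doing the
equivalent walk (or an IPI broadcast) from the real NMI tickle path.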