Hi,

On Mon, Sep 15, 2025 at 3:35 AM Peter Zijlstra <pet...@infradead.org> wrote:
>
> On Mon, Sep 15, 2025 at 11:26:09AM +0100, Will Deacon wrote:
>
> > | If all CPUs are hard locked up at the same time the buddy system
> > | can't detect it.
> >
> > Ok, so why is that limitation acceptable? It looks to me like you're
> > removing useful functionality.
>
> Yeah, this. I've run into this case waaay too many times to think it
> reasonable to remove the perf/NMI based lockup detector.

I am a bit curious how this comes to be in cases where you've seen it.
What causes all CPUs to be stuck looping, all with interrupts disabled
(but still able to execute NMIs)? Certainly one can come up with a
synthetic way to make that happen, but I would imagine it to be
exceedingly rare in real life. Maybe all CPUs are deadlocked waiting on
spinlocks or something? There shouldn't be a lot of other reasons that
all CPUs should be stuck indefinitely with interrupts disabled...

If that's what's happening, (just spitballing) I wonder if hooking into
the slowpath of spinlocks to look for lockups would help? Maybe every
10000 failures to acquire the spinlock we check for a lockup?

Obviously you could still come up with synthetic ways to cause a lockup
that the watchdog doesn't catch, but hopefully in those types of cases
we can at least reset the device with a hardware watchdog?

Overall the issue is that it's really awkward to have both types of
lockup detectors, especially since you've got to pick at compile time.
The perf lockup detector has a pile of things that make it pretty
awkward and it seems like people have been moving toward the buddy
detector because of this...

-Doug