On Mon, Sep 15, 2025 at 08:42:00AM -0700, Doug Anderson wrote:
> On Mon, Sep 15, 2025 at 3:35 AM Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > On Mon, Sep 15, 2025 at 11:26:09AM +0100, Will Deacon wrote:
> >
> > > | If all CPUs are hard locked up at the same time the buddy system
> > > | can't detect it.
> > >
> > > Ok, so why is that limitation acceptable? It looks to me like you're
> > > removing useful functionality.
> >
> > Yeah, this. I've run into this case waaay too many times to think it
> > reasonable to remove the perf/NMI based lockup detector.
>
> I am a bit curious how this comes to be in cases where you've seen it.
> What causes all CPUs to be stuck looping all with interrupts disabled
> (but still able to execute NMIs)? Certainly one can come up with a
> synthetic way to make that happen, but I would imagine it to be
> exceedingly rare in real life. Maybe all CPUs are deadlocked waiting
> on spinlocks or something? There shouldn't be a lot of other reasons
> that all CPUs should be stuck indefinitely with interrupts disabled...

The simplest one I often run into is rq->lock getting stuck and then all
the other CPUs piling up on that in various ways.

Getting stop_machine() stuck is also a fun one.

I mean, it really isn't that hard. If, as a full time kernel dev, you
don't get into this situation at least a few times a year, you're just
not doing your job right ;-)

> If that's what's happening, (just spitballing) I wonder if hooking
> into the slowpath of spinlocks to look for lockups would help? Maybe
> every 10000 failures to acquire the spinlock we check for a lockup?
> Obviously you could still come up with synthetic ways to make a
> non-caught watchdog, but hopefully in those types of cases we can at
> least reset the device with a hardware watchdog?

Now, why would I want to make the spinlock code worse if I have a
perfectly functional NMI watchdog?

> Overall the issue is that it's really awkward to have both types of
> lockup detectors, especially since you've got to pick at compile time.

Well, then go fix that. Surely this isn't rocket science.

> The perf lockup detector has a pile of things that make it pretty
> awkward and it seems like people have been toward the buddy detector
> because of this...

There's nothing awkward about the perf one, except that it takes one
counter, and some people are just greedy and want all of them.

At the same time, there are people posting patches that use the PMU for
page-promotion like things, so these same greedy people are going to
hate on that too.