On Thu, Apr 11, 2024 at 3:52 PM Andres Freund <and...@anarazel.de> wrote:
> My suspicion is that most of the false positives are caused by lots of signals
> interrupting the pg_usleep()s. Because we measure the number of delays, not
> the actual time since we've been waiting for the spinlock, signals
> interrupting pg_usleep() can very significantly shorten the amount of
> time until we consider a spinlock stuck. We should fix that.
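For concreteness, here is a standalone sketch of the distinction being drawn above. It is not the actual s_lock.c code; the names, thresholds, and loop structure are all invented for illustration. It just contrasts a "number of delays" stuck check with an "elapsed time" stuck check in a spin-and-sleep loop: when usleep() is cut short by signals, the delay count climbs at the same rate per iteration even though each iteration covers less real time, so the count-based check can fire long before the intended timeout, while the time-based check cannot.

/*
 * Standalone sketch -- NOT the actual s_lock.c code.  All names and
 * thresholds here are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define STUCK_DELAY_COUNT   1000    /* hypothetical count-based threshold */
#define STUCK_TIMEOUT_SECS  60      /* hypothetical time-based threshold */
#define DELAY_USECS         1000    /* nominal sleep per spin iteration */

/* Seconds of wall-clock time since *start, using a monotonic clock. */
static double
elapsed_secs(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) +
           (now.tv_nsec - start->tv_nsec) / 1e9;
}

int
main(void)
{
    struct timespec start;
    int         ndelays = 0;
    bool        count_fired = false;

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Pretend the lock never becomes free, as with a truly stuck spinlock. */
    for (;;)
    {
        usleep(DELAY_USECS);    /* may return early if a signal arrives */
        ndelays++;

        /* Count-based check: an interrupted sleep still counts as one delay. */
        if (!count_fired && ndelays >= STUCK_DELAY_COUNT)
        {
            count_fired = true;
            printf("count check fired after %d delays (%.1f s of real time)\n",
                   ndelays, elapsed_secs(&start));
        }

        /* Time-based check: unaffected by how often the sleeps were cut short. */
        if (elapsed_secs(&start) >= STUCK_TIMEOUT_SECS)
        {
            printf("time check fired after %d delays (%.1f s of real time)\n",
                   ndelays, elapsed_secs(&start));
            break;
        }
    }

    return 0;
}

With these made-up numbers, if something were delivering a signal roughly every 100 microseconds, each iteration would cover about a tenth of the nominal sleep, and the count-based check would fire after roughly a tenth of the wall-clock time the threshold was presumably meant to represent.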
I mean, go nuts. But <dons asbestos underpants, asbestos regular pants, 2 pair of asbestos socks, 3 asbestos shirts, 2 asbestos jackets, and then hides inside of a flame-proof capsule at the bottom of the Pacific ocean> this is just another thing like query hints, where everybody says "oh, the right thing to do is fix X or Y or Z and then you won't need it". But of course it never actually gets fixed well enough that people stop having problems in the real world. And eventually we look like a developer community that cares more about our own opinion about what is right than what the experience of real users actually is.

> I don't think that's a fair description of the situation. It supposes that the
> alternative to the PANIC is that the problem is detected and resolved some
> other way. But, depending on the spinlock, the problem will not be detected by
> automated checks for the system being up. IME you end up with a system that's
> degraded in a complicated hard to understand way, rather than one that's just
> down.

I'm interested to read that you've seen this actually happen and that you got that result. What I would have thought would happen is that, within a relatively short period of time, every backend in the system would pile up waiting for that spinlock and the whole system would become completely unresponsive. I mean, I know it depends on exactly which spinlock it is. But I would have thought that, if this was happening, it would be because some regular backend died in a weird way; and if that is indeed what happened, then it's likely that the other backends are doing similar kinds of work, because that's how application workloads typically behave, so they'll probably all hit the part of the code where they need that spinlock too, and now everybody's just spinning.

If it's something like a WAL receiver mutex or the checkpointer mutex or even a parallel query mutex, then I guess it would look different. But even then, what I'd expect to see is all backends of type X piling up on the stuck mutex, and when you check 'ps' or 'top', you go "oh hmm, all my WAL receivers are at 100% CPU" and you get a backtrace or an strace and you go "hmm".

Now, I agree that in this kind of scenario, where only some backends lock up, automated checks are not necessarily going to notice the problem - but a PANIC is hardly better. Now you just have a system that keeps PANICing, which liveness checks aren't necessarily going to notice either.

In all seriousness, I'd really like to understand what experience you've had that makes this check seem useful. Because I think all of my experiences with it have been bad. If they weren't, the last good one was a very long time ago.

--
Robert Haas
EDB: http://www.enterprisedb.com