On Thu, Apr 11, 2024 at 3:52 PM Andres Freund <and...@anarazel.de> wrote:
> My suspicion is that most of the false positives are caused by lots of signals
> interrupting the pg_usleep()s. Because we measure the number of delays, not
> the actual time since we've been waiting for the spinlock, signals
> interrupting pg_usleep() can very significantly shorten the amount of
> time until we consider a spinlock stuck. We should fix that.
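For concreteness, here is a standalone sketch of the distinction being drawn above. It is not the actual s_lock.c code; the names, thresholds, and loop structure are all invented for illustration. It just contrasts a "number of delays" stuck check with an "elapsed time" stuck check in a spin-and-sleep loop: when usleep() is cut short by signals, the delay count climbs at the same rate per iteration even though each iteration covers less real time, so the count-based check can fire long before the intended timeout, while the time-based check cannot.

/*
 * Standalone sketch -- NOT the actual s_lock.c code.  All names and
 * thresholds here are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define STUCK_DELAY_COUNT   1000    /* hypothetical count-based threshold */
#define STUCK_TIMEOUT_SECS  60      /* hypothetical time-based threshold */
#define DELAY_USECS         1000    /* nominal sleep per spin iteration */

/* Seconds of wall-clock time since *start, using a monotonic clock. */
static double
elapsed_secs(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) +
           (now.tv_nsec - start->tv_nsec) / 1e9;
}

int
main(void)
{
    struct timespec start;
    int         ndelays = 0;
    bool        count_fired = false;

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Pretend the lock never becomes free, as with a truly stuck spinlock. */
    for (;;)
    {
        usleep(DELAY_USECS);    /* may return early if a signal arrives */
        ndelays++;

        /* Count-based check: an interrupted sleep still counts as one delay. */
        if (!count_fired && ndelays >= STUCK_DELAY_COUNT)
        {
            count_fired = true;
            printf("count check fired after %d delays (%.1f s of real time)\n",
                   ndelays, elapsed_secs(&start));
        }

        /* Time-based check: unaffected by how often the sleeps were cut short. */
        if (elapsed_secs(&start) >= STUCK_TIMEOUT_SECS)
        {
            printf("time check fired after %d delays (%.1f s of real time)\n",
                   ndelays, elapsed_secs(&start));
            break;
        }
    }

    return 0;
}

With these made-up numbers, if something were delivering a signal roughly every 100 microseconds, each iteration would cover about a tenth of the nominal sleep, and the count-based check would fire after roughly a tenth of the wall-clock time the threshold was presumably meant to represent.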
I mean, go nuts. But <dons asbestos underpants, asbestos regular pants, 2 pair of asbestos socks, 3 asbestos shirts, 2 asbestos jackets, and then hides inside of a flame-proof capsule at the bottom of the Pacific ocean> this is just another thing like query hints, where everybody says "oh, the right thing to do is fix X or Y or Z and then you won't need it". But of course it never actually gets fixed well enough that people stop having problems in the real world. And eventually we look like a developer community that cares more about our own opinion about what is right than what the experience of real users actually is.

> I don't think that's a fair description of the situation. It supposes that the
> alternative to the PANIC is that the problem is detected and resolved some
> other way. But, depending on the spinlock, the problem will not be detected by
> automated checks for the system being up. IME you end up with a system that's
> degraded in a complicated hard to understand way, rather than one that's just
> down.

I'm interested to read that you've seen this actually happen and that you got that result. What I would have thought would happen is that, within a relatively short period of time, every backend in the system would pile up waiting for that spinlock and the whole system would become completely unresponsive. I mean, I know it depends on exactly which spinlock it is. But I would have thought that, if this was happening, it would be because some regular backend died in a weird way; and if that is indeed what happened, then it's likely that the other backends are doing similar kinds of work, because that's how application workloads typically behave, so they'll probably all hit the part of the code where they need that spinlock too, and now everybody's just spinning.

If it's something like a WAL receiver mutex or the checkpointer mutex or even a parallel query mutex, then I guess it would look different. But even then, what I'd expect to see is all backends of type X piling up on the stuck mutex, and when you check 'ps' or 'top', you go "oh hmm, all my WAL receivers are at 100% CPU" and you get a backtrace or an strace and you go "hmm".

Now, I agree that in this kind of scenario, where only some backends lock up, automated checks are not necessarily going to notice the problem - but a PANIC is hardly better. Now you just have a system that keeps PANICing, which liveness checks aren't necessarily going to notice either.

In all seriousness, I'd really like to understand what experience you've had that makes this check seem useful. Because I think all of my experiences with it have been bad. If they weren't, the last good one was a very long time ago.

--
Robert Haas
EDB: http://www.enterprisedb.com