On Wed, Aug 14, 2024 at 9:30 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
> Another concern is the huge number of PqMsg_Progress messages sent by
> parallel workers with that approach. In Bertrand's tests, he was seeing
> nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
> second). That seems a bit extreme to me. I don't see how anyone could
> possibly need stats about vacuum delays with that level of accuracy.

I suspect CF #5118 would fix lots of cases of ProcSignal() senders going berserk, because it deletes SendProcSignal() and introduces SendInterrupt(), which calls SetLatch(), which doesn't send a signal if the latch is already set. Even if the latch is not already set, it only sends a signal if the latch is currently being waited on (the "maybe_sleeping" flag). Even when it does send a signal, that signal goes to a signalfd, kqueue or NT event flag on common platforms.

Of course that is only talking about the receiving side. I'm sure we can improve the senders too. There's nothing we can do about NOTIFY, because that's under user control, but that PqMsg_Progress case sounds pretty bad, and the recovery conflict system could probably be made more precise in its logic about who to wake up and when, etc.

Other backends going bananas with SendProcSignal() is the reason dsm_impl_posix_resize() has to block signals while calling posix_fallocate(). Unlike nanosleep(), where you can cope with EINTR by tracking the remaining time, posix_fallocate() is all-or-nothing: it has no way to report partial progress, so it must undo its work if interrupted. Its EINTR retry loop could therefore get stuck forever when other backends are trigger-happy with signals, which was a real production issue.

I guess both of these issues go away in practice if CF #5118 goes in.
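
To make the nanosleep() vs posix_fallocate() difference concrete, here is a rough standalone sketch using plain POSIX calls. It is not the PostgreSQL source: sleep_fully() and fallocate_signals_blocked() are hypothetical names made up for illustration, and the real dsm_impl_posix_resize() uses PostgreSQL's own signal masks rather than sigfillset().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper: sleep for the whole interval even under a signal
 * storm.  Every EINTR hands back the time still owed, so each retry asks
 * for less and the loop always makes forward progress.
 */
static int
sleep_fully(struct timespec request)
{
    struct timespec remaining;

    while (nanosleep(&request, &remaining) != 0)
    {
        if (errno != EINTR)
            return -1;          /* real failure; errno is set */
        request = remaining;    /* resume with what is left */
    }
    return 0;
}

/*
 * Hypothetical helper: posix_fallocate() cannot report partial progress,
 * so a retry after EINTR starts again from scratch.  Blocking signals
 * around the call keeps the retry loop from spinning forever.
 *
 * Returns 0 on success or an error number.  Note that posix_fallocate()
 * returns the error number directly instead of setting errno.
 */
static int
fallocate_signals_blocked(int fd, off_t offset, off_t len)
{
    sigset_t    all_signals;
    sigset_t    saved_mask;
    int         rc;

    sigfillset(&all_signals);
    if (sigprocmask(SIG_BLOCK, &all_signals, &saved_mask) != 0)
        return errno;

    /* EINTR should not happen with signals blocked; loop as a safeguard. */
    do
    {
        rc = posix_fallocate(fd, offset, len);
    } while (rc == EINTR);

    /* Restore the caller's signal mask whether or not the call succeeded. */
    sigprocmask(SIG_SETMASK, &saved_mask, NULL);
    return rc;
}
```

The second helper only treats the symptom on the receiving side, which is why cutting down the number of signals sent in the first place is the more attractive fix.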