On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards. Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers. I've attached a set of reproduction
> steps. The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_process is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this. I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion. TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help. Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same way:
if you already saw a latch, you'd use a zero timeout for the system call.
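
To make the zero-timeout idea concrete, here is a minimal sketch using plain
poll(2) rather than the real WaitEventSet code. The names wait_for_events()
and latch_is_set are hypothetical and exist only for illustration; the actual
change would have to live in each of the latch.c wait implementations.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT PostgreSQL's WaitEventSetWait().
 *
 * Polls fds[0..nfds-1] and returns how many of them are ready.  If the
 * caller has already observed a set latch, we still make the system call,
 * but with a zero timeout: we never block, yet any sockets that happen to
 * be ready are reported along with the latch instead of being skipped.
 */
static int
wait_for_events(struct pollfd *fds, nfds_t nfds, bool latch_is_set,
                int timeout_ms)
{
    int         rc;

    assert(fds != NULL && nfds > 0);

    /* Latch already set: don't sleep, just collect whatever is ready. */
    if (latch_is_set)
        timeout_ms = 0;

    /* Retry if a signal interrupts the call (timeout is not adjusted). */
    do
    {
        rc = poll(fds, nfds, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    return rc;
}
```

With latch_is_set true, a caller asking for an infinite timeout (-1) gets an
immediate return either way: 0 if nothing else is pending, or the count of
ready descriptors with their revents filled in.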