On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@gmail.com> writes:
> > I wonder why the walreceiver didn't start in
> > 008_min_recovery_point_node_3.log here:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38
>
> mamba has been showing intermittent failures in various replication
> tests since day one.  My guess is that it's slow enough to be
> particularly subject to the signal-handler race conditions that we
> know exist in walreceivers and elsewhere.  (Now, it wasn't any faster
> in its previous incarnation as a macOS critter.  But maybe modern
> NetBSD has different scheduler behavior than ancient macOS and that
> contributes somehow.  Or maybe there's some other NetBSD weirdness
> in here.)
>
> I've tried to reproduce manually, without much success :-(
>
> Like many of its other failures, there's a suggestive postmaster
> log entry at the very end:
>
> 2022-11-16 19:45:53.851 EST [2036:4] LOG:  received immediate shutdown request
> 2022-11-16 19:45:58.873 EST [2036:5] LOG:  issuing SIGKILL to recalcitrant children
> 2022-11-16 19:45:58.881 EST [2036:6] LOG:  database system is shut down
>
> So some postmaster child is stuck somewhere where it's not responding
> to SIGQUIT.  While it's not unreasonable to guess that that's a
> walreceiver, there's no hard evidence of it here.  I've been wondering
> if it'd be worth patching the postmaster so that it's a bit more verbose
> about which children it had to SIGKILL.  I've also wondered about
> changing the SIGKILL to SIGABRT in hopes of reaping a core file that
> could be investigated.

I wonder if it's a runtime variant of the other problem. We do load_file("libpqwalreceiver", false) before unblocking signals but maybe don't resolve the symbols until calling them, or something like that...
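
For reference, the ordering described above (do the setup work while signals are blocked, unblock afterwards) looks roughly like this in plain POSIX C. This is only an illustrative sketch, not PostgreSQL code: run_with_sigquit_blocked(), handle_sigquit() and setup_that_gets_signalled() are made-up names, and the setup callback merely stands in for something like the library load. It doesn't demonstrate lazy symbol resolution itself, only that a signal arriving during the blocked section stays pending until the mask is restored.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
volatile sig_atomic_t got_sigquit = 0;

void
handle_sigquit(int signo)
{
    (void) signo;
    got_sigquit = 1;
}

/*
 * Hypothetical helper: install a SIGQUIT handler, run setup() with SIGQUIT
 * blocked, then restore the previous signal mask.
 *
 * Returns 1 if a SIGQUIT was pending when setup() finished, 0 if not,
 * and -1 if any of the signal calls failed.
 */
int
run_with_sigquit_blocked(void (*setup) (void))
{
    struct sigaction sa;
    sigset_t    block_set;
    sigset_t    old_set;
    sigset_t    pending_set;
    int         was_pending;

    sa.sa_handler = handle_sigquit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGQUIT, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block_set);
    sigaddset(&block_set, SIGQUIT);
    if (sigprocmask(SIG_BLOCK, &block_set, &old_set) != 0)
        return -1;

    /* Stand-in for setup work such as loading a shared library. */
    setup();

    if (sigpending(&pending_set) != 0)
        return -1;
    was_pending = sigismember(&pending_set, SIGQUIT);

    /* Restoring the old mask lets any pending SIGQUIT reach the handler. */
    if (sigprocmask(SIG_SETMASK, &old_set, NULL) != 0)
        return -1;

    return was_pending;
}

/* Example setup callback that gets "interrupted": it signals itself. */
void
setup_that_gets_signalled(void)
{
    raise(SIGQUIT);
}
```

Calling run_with_sigquit_blocked(setup_that_gets_signalled) leaves the handler un-run for the duration of the callback and runs it once the mask is restored. Anything that only happens after that point (which is what deferred symbol binding would amount to, if that's what is going on) is no longer covered by the blocked section.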