Thomas Munro <thomas.mu...@gmail.com> writes:
> I wonder why the walreceiver didn't start in
> 008_min_recovery_point_node_3.log here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38

mamba has been showing intermittent failures in various replication
tests since day one.  My guess is that it's slow enough to be
particularly subject to the signal-handler race conditions that we
know exist in walreceivers and elsewhere.  (Now, it wasn't any faster
in its previous incarnation as a macOS critter.  But maybe modern
NetBSD has different scheduler behavior than ancient macOS and that
contributes somehow.  Or maybe there's some other NetBSD weirdness
in here.)

I've tried to reproduce manually, without much success :-(

Like many of its other failures, there's a suggestive postmaster
log entry at the very end:

2022-11-16 19:45:53.851 EST [2036:4] LOG: received immediate shutdown request
2022-11-16 19:45:58.873 EST [2036:5] LOG: issuing SIGKILL to recalcitrant children
2022-11-16 19:45:58.881 EST [2036:6] LOG: database system is shut down

So some postmaster child is stuck somewhere where it's not responding
to SIGQUIT.  While it's not unreasonable to guess that that's a
walreceiver, there's no hard evidence of it here.  I've been wondering
if it'd be worth patching the postmaster so that it's a bit more verbose
about which children it had to SIGKILL.  I've also wondered about
changing the SIGKILL to SIGABRT in hopes of reaping a core file that
could be investigated.

			regards, tom lane
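
For illustration only, here is a rough sketch of the "be more verbose about which children had to be killed" idea. It is not postmaster code and does not reflect how the postmaster tracks its children. The function name shutdown_children, the pid-to-Popen mapping, the grace period, and the message format are all hypothetical. It sends SIGQUIT to every child, waits for a grace period (the log above shows about five seconds between the shutdown request and the SIGKILL), then names each child that is still alive before sending the final signal.

```python
import signal
import subprocess
import sys
import time


def shutdown_children(children, grace=5.0, final_signal=signal.SIGKILL):
    """Send SIGQUIT to all children, then force the ones that ignore it.

    `children` maps pid -> subprocess.Popen (hypothetical bookkeeping).
    Returns the list of PIDs that needed `final_signal`.
    """
    for proc in children.values():
        proc.send_signal(signal.SIGQUIT)

    # Poll until every child has exited or the grace period runs out.
    deadline = time.monotonic() + grace
    pending = dict(children)
    while pending and time.monotonic() < deadline:
        for pid, proc in list(pending.items()):
            if proc.poll() is not None:  # exited and reaped
                del pending[pid]
        time.sleep(0.05)

    stuck = []
    sig_name = signal.Signals(final_signal).name
    for pid, proc in pending.items():
        if proc.poll() is not None:  # exited at the last moment
            continue
        # The extra verbosity: report exactly which child was recalcitrant.
        # (Hypothetical message format, not an actual PostgreSQL log line.)
        print(f"issuing {sig_name} to recalcitrant child PID {pid}",
              file=sys.stderr)
        proc.send_signal(final_signal)
        proc.wait()
        stuck.append(pid)
    return stuck
```

Passing final_signal=signal.SIGABRT instead of the default corresponds to the second idea: the stuck child then dies from SIGABRT, which leaves a core file on systems where core dumps are enabled.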