Re: Strange failure on mamba

Tom Lane Thu, 17 Nov 2022 14:08:21 -0800

Thomas Munro <thomas.mu...@gmail.com> writes:
> I wonder why the walreceiver didn't start in
> 008_min_recovery_point_node_3.log here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38


mamba has been showing intermittent failures in various replication
tests since day one.  My guess is that it's slow enough to be
particularly subject to the signal-handler race conditions that we
know exist in walreceivers and elsewhere.  (Now, it wasn't any faster
in its previous incarnation as a macOS critter.  But maybe modern
NetBSD has different scheduler behavior than ancient macOS and that
contributes somehow.  Or maybe there's some other NetBSD weirdness
in here.)

I've tried to reproduce manually, without much success :-(

As with many of its other failures, there's a suggestive postmaster
log entry at the very end:

2022-11-16 19:45:53.851 EST [2036:4] LOG:  received immediate shutdown request
2022-11-16 19:45:58.873 EST [2036:5] LOG:  issuing SIGKILL to recalcitrant children
2022-11-16 19:45:58.881 EST [2036:6] LOG:  database system is shut down

So some postmaster child is stuck somewhere where it's not responding
to SIGQUIT.  While it's not unreasonable to guess that that's a
walreceiver, there's no hard evidence of it here.  I've been wondering
if it'd be worth patching the postmaster so that it's a bit more verbose
about which children it had to SIGKILL.  I've also wondered about
changing the SIGKILL to SIGABRT in hopes of reaping a core file that
could be investigated.
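
To make those two ideas concrete, here is a rough standalone sketch in
Python.  It is not postmaster code: spawn(), shutdown(), the child names,
and the per-child log text are all made up for illustration.  A parent
sends SIGQUIT to its children, waits out a grace period, reports exactly
which children are still alive, and only then sends a final signal that
can be switched from SIGKILL to SIGABRT.

```python
import os
import signal
import time


def spawn(ignore_sigquit):
    """Fork a child that sleeps forever (hypothetical stand-in for a backend)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(r)
            if ignore_sigquit:
                # Simulate a child that is stuck and not responding to SIGQUIT.
                signal.signal(signal.SIGQUIT, signal.SIG_IGN)
            os.write(w, b"x")  # tell the parent the signal setup is done
            while True:
                time.sleep(1)
        finally:
            os._exit(1)
    os.close(w)
    os.read(r, 1)  # wait until the child is ready, so the test is not racy
    os.close(r)
    return pid


def shutdown(children, grace=1.0, final_signal=signal.SIGKILL):
    """SIGQUIT every child; after `grace` seconds, name and signal the rest.

    `children` maps PID to a descriptive name.  Returns the names of the
    children that needed the final signal.
    """
    for pid in children:
        os.kill(pid, signal.SIGQUIT)

    remaining = dict(children)
    deadline = time.monotonic() + grace
    while remaining and time.monotonic() < deadline:
        for pid in list(remaining):
            done, _ = os.waitpid(pid, os.WNOHANG)
            if done:
                del remaining[pid]
        time.sleep(0.05)

    stuck = []
    for pid, name in remaining.items():
        # Hypothetical, more verbose variant of the "recalcitrant children" line.
        print(f"LOG:  issuing {final_signal.name} to recalcitrant child "
              f"{name} (PID {pid})")
        os.kill(pid, final_signal)
        os.waitpid(pid, 0)
        stuck.append(name)
    return stuck


if __name__ == "__main__":
    kids = {spawn(False): "checkpointer", spawn(True): "walreceiver"}
    shutdown(kids, grace=0.5)
```

Passing final_signal=signal.SIGABRT instead would let a stuck child dump
core, provided the core-size limit permits it.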

                        regards, tom lane
