Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
>
> > I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
> > to pg_basebackup in $node_primary3->backup(...).
> >
> > It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> > ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
>
> Planning to commit something like the attached.
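(For context, the instrumentation boils down to elog(DEBUG2, ...) calls along
these lines; the message texts match the log excerpt below, but this is only a
sketch, not the attached patch itself.)

    /* in ReplicationSlotShmemExit(), the before_shmem_exit callback */
    elog(DEBUG2, "replication slot exit hook, %s active slot",
         MyReplicationSlot != NULL ? "with" : "without");

    /* in ReplicationSlotCleanup() */
    elog(DEBUG2, "temporary replication slot cleanup: begin");
    /* ... scan the slot array, dropping our temporary slots ... */
    elog(DEBUG2, "temporary replication slot cleanup: done");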
This did provide us a bit more detail. It seems to suggest something is holding
a problematic lock in a way that I do not understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check

2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin

That's the last message from 1997084 until the immediate shutdown.

Just checking for temporary replication slots that need to be dropped requires
ReplicationSlotControlLock. Actually dropping also requires
ReplicationSlotAllocationLock (the cleanup path's locking is sketched at the
end of this mail). 1997084 does have a temporary replication slot to clean up.

2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 019_replslot_limit.pl DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 019_replslot_limit.pl DEBUG: replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 019_replslot_limit.pl DEBUG: temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup; it does not have a temporary
replication slot. That makes it look like ReplicationSlotAllocationLock
somehow hasn't been released.

2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 019_replslot_limit.pl DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG: received immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG: server process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged amount of time. We could be more certain about that
if we shut down the cluster in fast rather than immediate mode.

So I'm thinking of doing something like

    # We've seen occasional cases where multiple walsender pids are active. An
    # immediate shutdown may hide evidence of a locking bug. So if multiple
    # walsenders are observed, shut down in fast mode, and collect some more
    # information.
    if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
    {
        my ($stdout, $stderr);
        $node_primary3->psql('postgres',
                             "\\a\\t\nSELECT * FROM pg_stat_activity",
                             stdout => \$stdout, stderr => \$stderr);
        diag $stdout, $stderr;
        $node_primary3->stop('fast');
        $node_standby3->stop('fast');
        die "could not determine walsender pid, can't continue";
    }

Does that make sense? Better ideas?

Greetings,

Andres Freund
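For reference, the relevant locking in the cleanup path looks roughly like
this (a paraphrase of the slot.c logic, not the exact source; the per-slot
spinlock is omitted):

    /*
     * Paraphrased sketch of ReplicationSlotCleanup(): scanning for our own
     * temporary slots only needs ReplicationSlotControlLock, but actually
     * dropping one goes through ReplicationSlotDropPtr(), which also needs
     * ReplicationSlotAllocationLock.
     */
    void
    ReplicationSlotCleanup(void)
    {
    restart:
        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (int i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

            if (!s->in_use || s->active_pid != MyProcPid)
                continue;

            /* one of our temporary slots; release the control lock and drop it */
            LWLockRelease(ReplicationSlotControlLock);
            ReplicationSlotDropPtr(s);  /* acquires ReplicationSlotAllocationLock */
            goto restart;
        }
        LWLockRelease(ReplicationSlotControlLock);
    }

1997083, with no slot of its own, never reaches the drop step, which is why it
gets through with just the control lock while 1997084 apparently does not.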