On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001).  There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock.  In that case we need to return with the lock not
> held, so that the caller can acquire it again, but weren't.  This seems
> pretty hard to hit in practice (I suppose somebody needs to destroy the
> slot just as checkpointer killed the walsender, but before checkpointer
> marks it as its own) ... but if it did happen, I think checkpointer
> would block with no recourse.  Also added some comments and slightly
> restructured the code.
>
> There are plenty of conflicts in pg13, but it's all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running).  There are some sharp edges in this code -- I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously send a SIGSTOP to the walreceiver.  Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is.  Anyway, the code
> becomes covered with this.  I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't (it works fine for
one thousand runs, then I grab some coffee and run it once more and that
one fails.  Why?  That's not clear to me).  Attached is the last version
I have, in case somebody wants to make it better.  Maybe there's some
completely different approach that works better, but I'm out of ideas
for now.

-- 
Álvaro Herrera       Valdivia, Chile
"Experience tells us that man has peeled potatoes millions of times,
but one had to admit the possibility that in one case in millions,
the potatoes would peel the man" (Ijon Tichy)

From 6b6cb174452c553437ba7949aa25f305f599d6b7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Fri, 11 Jun 2021 12:21:45 -0400
Subject: [PATCH v3] Add test case for invalidating an active replication slot
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender was completely uncovered before.
This test involves sending SIGSTOP to a walreceiver and running a
checkpoint while advancing WAL.

I'm not very certain that this test is stable, so it's in master only,
and separate from the code-fix commit so that it can be reverted easily
if need be.

Author: Álvaro Herrera <alvhe...@alvh.no-ip.org>
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 79 ++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..dcadfe1252 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 
 use File::Path qw(rmtree);
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 17;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 10000; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,79 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walreceiver.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my $pid = $node_standby3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
+like($pid, qr/^[0-9]+$/, "have walreceiver pid $pid");
+
+# freeze walreceiver. Slot will still be active, but it won't advance
+kill 'STOP', $pid;
+$logstart = get_log_size($node_primary3);
+advance_wal($node_primary3, 2);
+
+my ($in, $out, $timer, $h);
+$timer = IPC::Run::timeout(180);
+$h = $node_primary3->background_psql('postgres', \$in, \$out, $timer);
+$in .= qq{
+checkpoint;
+};
+$h->pump_nb;
+ok(find_in_log($node_primary3, "to release replication slot"),
+	"walreceiver termination logged");
+
+advance_wal($node_primary3, 2);
+
+# Now let it continue to its demise
+kill 'CONT', $pid;
+
+$node_primary3->poll_query_until('postgres',
+	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
+	"lost")
+  or die "timed out waiting for slot to be lost";
+
+ok( find_in_log(
+		$node_primary3, 'invalidating slot "rep3" because its restart_lsn'),
+	"slot invalidation logged");
+
+
+$node_primary3->stop;
+$node_standby3->stop;
+
 #####################################
 # Advance WAL of $node by $n segments
 sub advance_wal
-- 
2.20.1
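
For anyone who wants to poke at the freeze/resume step outside the TAP
framework, here is a minimal standalone sketch.  It is not part of the
patch and makes several assumptions: the forked child is only a
hypothetical stand-in for the walreceiver, the pipe stands in for the
replication connection, and the helper name saw_output() is invented for
illustration.  It shows that a process sent SIGSTOP stays alive but makes
no progress until it receives SIGCONT, and it waits with WUNTRACED until
the child is actually stopped instead of assuming the signal took effect
immediately.

```perl
use strict;
use warnings;
use IO::Handle;
use IO::Select;
use POSIX qw(WUNTRACED);
use Time::HiRes qw(usleep);

# True if at least one byte arrives on $fh within $secs seconds.
# (Hypothetical helper, not from the PostgreSQL test suite.)
sub saw_output
{
	my ($fh, $secs) = @_;
	return 0 unless IO::Select->new($fh)->can_read($secs);
	return sysread($fh, my $buf, 4096) ? 1 : 0;
}

pipe(my $reader, my $writer) or die "pipe failed: $!";
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0)
{
	# Child: stand-in for the walreceiver.  It writes one byte every
	# 10ms until it is killed.
	close $reader;
	$writer->autoflush(1);
	while (1) { print $writer '.'; usleep(10_000); }
}
close $writer;

my $running = saw_output($reader, 5);

# Freeze the child, then block until the kernel reports it as stopped.
kill 'STOP', $pid;
waitpid($pid, WUNTRACED);

# Discard whatever was written before the stop took effect.
my $sel = IO::Select->new($reader);
while ($sel->can_read(0)) { sysread($reader, my $junk, 4096) or last; }

# A frozen child still exists (kill 0 succeeds) but produces nothing.
my $frozen = kill(0, $pid) && !saw_output($reader, 0.5) ? 1 : 0;

# Let it continue, as the test does with kill 'CONT'.
kill 'CONT', $pid;
my $resumed = saw_output($reader, 5);

# Clean up the child.
kill 'TERM', $pid;
waitpid($pid, 0);
print "running=$running frozen=$frozen resumed=$resumed\n";
```

Like the test itself, this relies on Perl's `kill` with STOP/CONT and so
is only expected to work on POSIX platforms, not on Windows.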