On Mon, Apr 11, 2022 at 06:48:58PM +1200, Thomas Munro wrote: > 1. This test had some pre-existing bugs/races, which hadn't failed > before due to scheduling, even under Valgrind. The above changes > appear to fix those problems. To Michael for comment.
Yeah, there are two problems here. From what I can see, ensuring the execution of archive_cleanup_command on the standby needs the checkpoint on the primary and the restart point on the standby. So pg_current_wal_lsn() should be located after the primary's checkpoint and not before it so as we are sure that the checkpoint records finds its way to the standby. That's what Tom mentioned upthread. The second problem is to make sure that $standby2 sees the promotion of $standby and its history file, but we also want to recover 00000002.history from some archives to create a RECOVERYHISTORY at recovery for the purpose of the test. Switching to a new segment as proposed by Andres does not seem completely right to me because we are not 100% sure of the ordering an archive is going to happen, no? I think that the logic to create $standby2 from the initial backup of the primary is right, because there is no 00000002.history in it, but we also need to be sure that 00000002.history has been archived once the promotion of $standby is done. This can be validated thanks to the logs, actually. >> What is that second test really testing? >> >> # Check the presence of temporary files specifically generated during >> # archive recovery. To ensure the presence of the temporary history >> # file, switch to a timeline large enough to allow a standby to recover >> # a history file from an archive. As this requires at least two timeline >> # switches, promote the existing standby first. Then create a second >> # standby based on the promoted one. Finally, the second standby is >> # promoted. >> >> Note "Then create a second standby based on the promoted one." - but that's >> not actually what's happening: > > 2. There may also be other problems with the test but those aren't > relevant to skink's failure, which starts on the 5th test. To Michael > for comment. This comes from df86e52, where we want to recovery a history file that would be created as RECOVERYHISTORY and make sure that the file gets removed at the end of recovery. So $standby2 should choose a new timeline different from the one of chosen by $standby. Looking back at what has been done, it seems to me that the comment is the incorrect part: https://www.postgresql.org/message-id/20190930080340.go2...@paquier.xyz All that stuff leads me to the attached. Thoughts? -- Michael
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl index c8f5ffbaf0..45aafcb35c 100644 --- a/src/test/recovery/t/002_archiving.pl +++ b/src/test/recovery/t/002_archiving.pl @@ -24,6 +24,8 @@ $node_primary->backup($backup_name); # Initialize standby node from backup, fetching WAL from archives my $node_standby = PostgreSQL::Test::Cluster->new('standby'); +# Note that this makes the standby archive its contents on the archives +# of the primary. $node_standby->init_from_backup($node_primary, $backup_name, has_restoring => 1); $node_standby->append_conf('postgresql.conf', @@ -44,13 +46,16 @@ $node_standby->start; # Create some content on primary $node_primary->safe_psql('postgres', "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a"); -my $current_lsn = - $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();"); # Note the presence of this checkpoint for the archive_cleanup_command # check done below, before switching to a new segment. $node_primary->safe_psql('postgres', "CHECKPOINT"); +# Done after the checkpoint to ensure that the checkpoint gets replayed +# on the standby. +my $current_lsn = + $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();"); + # Force archiving of WAL file to make it present on primary $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()"); @@ -81,10 +86,34 @@ ok( !-f "$data_dir/$recovery_end_command_file", # file, switch to a timeline large enough to allow a standby to recover # a history file from an archive. As this requires at least two timeline # switches, promote the existing standby first. Then create a second -# standby based on the promoted one. Finally, the second standby is -# promoted. +# standby based on the primary, using its archives. Finally, the second +# standby is promoted. $node_standby->promote; +# Wait until the history file has been archived on the archives of the +# primary once the promotion of the standby completes. +my $primary_archive = $node_primary->archive_dir; +my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default; +my $attempts = 0; +while ($attempts < $max_attempts) +{ + last if (-e "$primary_archive/00000002.history"); + + # Wait 0.1 second before retrying. + usleep(100_000); + + $attempts++; +} + +if ($attempts >= $max_attempts) +{ + die "timed out waiting for 00000002.history\n"; +} +else +{ + note "found 00000002.history after $attempts attempts\n" +} + # recovery_end_command should have been triggered on promotion. ok( -f "$data_dir/$recovery_end_command_file", 'recovery_end_command executed after promotion'); @@ -108,14 +137,19 @@ my $log_location = -s $node_standby2->logfile; # Now promote standby2, and check that temporary files specifically # generated during archive recovery are removed by the end of recovery. $node_standby2->promote; + +# Check the logs of the standby to see that the commands have failed. +my $log_contents = slurp_file($node_standby2->logfile, $log_location); my $node_standby2_data = $node_standby2->data_dir; + +like( + $log_contents, + qr/restored log file "00000002.history" from archive/s, + "00000002.history retrieved from the archives"); ok( !-f "$node_standby2_data/pg_wal/RECOVERYHISTORY", "RECOVERYHISTORY removed after promotion"); ok( !-f "$node_standby2_data/pg_wal/RECOVERYXLOG", "RECOVERYXLOG removed after promotion"); - -# Check the logs of the standby to see that the commands have failed. -my $log_contents = slurp_file($node_standby2->logfile, $log_location); like( $log_contents, qr/WARNING:.*recovery_end_command/s,
signature.asc
Description: PGP signature