On 2021/04/04 11:58, osumi.takami...@fujitsu.com wrote:
IMO it's better to comment why this server restart is necessary.
If I understand correctly, this is necessary to ensure that the WAL file
containing the record of the wal_level change (to minimal) is archived,
so that the subsequent archive recovery will be able to replay it.
OK, added some comments. Also, I felt the way I had written this part was not good
and not self-evident, and developers reading this test might feel uneasy about
that point. So I tweaked the test a bit so that we can get clearer confirmation
that the WAL has actually been archived.
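Roughly, that check amounts to the following (a simplified sketch of what the
attached test does; $node is the node running with WAL archiving enabled):

# Identify the WAL segment that has to reach the archive, switch away
# from it, then poll pg_stat_archiver until it reports that segment as archived.
my $walfile = $node->safe_psql('postgres',
	"SELECT pg_walfile_name(pg_current_wal_lsn());");
$node->safe_psql('postgres', 'SELECT pg_switch_wal();');
$node->poll_query_until('postgres',
	"SELECT '$walfile' <= last_archived_wal FROM pg_stat_archiver;")
  or die "Timed out while waiting for WAL segment to be archived";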
LGTM. Thanks for updating the patch!
Attached is the updated version of the patch. I applied the following changes.
Could you review this version? Barring any objection, I'm planning to
commit it.
+sub check_wal_level
+{
+ my ($target_wal_level, $explanation) = @_;
+
+ is( $node->safe_psql(
+ 'postgres', q{show wal_level}),
+ $target_wal_level,
+ $explanation);
Do we really need this test? This test doesn't seem to be directly related
to the test purpose. It seems to be testing the behavior of the PostgresNode
functions. So I removed this from the patch.
+# This protection should apply to recovery mode
+my $another_node = get_new_node('another_node');
The same test is performed twice with different settings. So isn't it better
to merge the code for both tests into one common function, to simplify
the code? I did that.
I also applied some cosmetic changes.
By the way, when I build postgres with this patch and the --enable-coverage
option, the regression test results become unstable. Does anyone know the
reason?
When it fails, I get stderr like below
I have no idea about this. Does this happen even without the patch?
Unfortunately, no. I get this only with --enable-coverage and with my patch,
although the regression tests have passed with this patch.
OSS HEAD doesn't produce the stderr even with --enable-coverage.
Could you check whether the latest patch still causes this issue or not?
If it does, could you check which part (the change in xlog.c or
the addition of the regression test) causes the issue?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..27d9ec9f91 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6403,9 +6403,11 @@ CheckRequiredParameterValues(void)
 	 */
 	if (ArchiveRecoveryRequested && ControlFile->wal_level == WAL_LEVEL_MINIMAL)
 	{
-		ereport(WARNING,
-				(errmsg("WAL was generated with wal_level=minimal, data may be missing"),
-				 errhint("This happens if you temporarily set wal_level=minimal without taking a new base backup.")));
+		ereport(FATAL,
+				(errmsg("WAL was generated with wal_level=minimal, cannot continue recovering"),
+				 errdetail("This happens if you temporarily set wal_level=minimal on the server."),
+				 errhint("Use a backup taken after setting wal_level to higher than minimal "
+						 "or recover to the point in time before wal_level was changed to minimal even though it may cause data loss.")));
 	}
 
 	/*
@@ -6414,11 +6416,6 @@ CheckRequiredParameterValues(void)
 	 */
 	if (ArchiveRecoveryRequested && EnableHotStandby)
 	{
-		if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
-			ereport(ERROR,
-					(errmsg("hot standby is not possible because wal_level was not set to \"replica\" or higher on the primary server"),
-					 errhint("Either set wal_level to \"replica\" on the primary, or turn off hot_standby here.")));
-
 		/* We ignore autovacuum_max_workers when we make this test. */
 		RecoveryRequiresIntParameter("max_connections",
 									 MaxConnections,
diff --git a/src/test/recovery/t/024_archive_recovery.pl b/src/test/recovery/t/024_archive_recovery.pl
new file mode 100644
index 0000000000..a00314ddc6
--- /dev/null
+++ b/src/test/recovery/t/024_archive_recovery.pl
@@ -0,0 +1,95 @@
+# Test for archive recovery of WAL generated with wal_level=minimal
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+
+# Initialize and start node with wal_level = replica and WAL archiving
+# enabled.
+my $node = get_new_node('orig');
+$node->init(has_archiving => 1);
+my $replica_config = q[
+wal_level = replica
+archive_mode = on
+max_wal_senders = 10
+hot_standby = off
+];
+$node->append_conf('postgresql.conf', $replica_config);
+$node->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+
+# Restart node with wal_level = minimal and WAL archiving disabled
+# to generate WAL with that setting. Note that such WAL has not been
+# archived yet at this moment because WAL archiving is not enabled.
+$node->append_conf(
+ 'postgresql.conf', q[
+wal_level = minimal
+archive_mode = off
+max_wal_senders = 0
+]);
+$node->restart;
+
+# Restart node with wal_level = replica and WAL archiving enabled
+# to archive WAL previously generated with wal_level = minimal.
+# We ensure the WAL file containing the record indicating the change
+# of wal_level to minimal is archived by checking pg_stat_archiver.
+$node->append_conf('postgresql.conf', $replica_config);
+$node->restart;
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node->safe_psql('postgres',
+ "SELECT pg_walfile_name(pg_current_wal_lsn());");
+
+# Make WAL segment eligible for archival
+$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+my $archive_wait_query
+  = "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+
+# Wait until the WAL segment has been archived.
+$node->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+$node->stop;
+
+# Initialize new node from backup, and start archive recovery. Check that
+# archive recovery fails with an error when it detects the WAL record
+# indicating the change of wal_level to minimal and node stops.
+sub test_recovery_wal_level_minimal
+{
+	my ($node_name, $node_text, $standby_setting) = @_;
+
+	my $recovery_node = get_new_node($node_name);
+	$recovery_node->init_from_backup(
+		$node, $backup_name,
+		has_restoring => 1, standby => $standby_setting);
+
+	# Use run_log instead of recovery_node->start because this test expects
+	# that the server ends with an error during recovery.
+	run_log(
+		[ 'pg_ctl', '-D', $recovery_node->data_dir, '-l',
+		  $recovery_node->logfile, 'start' ]);
+
+	# Wait up to 180s for postgres to terminate
+	foreach my $i (0 .. 1800)
+	{
+		last if !-f $recovery_node->data_dir . '/postmaster.pid';
+		usleep(100_000);
+	}
+
+	# Confirm that the archive recovery fails with an error
+	my $logfile = slurp_file($recovery_node->logfile());
+	ok( $logfile =~
+		  qr/FATAL: WAL was generated with wal_level=minimal, cannot continue recovering/,
+		"$node_text ends with an error because it finds WAL generated with wal_level=minimal");
+}
+
+# Test for archive recovery
+test_recovery_wal_level_minimal('archive_recovery', 'archive recovery', 0);
+
+# Test for standby server
+test_recovery_wal_level_minimal('standby', 'standby', 1);