Hi

Take the following cluster with:

- node1 (initial primary)
- node2 (standby)
- node3 (standby)
The following activity takes place (greatly simplified from a real-world
situation):

1. node1 is shut down.
2. node3 is promoted.
3. node2 is attached to node3.
4. node1 is attached to node3.
5. node1 is then promoted (creating a split-brain situation with node1
   and node3 as primaries).
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach to node1 as
   a standby. pg_rewind claims:

       pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
       pg_rewind: no rewind required

8. Based on that assurance, node2 is restarted with replication
   configuration pointing to node1 - but it is unable to attach, with
   node2's log reporting something like:

       new timeline 3 forked off current database system timeline 2
       before current recovery point X/XXXXXXX

The cause is that pg_rewind assumes that if the target node's last
checkpoint matches the divergence point, no rewind is needed:

    if (chkptendrec == divergerec)
        rewind_needed = false;

However, in this case there *are* WAL records beyond the last checkpoint
- which can be inferred from "minRecoveryPoint" - but this is not
checked.

The attached patch addresses this. It includes a test, which doesn't
make use of the RewindTest module, as that hard-codes a primary and a
standby, while here three nodes are needed (I can't come up with a
situation where this can be reproduced with only two nodes).

The test sets "wal_keep_size", so it would need modification for Pg12
and earlier.

Regards

Ian Barwick

--
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
commit ec0465014825628ec9b868703444214ac4738c53
Author: Ian Barwick <i...@2ndquadrant.com>
Date:   Fri Sep 11 10:13:17 2020 +0900

    pg_rewind: catch corner-case situation

    It's possible that a standby, after diverging from the source node,
    is shut down without a shutdown checkpoint record, and the divergence
    point matches a shutdown checkpoint from a previous shutdown. In this
    case the presence of WAL records beyond the shutdown checkpoint (as
    indicated by minRecoveryPoint) needs to be detected in order to
    determine whether a rewind is needed.

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e44..393c8ebbcd 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -342,13 +342,20 @@ main(int argc, char **argv)
 								  targetNentries - 1,
 								  restore_command);
 
+		/*
+		 * If the minimum recovery ending location is beyond the end of
+		 * the last checkpoint, that means there are WAL records beyond
+		 * the divergence point and a rewind is needed.
+		 */
+		if (ControlFile_target.minRecoveryPoint > chkptendrec)
+			rewind_needed = true;
 		/*
 		 * If the histories diverged exactly at the end of the shutdown
 		 * checkpoint record on the target, there are no WAL records in
 		 * the target that don't belong in the source's history, and no
 		 * rewind is needed.
 		 */
-		if (chkptendrec == divergerec)
+		else if (chkptendrec == divergerec)
 			rewind_needed = false;
 		else
 			rewind_needed = true;
diff --git a/src/bin/pg_rewind/t/007_min_recovery_point.pl b/src/bin/pg_rewind/t/007_min_recovery_point.pl
new file mode 100644
index 0000000000..ac842fd0ed
--- /dev/null
+++ b/src/bin/pg_rewind/t/007_min_recovery_point.pl
@@ -0,0 +1,118 @@
+#
+# Test situation where a target data directory contains
+# WAL records beyond both the last checkpoint and the divergence
+# point.
+#
+# This test does not make use of RewindTest as it requires three
+# nodes.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+my $node_1 = get_new_node('node_1');
+$node_1->init(allows_streaming => 1);
+$node_1->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_1->start;
+
+# Add an arbitrary table
+$node_1->safe_psql('postgres',
+	'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_1->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_2 = get_new_node('node_2');
+$node_2->init_from_backup($node_1, $backup_name,
+	has_streaming => 1);
+
+$node_2->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_2->start;
+
+# Create streaming standby from backup
+my $node_3 = get_new_node('node_3');
+$node_3->init_from_backup($node_1, $backup_name,
+	has_streaming => 1);
+
+$node_3->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_3->start;
+
+# Stop node_1
+$node_1->stop('fast');
+
+# Promote node_3
+$node_3->promote;
+
+# node_1 rejoins node_3
+my $node_3_connstr = $node_3->connstr;
+
+$node_1->append_conf(
+	'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_1->set_standby_mode();
+$node_1->start();
+
+# node_2 follows node_3
+$node_2->append_conf(
+	'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_2->restart();
+
+# Promote node_1
+$node_1->promote;
+
+# We now have a split-brain with two primaries. Insert a row on both to
+# demonstratively create a split brain; this is not strictly necessary
+# for this test, but creates an easily identifiable WAL record and
+# enables us to verify that node_2 has the required changes to
+# reproduce the situation we're handling.
+
+$node_1->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (0)');
+$node_3->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (1)');
+
+$node_2->poll_query_until('postgres',
+	q|SELECT COUNT(*) > 0 FROM public.foo|, 't');
+
+# At this point node_2 will shut down without a shutdown checkpoint,
+# but with WAL entries beyond the preceding shutdown checkpoint.
+$node_2->stop('fast');
+$node_3->stop('fast');
+
+my $node_2_pgdata = $node_2->data_dir;
+my $node_1_connstr = $node_1->connstr;
+
+command_checks_all(
+	[
+		'pg_rewind',
+		"--source-server=$node_1_connstr",
+		"--target-pgdata=$node_2_pgdata",
+		'--dry-run'
+	],
+	0,
+	[],
+	[qr|rewinding from last common checkpoint at|],
+	'pg_rewind detects rewind needed');