Hi

Take the following cluster with:

- node1 (initial primary)
- node2 (standby)
- node3 (standby)
The following activity takes place (greatly simplified from a real-world
situation):

1. node1 is shut down.
2. node3 is promoted.
3. node2 is attached to node3.
4. node1 is attached to node3.
5. node1 is then promoted (creating a split-brain situation with node1
   and node3 as primaries).
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach to node1 as
   a standby. pg_rewind claims:

       pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
       pg_rewind: no rewind required

8. Based on that assurance, node2 is restarted with replication
   configuration pointing to node1 - but it is unable to attach, with
   node2's log reporting something like:

       new timeline 3 forked off current database system timeline 2
       before current recovery point X/XXXXXXX

The cause is that pg_rewind assumes that if the target node's last
checkpoint matches the divergence point, no rewind is needed:

    if (chkptendrec == divergerec)
        rewind_needed = false;

However, in this case there *are* WAL records beyond the last checkpoint
- which can be inferred from "minRecoveryPoint" - but this is not
checked.

The attached patch addresses this. It includes a test, which doesn't
make use of the RewindTest module, as that hard-codes a primary and a
standby, while here three nodes are needed (I can't come up with a
situation where this can be reproduced with only two nodes).

The test sets "wal_keep_size", so it would need modification for Pg12
and earlier.

Regards

Ian Barwick

--
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
commit ec0465014825628ec9b868703444214ac4738c53
Author: Ian Barwick <i...@2ndquadrant.com>
Date:   Fri Sep 11 10:13:17 2020 +0900

    pg_rewind: catch corner-case situation

    It's possible that a standby, after diverging from the source node,
    is shut down without a shutdown checkpoint record, and the divergence
    point matches a shutdown checkpoint from a previous shutdown. In this
    case the presence of WAL records beyond the shutdown checkpoint (as
    indicated by minRecoveryPoint) needs to be detected in order to
    determine whether a rewind is needed.

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e44..393c8ebbcd 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -342,13 +342,20 @@ main(int argc, char **argv)
 								  targetNentries - 1,
 								  restore_command);
 
+		/*
+		 * If the minimum recovery ending location is beyond the end of
+		 * the last checkpoint, that means there are WAL records beyond
+		 * the divergence point and a rewind is needed.
+		 */
+		if (ControlFile_target.minRecoveryPoint > chkptendrec)
+			rewind_needed = true;
 		/*
 		 * If the histories diverged exactly at the end of the shutdown
 		 * checkpoint record on the target, there are no WAL records in
 		 * the target that don't belong in the source's history, and no
 		 * rewind is needed.
 		 */
-		if (chkptendrec == divergerec)
+		else if (chkptendrec == divergerec)
 			rewind_needed = false;
 		else
 			rewind_needed = true;
diff --git a/src/bin/pg_rewind/t/007_min_recovery_point.pl b/src/bin/pg_rewind/t/007_min_recovery_point.pl
new file mode 100644
index 0000000000..ac842fd0ed
--- /dev/null
+++ b/src/bin/pg_rewind/t/007_min_recovery_point.pl
@@ -0,0 +1,118 @@
+#
+# Test situation where a target data directory contains
+# WAL records beyond both the last checkpoint and the divergence
+# point.
+#
+# This test does not make use of RewindTest as it requires three
+# nodes.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+my $node_1 = get_new_node('node_1');
+$node_1->init(allows_streaming => 1);
+$node_1->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_1->start;
+
+# Add an arbitrary table
+$node_1->safe_psql('postgres',
+	'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_1->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_2 = get_new_node('node_2');
+$node_2->init_from_backup($node_1, $backup_name,
+	has_streaming => 1);
+
+$node_2->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_2->start;
+
+# Create streaming standby from backup
+my $node_3 = get_new_node('node_3');
+$node_3->init_from_backup($node_1, $backup_name,
+	has_streaming => 1);
+
+$node_3->append_conf(
+	'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_3->start;
+
+# Stop node_1
+$node_1->stop('fast');
+
+# Promote node_3
+$node_3->promote;
+
+# node_1 rejoins node_3
+my $node_3_connstr = $node_3->connstr;
+
+$node_1->append_conf(
+	'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_1->set_standby_mode();
+$node_1->start();
+
+# node_2 follows node_3
+$node_2->append_conf(
+	'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_2->restart();
+
+# Promote node_1
+$node_1->promote;
+
+# We now have a split-brain with two primaries. Insert a row on both to
+# demonstratively create a split brain; this is not strictly necessary
+# for this test, but creates an easily identifiable WAL record and
+# enables us to verify that node_2 has the required changes to
+# reproduce the situation we're handling.
+
+$node_1->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (0)');
+$node_3->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (1)');
+
+$node_2->poll_query_until('postgres',
+	q|SELECT COUNT(*) > 0 FROM public.foo|, 't');
+
+# At this point node_2 will shut down without a shutdown checkpoint,
+# but with WAL entries beyond the preceding shutdown checkpoint.
+$node_2->stop('fast');
+$node_3->stop('fast');
+
+my $node_2_pgdata = $node_2->data_dir;
+my $node_1_connstr = $node_1->connstr;
+
+command_checks_all(
+	[
+		'pg_rewind',
+		"--source-server=$node_1_connstr",
+		"--target-pgdata=$node_2_pgdata",
+		'--dry-run'
+	],
+	0,
+	[],
+	[qr|rewinding from last common checkpoint at|],
+	'pg_rewind detects rewind needed');