On 11/09/2020 09:42, Ian Barwick wrote:
Take the following cluster with:
   - node1 (initial primary)
   - node2 (standby)
   - node3 (standby)

Following activity takes place (greatly simplified from a real-world situation):

1. node1 is shut down.
2. node3 is promoted
3. node2 is attached to node3.
4. node1 is attached to node3
5. node1 is then promoted (creating a split brain situation with
    node1 and node3 as primaries)
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach
    to node1 as a standby. pg_rewind claims:

     pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
     pg_rewind: no rewind required

8. based off that assurance, node2 is restarted with replication configuration
    pointing to node1 - but it is unable to attach, with node2's log reporting
    something like:

       new timeline 3 forked off current database system timeline 2
before current recovery point X/XXXXXXX

The cause is that pg_rewind is assuming that if the node's last
checkpoint matches the
divergence point, no rewind is needed:

     if (chkptendrec == divergerec)
         rewind_needed = false;

but in this case there *are* records beyond the last checkpoint, which can be
inferred from "minRecoveryPoint" - but this is not checked.

Yep, I think you're right.

Attached patch addresses this. It includes a test, which doesn't make use of
the RewindTest module, as that hard-codes a primary and a standby, while here
three nodes are needed (I can't come up with a situation where this can be
reproduced with only two nodes). The test sets "wal_keep_size" so would need
modification for Pg12 and earlier.

I think we also need to change the extractPageMap() call:

        /*
         * Read the target WAL from last checkpoint before the point of fork, to
         * extract all the pages that were modified on the target cluster after
         * the fork. We can stop reading after reaching the final shutdown 
record.
         * XXX: If we supported rewinding a server that was not shut down 
cleanly,
         * we would need to replay until the end of WAL here.
         */
        if (showprogress)
                pg_log_info("reading WAL in target");
        extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
                                   ControlFile_target.checkPoint, 
restore_command);
        filemap_finalize();

so that it scans all the way up to minRecoveryPoint, instead of stopping at ControlFile_target.checkPoint. Otherwise, we will fail to rewind changes that happened after the last checkpoint. And we need to do that regardless of the "no rewind required" bug, any time we rewind a server that's in DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the existing tests have caught that. Am I missing something?

- Heikki


Reply via email to