Re: PITR promote bug: Checkpointer writes to older timeline

Michael Paquier Wed, 17 Mar 2021 20:56:39 -0700

On Wed, Mar 17, 2021 at 05:09:50PM +0900, Michael Paquier wrote:
> Currently with HEAD and back branches, nothing would be broken as
> logical contexts cannot exist in recovery.  Still it would be easy
> to miss the new behavior for anybody attempting to work more on this
> feature in the future if we change read_local_xlog_page to not update
> ThisTimeLineID anymore.  Resetting the value of ThisTimeLineID in
> read_local_xlog_page() does not seem completely right either with this
> argument, as they could be some custom code relying on the existing
> behavior of read_local_xlog_page() to maintain ThisTimeLineID.


I was looking at uses of ThisTimeLineID in the wild, and could not
find it getting checked or used actually in backend-side code that
involved the WAL reader facility.  Even if it brings confidence, it
does not mean that it is not used somewhere :/

> Hmmm.  I am wondering whether the best answer for the moment would not
> be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as
> that's the local change that uses read_local_xlog_page().

And attached is the patch able to achieve that.  At least it is
simple, and does not break the actual assumptions this callback relies
on.  This is rather weak though if there are errors as this is out of
critical sections, still the disease is worse.  I have double-checked
all the existing backend code that uses XLogReadRecord(), and did not
notice any code paths with issues similar to this one.

> The state of the code is really confusing on HEAD, and I'd like to
> think that the best thing we could do in the long-term is to make the
> logical decoding path not rely on ThisTimeLineID at all and decouple
> all that, putting the code in a state sane enough so as we don't
> finish with similar errors as what is discussed on this thread.  That
> would be a work for a different patch though, not for stable
> branches.  And seeing some slot and advancing functions update
> directly ThisTimeLineID is no good either..

However, I'd like to think that we should completely untie the
dependency to ThisTimeLineID in any page read callbacks in core in the
long-term, and potentially clean up any assumptions behind timeline
jumps while in recovery for logical contexts as that cannot happen.
At this stage of the 14 dev cycle, that would be material for 15~, but
I also got to wonder if there is work going on to support logical
decoding on standbys, in particular if this would really rely on
ThisTimeLineID.

Thoughts are welcome.
--
Michael

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16f..80be6cc7a4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1316,7 +1316,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
  * twophase files and ReadTwoPhaseFile should be used instead.
  *
  * Note clearly that this function can access WAL during normal operation,
- * similarly to the way WALSender or Logical Decoding would do.
+ * similarly to the way WALSender or Logical Decoding would do.  While
+ * accessing WAL, read_local_xlog_page() may change ThisTimeLineID,
+ * particularly if this routine is called for the end-of-recovery checkpoint
+ * in the checkpointer itself, so save its existing value and restore it
+ * once done.
  */
 static void
 XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
@@ -1324,6 +1328,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	TimeLineID	save_currtli = ThisTimeLineID;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 									XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1338,6 +1343,14 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 
 	XLogBeginRead(xlogreader, lsn);
 	record = XLogReadRecord(xlogreader, &errormsg);
+
+	/*
+	 * Restore immediately the timeline where it was previously, as
+	 * read_local_xlog_page() could have changed it if the record was read
+	 * while recovery was finishing or if the timeline has jumped in-between.
+	 */
+	ThisTimeLineID = save_currtli;
+
 	if (record == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 0000000000..3a7907b2a0
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,86 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+
+# Initialize and start primary node with WAL archiving
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+max_wal_senders = 10
+wal_level = 'replica'
+max_prepared_transactions = 10});
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Initialize node for PITR targeting a very specific restore point, just
+# after a PREPARE TRANSACTION is issued so as we finish with a promoted
+# node where this 2PC transaction needs an explicit COMMIT PREPARED.
+my $node_pitr = get_new_node('node_pitr');
+$node_pitr->init_from_backup(
+	$node_primary, $backup_name,
+	standby       => 0,
+	has_restoring => 1);
+$node_pitr->append_conf('postgresql.conf', qq{
+max_prepared_transactions = 10
+recovery_target_name = 'rp'
+recovery_target_action = 'promote'});
+
+# Workload with a prepared transaction and the target restore point.
+$node_primary->psql(
+	'postgres', qq{
+CREATE TABLE foo(i int);
+BEGIN;
+INSERT INTO foo VALUES(1);
+PREPARE TRANSACTION 'fooinsert';
+SELECT pg_create_restore_point('rp');
+INSERT INTO foo VALUES(2);
+});
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+
+# Make WAL segment eligible for archival
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait until the WAL segment has been archived.
+my $archive_wait_query =
+  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+$node_primary->poll_query_until('postgres', $archive_wait_query)
+  or die "Timed out while waiting for WAL segment to be archived";
+my $last_archived_wal_file = $walfile_to_be_archived;
+
+# Now start the PITR node.
+$node_pitr->start;
+
+# Wait until the PITR node exits recovery.
+$node_pitr->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+  or die "Timed out while waiting for PITR promotion";
+
+# Ensure that we didn't write to the older timeline during PITR promotion by
+# checking that the last archived WAL segment was not overwritten during
+# recovery
+my $archive_dir           = $node_primary->archive_dir;
+my $archive_wal_file_path = "$archive_dir/$last_archived_wal_file";
+my $node_pitr_data        = $node_pitr->data_dir;
+my $local_wal_file_path   = "$node_pitr_data/pg_wal/$last_archived_wal_file";
+is(compare($archive_wal_file_path, $local_wal_file_path),
+	qq{0}, "check if the last archived WAL file was overwritten");
+
+# Commit the prepared transaction in the latest timeline and check its result.
+# There should only be one row in the table, coming from the prepared
+# transaction).  The row from the INSERT after the restore point should not
+# show up, since our recovery target was older than the second INSERT done.
+$node_pitr->psql(
+	'postgres', qq{
+COMMIT PREPARED 'fooinsert';
+});
+my $result = $node_pitr->safe_psql('postgres', "SELECT * FROM foo;");
+is($result, qq{1}, "check table contents after COMMIT PREPARED");

signature.asc
Description: PGP signature

Re: PITR promote bug: Checkpointer writes to older timeline

Reply via email to