Re: Race condition in recovery?

Kyotaro Horiguchi Fri, 21 May 2021 00:49:44 -0700

At Fri, 21 May 2021 11:21:05 +0900 (JST), Kyotaro Horiguchi 
<horikyota....@gmail.com> wrote in 
> At Thu, 20 May 2021 13:49:10 -0400, Robert Haas <robertmh...@gmail.com> wrote 
> in 
>   In the case of (c) recoveryTargetTLI > checkpoint TLI.  In this case
>   we expecte that checkpint TLI is in the history of
>   recoveryTargetTLI. Otherwise recovery failse.  This case is similar
>   to the case (a) but the relationship between recoveryTargetTLI and
>   the checkpoint TLI is not confirmed yet. ReadRecord barks later if
>   they are not compatible so there's not a serious problem but might
>   be better checking the relation ship there.  My first proposal
>   performed mutual check between the two but we need to check only
>   unidirectionally.
> 
>   if (readFile < 0)
>   {
>      if (!expectedTLEs)
>        {
>           expectedTLEs = readTimeLineHistory(receiveTLI);
> +       if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
> +          ereport(ERROR, "the received timeline %d is not found in the 
> history file for timeline %d");
> 
> 
> > > Conclusion:
> > > - I think now we agree on the point that initializing expectedTLEs
> > > with the recovery target timeline is the right fix.
> > > - We still have some differences of opinion about what was the
> > > original problem in the base code which was fixed by the commit
> > > (ee994272ca50f70b53074f0febaec97e28f83c4e).
> > 
> > I am also still concerned about whether we understand in exactly what
> > cases the current logic doesn't work. We seem to more or less agree on
> > the fix, but I don't think we really understand precisely what case we
> > are fixing.
> 
> Does the discussion above make sense?


This is a revised version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 6664808340494dffc775e21cfa1029d6dfb8452e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota....@gmail.com>
Date: Fri, 21 May 2021 16:24:14 +0900
Subject: [PATCH v3] Set expectedTLEs correctly based on recoveryTargetTLI

When a standby started streaming before it determines expectedTLEs, it
is determined based on the timeline of the WAL segment just streamed
from the primary.  If the standby fetches the older timeline than
recoveryTargetTLI for thestarting checkpoint, this behavior leads to
setting expectedTLEs based on the older timeline than
recoveryTargetTLI.  It is inconsistent with the definition of the
variable and prevents later calls to rescanLatestTimeLine from
updating recoveryTargetTLI and expectedTLEs correctly. If that happens
the standby stops following the upstream.

This behavior has been introduced by commit ee994272ca but there seems
not a reason for the behavior and looks like a typo. Since the
relationship between the two timeline IDs is not verified until then,
also add a check for the relationship for safety's sake.
---
 src/backend/access/transam/xlog.c          | 35 ++++++++++----
 src/test/recovery/t/004_timeline_switch.pl | 55 +++++++++++++++++++++-
 2 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 441a9124cd..fda729c63f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12659,22 +12659,37 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
-						{
-							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
-							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
-						}
-						else
+						if (readFile >= 0)
 						{
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
 							return true;
 						}
+
+						readFile = XLogFileRead(readSegNo, PANIC,
+												receiveTLI,
+												XLOG_FROM_STREAM, false);
+						Assert(readFile >= 0);
+
+						if (expectedTLEs)
+							break;
+
+						expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+
+						/*
+						 * We haven't checked the relationship beween the
+						 * receiveTLI and recoveryTargetTLI. Make sure that
+						 * receiveTLI is in the history for
+						 * recoveryTargetTLI. We don't need to do that
+						 * hearafter since recoveryTargetTLI and expectedTLEs
+						 * will be kept in sync.
+						 */
+						if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
+							ereport(ERROR,
+									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									 errmsg("hisotry file of timeline %d is incosistent with the current timeline %d",
+											recoveryTargetTLI, receiveTLI)));
 						break;
 					}
 
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index c101980e9e..357f3bf8fa 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -90,7 +90,14 @@ $node_primary_2->backup($backup_name);
 # Create standby node
 my $node_standby_3 = get_new_node('standby_3');
 $node_standby_3->init_from_backup($node_primary_2, $backup_name,
-	has_streaming => 1);
+	has_streaming => 1, allows_streaming => 1, has_archiving => 1);
+my $archivedir_standby_3 = $node_standby_3->archive_dir;
+$node_standby_3->append_conf(
+	'postgresql.conf', qq(
+archive_mode = always
+archive_command = 'cp "%p" "$archivedir_standby_3/%f"'
+log_line_prefix = '%m [%p:%b] %q%a '
+));
 
 # Restart primary node in standby mode and promote it, switching it
 # to a new timeline.
@@ -109,3 +116,49 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay',
 my $result_2 =
   $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result_2, qq(1), 'check content of standby 3');
+
+
+# Check that server can start from a backup took from a standby that
+# contains timeline switch midst of the first segment.
+
+# setup a standby connects to standby_3
+my $node_standby_4 = get_new_node('standby_4');
+$node_standby_4->init_from_backup($node_primary_2, $backup_name,
+								  has_streaming => 1);
+$node_standby_4->start;
+
+$node_primary_2->psql('postgres', qq[
+					  CREATE TABLE t (a int);
+					  SELECT pg_switch_wal();
+					  INSERT INTO t VALUES (0);
+					  CHECKPOINT;
+]);
+$node_primary_2->wait_for_catchup($node_standby_3, 'write',
+								$node_primary_2->lsn('insert'));
+$node_primary_2->wait_for_catchup($node_standby_4, 'write',
+								$node_primary_2->lsn('insert'));
+
+# promote standby_3 then connect standby_4 to standby_3. To enforce
+# archive fetching, fluash pg_wal files of the new standby.
+$node_standby_4->stop;
+$node_standby_3->psql('postgres', 'SELECT pg_promote()');
+$node_standby_4->enable_streaming($node_standby_3);
+$node_standby_4->append_conf(
+	'postgresql.conf',
+	qq[restore_command = 'cp "$archivedir_standby_3/%f" "%p"']);
+
+# clean up pg_wal on the second standby
+my $pgwaldir = $node_standby_4->data_dir. "/pg_wal";
+opendir my $dh, $pgwaldir or die "failed to open $pgwaldir";
+while (my $f = readdir($dh))
+{
+	unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f");
+}
+closedir($dh);
+
+$node_standby_4->start;
+
+# check if replication is working
+$node_standby_3->psql('postgres', 'INSERT INTO t VALUES(1)');
+$node_standby_3->wait_for_catchup($node_standby_4, 'write',
+								  $node_standby_3->lsn('insert'));
-- 
2.27.0

Re: Race condition in recovery?

Reply via email to