On Fri, Sep 2, 2022 at 6:20 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> ... The active ingredient here is a setting of
> maintenance_io_concurrency=0, which runs into a dumb accounting problem
> of the fencepost variety and incorrectly concludes it's reached the
> end early.  Setting it to 3 or higher allows his system to complete
> recovery.  I'm working on a fix ASAP.

The short version is that when tracking the number of IOs in progress,
I had two steps in the wrong order in the algorithm for figuring out
whether IO is saturated.  Internally, the effect of
maintenance_io_concurrency is clamped to 2 or more, and that mostly
hides the bug until you try to replay a particular sequence like
Justin's with such a low setting.  Without that clamp, and if you set
it to 1, then several of our recovery tests fail.
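To illustrate the ordering problem in isolation, here's a minimal sketch (invented names, not the actual xlogprefetcher code): "inflight" counts prefetch IOs issued for WAL not yet replayed, and the bug is checking for saturation before retiring completed IOs, which hits a fencepost at low limits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the accounting bug.  Names are
 * invented for illustration; this is not the real lrq code. */
typedef struct
{
	int			inflight;		/* IOs issued but not yet retired */
	int			max_inflight;	/* effective maintenance_io_concurrency */
} PrefetchState;

/*
 * Buggy order: test for saturation before retiring completed IOs.  With
 * max_inflight == 1, one IO from an earlier record makes this conclude
 * "saturated" even though that IO has already completed.
 */
static bool
can_start_more_buggy(PrefetchState *s, int completed)
{
	if (s->inflight >= s->max_inflight)
		return false;			/* gave up too early */
	s->inflight -= completed;
	return true;
}

/* Fixed order: retire completed IOs first, then check saturation. */
static bool
can_start_more_fixed(PrefetchState *s, int completed)
{
	s->inflight -= completed;	/* process completions first */
	return s->inflight < s->max_inflight;
}
```

With max_inflight = 1 and one completed IO outstanding, the buggy version refuses to start more work while the fixed version proceeds, which is why the clamp to 2 mostly hid the problem.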

That clamp was a bad idea.  What I think we really want is for
maintenance_io_concurrency=0 to disable recovery prefetching exactly
as if you'd set recovery_prefetch=off, and any other setting including
1 to work without clamping.
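The intended semantics can be sketched like this (a standalone restatement of the RecoveryPrefetchEnabled() check in the patch below, with the GUCs passed as parameters for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirroring the recovery_prefetch GUC's enum. */
enum
{
	RECOVERY_PREFETCH_OFF,
	RECOVERY_PREFETCH_ON,
	RECOVERY_PREFETCH_TRY,
};

/*
 * Prefetching is enabled only if both GUCs allow it: setting
 * maintenance_io_concurrency to 0 now disables it exactly like
 * recovery_prefetch=off, and any nonzero value is used directly,
 * without clamping.
 */
static bool
recovery_prefetch_enabled(int recovery_prefetch,
						  int maintenance_io_concurrency)
{
	return recovery_prefetch != RECOVERY_PREFETCH_OFF &&
		maintenance_io_concurrency > 0;
}
```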

Here's the patch I'm currently testing.  It also fixes a related
dangling reference problem with very small maintenance_io_concurrency.

I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl.  I
think that failure case should report an error, no?  I find it strange
that we end recovery in silence.  That was a problem for the new
coding in this patch, because it is confused by XLREAD_FAIL without
queuing an error, and then retries, which clobbers the aborted recptr
state.  I'm still looking into that.
From fabc519d39adcb65996031059dbf42c8e3623764 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Mon, 5 Sep 2022 12:07:24 +1200
Subject: [PATCH] Fix recovery_prefetch with low maintenance_io_concurrency.

The LSN-based logic for knowing when IOs complete ran operations in the
wrong order.  We should process completed IOs first before trying to
start more, so that it is always possible to decode one more record when
the decoded record queue is empty, even if maintenance_io_concurrency is
set so low that a single earlier WAL record might have saturated the IO
queue.

That thinko was hidden because the effect of maintenance_io_concurrency
was arbitrarily clamped to be at least 2.  Fix the ordering, and also
remove that clamp.  We need a special case for 0, which is now treated
the same as recovery_prefetch=off, but otherwise the number is used
directly.  This allows for testing with 1, which would have made the
problem obvious in simple test scenarios.

Back-patch to 15.

Reported-by: Justin Pryzby <pry...@telsasoft.com>
Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
---
 src/backend/access/transam/xlogprefetcher.c | 58 ++++++++++++++-------
 src/backend/access/transam/xlogreader.c     |  4 ++
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 9aa56411d5..96038448cc 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -72,7 +72,9 @@
 int			recovery_prefetch = RECOVERY_PREFETCH_TRY;
 
 #ifdef USE_PREFETCH
-#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#define RecoveryPrefetchEnabled() \
+		(recovery_prefetch != RECOVERY_PREFETCH_OFF && \
+		 maintenance_io_concurrency > 0)
 #else
 #define RecoveryPrefetchEnabled() false
 #endif
@@ -1000,7 +1002,8 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 
 		if (RecoveryPrefetchEnabled())
 		{
-			max_inflight = Max(maintenance_io_concurrency, 2);
+			Assert(maintenance_io_concurrency > 0);
+			max_inflight = maintenance_io_concurrency;
 			max_distance = max_inflight * XLOGPREFETCHER_DISTANCE_MULTIPLIER;
 		}
 		else
@@ -1018,14 +1021,39 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	}
 
 	/*
-	 * Release last returned record, if there is one.  We need to do this so
-	 * that we can check for empty decode queue accurately.
+	 * Release last returned record, if there is one, and perform tasks that
+	 * we can do now that the caller has replayed it.
 	 */
-	XLogReleasePreviousRecord(prefetcher->reader);
+	if (likely(record = prefetcher->reader->record))
+	{
+		XLogRecPtr	replayed_up_to = record->next_lsn;
+
+		XLogReleasePreviousRecord(prefetcher->reader);
+
+		/*
+		 * Can we drop any filters yet?  If we were waiting for a relation to
+		 * be created or extended, it is now OK to access blocks in the covered
+		 * range.
+		 */
+		XLogPrefetcherCompleteFilters(prefetcher, replayed_up_to);
+
+		/*
+		 * All IO initiated by earlier WAL is now completed.  This might
+		 * trigger further prefetching.
+		 */
+		lrq_complete_lsn(prefetcher->streaming_read, replayed_up_to);
+	}
 
-	/* If there's nothing queued yet, then start prefetching. */
+	/*
+	 * If there's nothing queued yet, then start prefetching to cause at least
+	 * one record to be queued.
+	 */
 	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+	{
+		Assert(lrq_inflight(prefetcher->streaming_read) == 0);
+		Assert(lrq_completed(prefetcher->streaming_read) == 0);
 		lrq_prefetch(prefetcher->streaming_read);
+	}
 
 	/* Read the next record. */
 	record = XLogNextRecord(prefetcher->reader, errmsg);
@@ -1039,12 +1067,13 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	Assert(record == prefetcher->reader->record);
 
 	/*
-	 * Can we drop any prefetch filters yet, given the record we're about to
-	 * return?  This assumes that any records with earlier LSNs have been
-	 * replayed, so if we were waiting for a relation to be created or
-	 * extended, it is now OK to access blocks in the covered range.
+	 * If maintenance_io_concurrency is set very low, we might have started
+	 * prefetching some but not all of the blocks referenced in the record
+	 * we're about to return.  Forget about the rest of the blocks in this
+	 * record by dropping the prefetcher's reference to it.
 	 */
-	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+	if (record == prefetcher->record)
+		prefetcher->record = NULL;
 
 	/*
 	 * See if it's time to compute some statistics, because enough WAL has
@@ -1053,13 +1082,6 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
 		XLogPrefetcherComputeStats(prefetcher);
 
-	/*
-	 * The caller is about to replay this record, so we can now report that
-	 * all IO initiated because of early WAL must be finished. This may
-	 * trigger more readahead.
-	 */
-	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
-
 	Assert(record == prefetcher->reader->record);
 
 	return &record->header;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..9adf7ab143 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -907,6 +907,10 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+		report_invalid_record(state,
+							  "aborted record pointer at %X/%X: missing contrecord at %X/%X",
+							  LSN_FORMAT_ARGS(state->abortedRecPtr),
+							  LSN_FORMAT_ARGS(state->missingContrecPtr));
 	}
 
 	if (decoded && decoded->oversized)
-- 
2.35.1
