On Wed, Dec 25, 2024 at 12:00:59PM +0900, Michael Paquier wrote:
> All of them refer to an infinite loop reachable in the startup process
> when we read an incorrect incomplete record just after a failover or
> when a WAL receiver restarts.  Not sure which way is best in order to
> fix all of them in one shot, I am still considering what the possible
> options at hand are here.

And please find attached a set of two patches addressing the issue.
I've reviewed the thread again and, after pondering it quite a bit,
I am on board with making the early header check in the page read
callback less generic by restricting it to a WAL segment boundary, as
that's the original case we care about.  This also matches the fact
that WAL receivers are spawned with an LSN starting at the beginning
of a segment, as suggested by Horiguchi-san.  As far as I can see, I
misunderstood things a bit last week about the backpatch point, sorry
about that.
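
Concretely, 0002 gates the early XLogReaderValidatePageHeader() call
in XLogPageRead() behind an extra condition of the shape
(targetPagePtr % wal_segment_size) == 0, so that only page reads
starting at a segment boundary go through this early validation.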

The new regression test is something I really want to keep around, to
be able to emulate the infinite loop, but I got annoyed with the
amount of duplication between the new test and the existing
039_end_of_wal.pl, as they share the same ideas.  I have refactored
039_end_of_wal.pl, moving its routines that generate WAL messages,
advance WAL and write WAL into Cluster.pm, then reused the refactored
result in the new test.  I've also detected a race condition in the
new test: we should wait for the logs of the two standbys to report
their "invalid magic number" failures, as the test could fail when
scanning their logs if replay is slow.  wait_for_log() does this job,
and is available down to v13.
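
For reference, this is what the waits look like in the new test
(trimmed from 043_no_contrecord_switch.pl in 0002, where $segment_name
and $offset are computed from the LSN of the zeroed page):

    my $log_size1 = -s $standby1->logfile;
    my $log_size2 = -s $standby2->logfile;
    $standby1->start;
    $standby2->start;
    my $pattern =
      qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
    $standby1->wait_for_log($pattern, $log_size1);
    $standby2->wait_for_log($pattern, $log_size2);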

Barring objections, I'd like to get the main issue in 0002 fixed at
the beginning of next week.  I am also planning to do the refactoring
bits tomorrow or so, as these are rather straightforward.  The names
of the new routines in Cluster.pm are inherited from the existing
recovery test.  Perhaps they could use better names, but 0001 does
not look that bad to me either.
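
As an example of the resulting API, a scenario in 039_end_of_wal.pl
ends up reading like this after 0001:

    $node->emit_message(0);
    $end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
    $node->stop('immediate');
    $node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE, build_record_header(23));
    $log_size = -s $node->logfile;
    $node->start;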
--
Michael
From 4f25575aafdb8695e9884310df4af4fcb7ee256f Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Wed, 15 Jan 2025 12:23:53 +0900
Subject: [PATCH 1/2] Move WAL write and advance routines into
 PostgreSQL::Test::Cluster

These facilities were originally in 039_end_of_wal.pl, and a follow-up
bug fix with a TAP test doing similar WAL manipulations requires them.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 146 +++++++++++++
 src/test/recovery/t/039_end_of_wal.pl    | 257 ++++++-----------------
 2 files changed, 212 insertions(+), 191 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 08b89a4cdf..30b7ab515d 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2954,6 +2954,152 @@ sub lsn
 
 =pod
 
+=item $node->write_wal($tli, $lsn, $segment_size, $data)
+
+Write some arbitrary data in WAL for the given segment at $lsn (in bytes).
+This should be called while the cluster is not running.
+
+Returns the path of the WAL segment written to.
+
+=cut
+
+sub write_wal
+{
+	my ($self, $tli, $lsn, $segment_size, $data) = @_;
+
+	# Calculate segment number and offset position in segment based on the
+	# input LSN.
+	my $segment = $lsn / $segment_size;
+	my $offset = $lsn % $segment_size;
+	my $path =
+	  sprintf("%s/pg_wal/%08X%08X%08X", $self->data_dir, $tli, 0, $segment);
+
+	open my $fh, "+<:raw", $path or die "could not open WAL segment $path";
+	seek($fh, $offset, SEEK_SET) or die "could not seek WAL segment $path";
+	print $fh $data;
+	close $fh;
+
+	return $path;
+}
+
+=pod
+
+=item $node->emit_message($size)
+
+Emit a WAL record of arbitrary size.
+
+Returns the end LSN of the record inserted, in bytes.
+
+=cut
+
+sub emit_message
+{
+	my ($self, $size) = @_;
+
+	return int(
+		$self->safe_psql(
+			'postgres',
+			"SELECT pg_logical_emit_message(true, '', repeat('a', $size)) - '0/0'"
+		));
+}
+
+
+# Private routine returning the current insert LSN of a node, in bytes.
+# Used by routines in charge of advancing WAL.
+sub _get_insert_lsn
+{
+	my ($self) = @_;
+	return int(
+		$self->safe_psql(
+			'postgres', "SELECT pg_current_wal_insert_lsn() - '0/0'"));
+}
+
+=pod
+
+=item $node->advance_wal_out_of_record_splitting_zone($wal_block_size)
+
+This advances the WAL insert position until it is far enough from the
+end of the current WAL page that a couple of small records could be
+inserted without being split.  It emits a few records of a fixed size
+while the insert position is still within the splitting zone of the page.
+
+Returns the end LSN up to which WAL has advanced.
+
+=cut
+
+sub advance_wal_out_of_record_splitting_zone
+{
+	my ($self, $wal_block_size) = @_;
+
+	my $page_threshold = $wal_block_size / 4;
+	my $end_lsn = $self->_get_insert_lsn();
+	my $page_offset = $end_lsn % $wal_block_size;
+	while ($page_offset >= $wal_block_size - $page_threshold)
+	{
+		$self->emit_message($page_threshold);
+		$end_lsn = $self->_get_insert_lsn();
+		$page_offset = $end_lsn % $wal_block_size;
+	}
+	return $end_lsn;
+}
+
+=pod
+
+=item $node->advance_wal_to_record_splitting_zone($wal_block_size)
+
+Advance WAL so close to the end of a page that an XLogRecordHeader would not
+fit on it.
+
+Returns the end LSN up to which WAL has advanced.
+
+=cut
+
+sub advance_wal_to_record_splitting_zone
+{
+	my ($self, $wal_block_size) = @_;
+
+	# Size of record header.
+	my $RECORD_HEADER_SIZE = 24;
+
+	my $end_lsn = $self->_get_insert_lsn();
+	my $page_offset = $end_lsn % $wal_block_size;
+
+	# Get fairly close to the end of a page in big steps
+	while ($page_offset <= $wal_block_size - 512)
+	{
+		$self->emit_message($wal_block_size - $page_offset - 256);
+		$end_lsn = $self->_get_insert_lsn();
+		$page_offset = $end_lsn % $wal_block_size;
+	}
+
+	# Calibrate our message size so that we can get closer 8 bytes at
+	# a time.
+	my $message_size = $wal_block_size - 80;
+	while ($page_offset <= $wal_block_size - $RECORD_HEADER_SIZE)
+	{
+		$self->emit_message($message_size);
+		$end_lsn = $self->_get_insert_lsn();
+
+		my $old_offset = $page_offset;
+		$page_offset = $end_lsn % $wal_block_size;
+
+		# Adjust the message size until it causes 8 bytes changes in
+		# offset, enough to be able to split a record header.
+		my $delta = $page_offset - $old_offset;
+		if ($delta > 8)
+		{
+			$message_size -= 8;
+		}
+		elsif ($delta <= 0)
+		{
+			$message_size += 8;
+		}
+	}
+	return $end_lsn;
+}
+
+=pod
+
 =item $node->check_extension(extension_name)
 
 Scan pg_available_extensions to check that an extension is available in an
diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index ab751eb271..9c8d5290ac 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -20,9 +20,6 @@ use integer;    # causes / operator to use integer math
 # we need to know the endianness to do that.
 my $BIG_ENDIAN = pack("L", 0x12345678) eq pack("N", 0x12345678);
 
-# Header size of record header.
-my $RECORD_HEADER_SIZE = 24;
-
 # Fields retrieved from code headers.
 my @scan_result = scan_server_header('access/xlog_internal.h',
 	'#define\s+XLOG_PAGE_MAGIC\s+(\w+)');
@@ -36,64 +33,6 @@ my $WAL_SEGMENT_SIZE;
 my $WAL_BLOCK_SIZE;
 my $TLI;
 
-# Build path of a WAL segment.
-sub wal_segment_path
-{
-	my $node = shift;
-	my $tli = shift;
-	my $segment = shift;
-	my $wal_path =
-	  sprintf("%s/pg_wal/%08X%08X%08X", $node->data_dir, $tli, 0, $segment);
-	return $wal_path;
-}
-
-# Calculate from a LSN (in bytes) its segment number and its offset.
-sub lsn_to_segment_and_offset
-{
-	my $lsn = shift;
-	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
-}
-
-# Write some arbitrary data in WAL for the given segment at LSN.
-# This should be called while the cluster is not running.
-sub write_wal
-{
-	my $node = shift;
-	my $tli = shift;
-	my $lsn = shift;
-	my $data = shift;
-
-	my ($segment, $offset) = lsn_to_segment_and_offset($lsn);
-	my $path = wal_segment_path($node, $tli, $segment);
-
-	open my $fh, "+<:raw", $path or die;
-	seek($fh, $offset, SEEK_SET) or die;
-	print $fh $data;
-	close $fh;
-}
-
-# Emit a WAL record of arbitrary size.  Returns the end LSN of the
-# record inserted, in bytes.
-sub emit_message
-{
-	my $node = shift;
-	my $size = shift;
-	return int(
-		$node->safe_psql(
-			'postgres',
-			"SELECT pg_logical_emit_message(true, '', repeat('a', $size)) - '0/0'"
-		));
-}
-
-# Get the current insert LSN of a node, in bytes.
-sub get_insert_lsn
-{
-	my $node = shift;
-	return int(
-		$node->safe_psql(
-			'postgres', "SELECT pg_current_wal_insert_lsn() - '0/0'"));
-}
-
 # Get GUC value, converted to an int.
 sub get_int_setting
 {
@@ -167,69 +106,6 @@ sub build_page_header
 		$BIG_ENDIAN ? $xlp_pageaddr : 0, $xlp_rem_len);
 }
 
-# Make sure we are far away enough from the end of a page that we could insert
-# a couple of small records.  This inserts a few records of a fixed size, until
-# the threshold gets close enough to the end of the WAL page inserting records
-# to.
-sub advance_out_of_record_splitting_zone
-{
-	my $node = shift;
-
-	my $page_threshold = $WAL_BLOCK_SIZE / 4;
-	my $end_lsn = get_insert_lsn($node);
-	my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
-	{
-		emit_message($node, $page_threshold);
-		$end_lsn = get_insert_lsn($node);
-		$page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	}
-	return $end_lsn;
-}
-
-# Advance so close to the end of a page that an XLogRecordHeader would not
-# fit on it.
-sub advance_to_record_splitting_zone
-{
-	my $node = shift;
-
-	my $end_lsn = get_insert_lsn($node);
-	my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-
-	# Get fairly close to the end of a page in big steps
-	while ($page_offset <= $WAL_BLOCK_SIZE - 512)
-	{
-		emit_message($node, $WAL_BLOCK_SIZE - $page_offset - 256);
-		$end_lsn = get_insert_lsn($node);
-		$page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	}
-
-	# Calibrate our message size so that we can get closer 8 bytes at
-	# a time.
-	my $message_size = $WAL_BLOCK_SIZE - 80;
-	while ($page_offset <= $WAL_BLOCK_SIZE - $RECORD_HEADER_SIZE)
-	{
-		emit_message($node, $message_size);
-		$end_lsn = get_insert_lsn($node);
-
-		my $old_offset = $page_offset;
-		$page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-
-		# Adjust the message size until it causes 8 bytes changes in
-		# offset, enough to be able to split a record header.
-		my $delta = $page_offset - $old_offset;
-		if ($delta > 8)
-		{
-			$message_size -= 8;
-		}
-		elsif ($delta <= 0)
-		{
-			$message_size += 8;
-		}
-	}
-	return $end_lsn;
-}
-
 # Setup a new node.  The configuration chosen here minimizes the number
 # of arbitrary records that could get generated in a cluster.  Enlarging
 # checkpoint_timeout avoids noise with checkpoint activity.  wal_level
@@ -265,8 +141,8 @@ note "Single-page end-of-WAL detection";
 ###########################################################################
 
 # xl_tot_len is 0 (a common case, we hit trailing zeroes).
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
@@ -276,10 +152,10 @@ ok( $node->log_contains(
 	"xl_tot_len zero");
 
 # xl_tot_len is < 24 (presumably recycled garbage).
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn, build_record_header(23));
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE, build_record_header(23));
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -289,10 +165,10 @@ ok( $node->log_contains(
 
 # xl_tot_len in final position, not big enough to span into a new page but
 # also not eligible for regular record header validation
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn, build_record_header(1));
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE, build_record_header(1));
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -301,10 +177,10 @@ ok( $node->log_contains(
 	"xl_tot_len short at end-of-page");
 
 # Need more pages, but xl_prev check fails first.
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
 $log_size = -s $node->logfile;
 $node->start;
@@ -313,12 +189,12 @@ ok( $node->log_contains(
 	"xl_prev bad");
 
 # xl_crc check fails.
-emit_message($node, 0);
-advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 10);
+$node->emit_message(0);
+$node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(10);
 $node->stop('immediate');
 # Corrupt a byte in that record, breaking its CRC.
-write_wal($node, $TLI, $end_lsn - 8, '!');
+$node->write_wal($TLI, $end_lsn - 8, $WAL_SEGMENT_SIZE, '!');
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -335,11 +211,11 @@ note "Multi-page end-of-WAL detection, header is not split";
 # written to WAL.
 
 # Good xl_prev, we hit zero page next (zero magic).
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
 $log_size = -s $node->logfile;
 $node->start;
@@ -347,16 +223,14 @@ ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
 	"xlp_magic zero");
 
 # Good xl_prev, we hit garbage page next (bad magic).
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
-write_wal(
-	$node, $TLI,
-	start_of_next_page($end_lsn),
-	build_page_header(0xcafe, 0, 1, 0));
+$node->write_wal($TLI, start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE, build_page_header(0xcafe, 0, 1, 0));
 $log_size = -s $node->logfile;
 $node->start;
 ok($node->log_contains("invalid magic number CAFE .* LSN .*", $log_size),
@@ -364,16 +238,14 @@ ok($node->log_contains("invalid magic number CAFE .* LSN .*", $log_size),
 
 # Good xl_prev, we hit typical recycled page (good xlp_magic, bad
 # xlp_pageaddr).
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
-write_wal(
-	$node, $TLI,
-	start_of_next_page($end_lsn),
-	build_page_header($XLP_PAGE_MAGIC, 0, 1, 0xbaaaaaad));
+$node->write_wal($TLI, start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE, build_page_header($XLP_PAGE_MAGIC, 0, 1, 0xbaaaaaad));
 $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
@@ -381,15 +253,16 @@ ok( $node->log_contains(
 	"xlp_pageaddr bad");
 
 # Good xl_prev, xlp_magic, xlp_pageaddr, but bogus xlp_info.
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 42, $prev_lsn));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, 0x1234, 1, start_of_next_page($end_lsn)));
 $log_size = -s $node->logfile;
@@ -399,15 +272,14 @@ ok($node->log_contains("invalid info bits 1234 in .*, LSN .*,", $log_size),
 
 # Good xl_prev, xlp_magic, xlp_pageaddr, but xlp_info doesn't mention
 # continuation record.
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 42, $prev_lsn));
-write_wal(
-	$node, $TLI,
-	start_of_next_page($end_lsn),
+$node->write_wal($TLI, start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header($XLP_PAGE_MAGIC, 0, 1, start_of_next_page($end_lsn)));
 $log_size = -s $node->logfile;
 $node->start;
@@ -416,15 +288,16 @@ ok($node->log_contains("there is no contrecord flag at .*", $log_size),
 
 # Good xl_prev, xlp_magic, xlp_pageaddr, xlp_info but xlp_rem_len doesn't add
 # up.
-emit_message($node, 0);
-$prev_lsn = advance_out_of_record_splitting_zone($node);
-$end_lsn = emit_message($node, 0);
+$node->emit_message(0);
+$prev_lsn = $node->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+$end_lsn = $node->emit_message(0);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 42, $prev_lsn));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, $XLP_FIRST_IS_CONTRECORD,
 		1, start_of_next_page($end_lsn),
@@ -441,10 +314,10 @@ note "Multi-page, but header is split, so page checks are done first";
 ###########################################################################
 
 # xl_prev is bad and xl_tot_len is too big, but we'll check xlp_magic first.
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
 $log_size = -s $node->logfile;
 $node->start;
@@ -452,14 +325,15 @@ ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
 	"xlp_magic zero (split record header)");
 
 # And we'll also check xlp_pageaddr before any header checks.
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, $XLP_FIRST_IS_CONTRECORD, 1, 0xbaaaaaad));
 $log_size = -s $node->logfile;
@@ -470,14 +344,15 @@ ok( $node->log_contains(
 
 # We'll also discover that xlp_rem_len doesn't add up before any
 # header checks,
-emit_message($node, 0);
-$end_lsn = advance_to_record_splitting_zone($node);
+$node->emit_message(0);
+$end_lsn = $node->advance_wal_to_record_splitting_zone($WAL_BLOCK_SIZE);
 $node->stop('immediate');
-write_wal($node, $TLI, $end_lsn,
+$node->write_wal($TLI, $end_lsn, $WAL_SEGMENT_SIZE,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
-write_wal(
-	$node, $TLI,
+$node->write_wal(
+	$TLI,
 	start_of_next_page($end_lsn),
+	$WAL_SEGMENT_SIZE,
 	build_page_header(
 		$XLP_PAGE_MAGIC, $XLP_FIRST_IS_CONTRECORD,
 		1, start_of_next_page($end_lsn),
-- 
2.47.1

From d5fab0db45577f5b6a77b1119847fa9832b174c9 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Wed, 15 Jan 2025 13:26:47 +0900
Subject: [PATCH 2/2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources.  As written, the check was too
generic, applying to any target LSN.  What really matters is to make sure
that the page header is checked when attempting to read an LSN at the
boundary of a segment, to handle the case of a continuation record that
spans multiple pages over multiple segments, since WAL receivers request
WAL from the beginning of a segment when they are spawned.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk.  This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bbe2eea20..cf2b007806 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3438,12 +3438,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3465,6 +3465,7 @@ retry:
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 56c464abb7..0428704dbf 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..b5983a6c62
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity.  checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans over two pages, generating a
+# continuation record.
+$primary->emit_message(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_message($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about the empty page when trying
+# to assemble the record that spans over two pages, so wait for these
+# failures in their logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from the just-promoted standby1 (it also
+# pulls WAL files from the archive).  It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
-- 
2.47.1
