On Wed, Jan 15, 2025 at 10:35:42AM +0100, Alexander Kukushkin wrote:
> Thank you for picking it up. I briefly looked at both patches. The
> actual fix in XLogPageRead() looks good to me.
>
> I also agree with the suggested refactoring, where there is certainly
> some room for improvement - the $WAL_SEGMENT_SIZE and $WAL_BLOCK_SIZE
> variables and the get_int_setting() and start_of_page() funcs are
> still duplicated in both test files.
>
> Maybe we can have something like the following in Cluster.pm? It would
> allow us to further simplify the tests and reduce code duplication.
Yeah, I was looking at that, but disliked this option a bit compared
to the routines doing the WAL manipulations, which are more complex,
with their own ways of getting close to page limits, because I feel
that we should be smarter with the interfaces of these routines and
offer more options. For example, we have other things scanning control
file data, like system_identifier in 040_pg_createsubscriber.pl, so we
could have an SQL wrapper around the pg_control_* functions that takes
a custom field, like we do for "sub lsn" in Cluster.pm. Same comment
for the pg_settings queries, which are not limited to what we do here.
The wrapper around integers is useful, of course, but we could make a
refactored routine apply a cast after checking the setting's unit
internally, or something like that.

The routines for the start page position would not fit within
Cluster.pm, as they do not depend on a $node. Perhaps Utils.pm? Here
again, living with this small duplication felt OK within the scope of
this bug fix. I'm of course open to tuning all that, though my primary
goal is to wrap up the fix :D

I've applied the first refactoring bits down to v13 (see for example a
s/emit_message/emit_wal/ tweak for consistency, with more comment
tweaks). Attached are patches for each branch for the bug fix, which
I'm still testing and looking at more. The readability of
043_no_contrecord_switch.pl looks rather OK here.
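To illustrate the kind of interfaces I have in mind, here is a rough,
uncommitted sketch. Both helper names are placeholders, and checking
pg_settings.vartype is only one way of doing the internal check before
the cast:

# Hypothetical Cluster.pm method: fetch a GUC as an integer, letting
# the server apply the cast.  pg_settings.setting is already
# normalized to the GUC's base unit, and an empty result means the
# GUC is not of integer type.
sub get_int_setting
{
	my ($self, $name) = @_;
	return $self->safe_psql(
		'postgres',
		qq{SELECT setting::int FROM pg_settings
		   WHERE name = '$name' AND vartype = 'integer'});
}

# Hypothetical Utils.pm routine: round an LSN (as a byte position)
# down to the start of its WAL page.  The block size is passed in,
# so no $node is needed.
sub start_of_wal_page
{
	my ($lsn, $wal_block_size) = @_;
	return $lsn & ~($wal_block_size - 1);
}

With something like that, a test would only need
my $WAL_BLOCK_SIZE = $primary->get_int_setting('wal_block_size');
and most of the duplicated helpers would go away.
--
Michael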
From 8167332c82691ff9f28286705b501ac5f3b4794f Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on
a standby, to be able to handle the case of continuation records that
need to be read across two different sources. As written, the check
was too generic, applying to any target LSN. What really matters is to
make sure that the page header is checked when attempting to read an
LSN at the boundary of a segment, to handle the case of a continuation
record that spans across multiple pages over multiple segments, as
when WAL receivers are spawned they request WAL from the beginning of
a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk. This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bbe2eea20..cf2b007806 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3438,12 +3438,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3465,6 +3465,7 @@
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 56c464abb7..0428704dbf 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..a473d3e7d3
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
--
2.47.1
From a58c2c2d4a5cdd2f5c5cb1d2a8b301b7bddb2df7 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on
a standby, to be able to handle the case of continuation records that
need to be read across two different sources. As written, the check
was too generic, applying to any target LSN. What really matters is to
make sure that the page header is checked when attempting to read an
LSN at the boundary of a segment, to handle the case of a continuation
record that spans across multiple pages over multiple segments, as
when WAL receivers are spawned they request WAL from the beginning of
a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk. This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b45b833172..a94d4b4b78 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3436,12 +3436,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3463,6 +3463,7 @@
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..7623cb1fe6 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..a473d3e7d3
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
--
2.47.1
From 1bbffa37813cfc39ba323d255ee07ea9f87ac189 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on
a standby, to be able to handle the case of continuation records that
need to be read across two different sources. As written, the check
was too generic, applying to any target LSN. What really matters is to
make sure that the page header is checked when attempting to read an
LSN at the boundary of a segment, to handle the case of a continuation
record that spans across multiple pages over multiple segments, as
when WAL receivers are spawned they request WAL from the beginning of
a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk. This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 3 files changed, 161 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3c7fb913e7..6d48cb7e84 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3386,12 +3386,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3413,6 +3413,7 @@
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index f5c1021a5e..24d0f5b0c1 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -45,6 +45,7 @@ tests += {
       't/037_invalid_database.pl',
       't/039_end_of_wal.pl',
       't/043_vacuum_horizon_floor.pl',
+      't/043_no_contrecord_switch.pl',
     ],
  },
 }
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..a473d3e7d3
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
--
2.47.1
From 234a62d98a8205b6e1e3fe0c2f9a7e61cf08678d Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 16 Jan 2025 08:05:09 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on
a standby, to be able to handle the case of continuation records that
need to be read across two different sources. As written, the check
was too generic, applying to any target LSN. What really matters is to
make sure that the page header is checked when attempting to read an
LSN at the boundary of a segment, to handle the case of a continuation
record that spans across multiple pages over multiple segments, as
when WAL receivers are spawned they request WAL from the beginning of
a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk. This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlogrecovery.c     |  13 +-
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 2 files changed, 160 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1503b21671..bbc19df192 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3361,12 +3361,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3388,6 +3388,7 @@
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..14c3ac3d0e
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_catchup('standby2', 'replay', $standby1->lsn('flush'));
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
--
2.47.1
From 7422ad5a702773867fbab7d3becaeca1f460488c Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 16 Jan 2025 08:43:13 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on
a standby, to be able to handle the case of continuation records that
need to be read across two different sources. As written, the check
was too generic, applying to any target LSN. What really matters is to
make sure that the page header is checked when attempting to read an
LSN at the boundary of a segment, to handle the case of a continuation
record that spans across multiple pages over multiple segments, as
when WAL receivers are spawned they request WAL from the beginning of
a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk. This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlog.c             |  24 +-
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 2 files changed, 165 insertions(+), 12 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e2cb88f3..0334c2d3bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12562,18 +12562,17 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
-	 * recycled WAL segment present in pg_wal, with garbage contents, however.
-	 * We would read the first page from the local WAL segment, but when
-	 * reading the second page, we would read the bogus, recycled, WAL
-	 * segment. If we didn't catch that case here, we would never recover,
-	 * because ReadRecord() would retry reading the whole record from the
-	 * beginning.
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
+	 * WAL segment present in pg_wal, with garbage contents, however. We would
+	 * read the first page from the local WAL segment, but when reading the
+	 * second page, we would read the bogus, recycled, WAL segment. If we
+	 * didn't catch that case here, we would never recover, because
+	 * ReadRecord() would retry reading the whole record from the beginning.
 	 *
 	 * Of course, this only catches errors in the page header, which is what
 	 * happens in the case of a recycled WAL segment. Other kinds of errors or
@@ -12589,6 +12588,7 @@
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..14c3ac3d0e
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_catchup('standby2', 'replay', $standby1->lsn('flush'));
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
--
2.47.1
From 1bc8a5c47c969321f5882e5e759e94427e068901 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 16 Jan 2025 08:43:13 +0900
Subject: [PATCH v2] Fix incorrect header check for continuation WAL records
 on standby

XLogPageRead() checks immediately for an invalid WAL record header on
a standby, to be able to handle the case of continuation records that
need to be read across two different sources. As written, the check
was too generic, applying to any target LSN. What really matters is to
make sure that the page header is checked when attempting to read an
LSN at the boundary of a segment, to handle the case of a continuation
record that spans across multiple pages over multiple segments, as
when WAL receivers are spawned they request WAL from the beginning of
a segment.

This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.

A regression test is added, able to reproduce the problem, where the
contents of a continuation record are overwritten with junk. This is
inspired by 039_end_of_wal.pl.
---
 src/backend/access/transam/xlog.c             |  13 +-
 .../recovery/t/043_no_contrecord_switch.pl    | 153 ++++++++++++++++++
 2 files changed, 160 insertions(+), 6 deletions(-)
 create mode 100644 src/test/recovery/t/043_no_contrecord_switch.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3277a1cbe6..769f4a202f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12336,12 +12336,12 @@ retry:
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled in the master. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the master. There is a recycled
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * WAL segment present in pg_wal, with garbage contents, however. We would
 	 * read the first page from the local WAL segment, but when reading the
 	 * second page, we would read the bogus, recycled, WAL segment. If we
@@ -12362,6 +12362,7 @@
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*
diff --git a/src/test/recovery/t/043_no_contrecord_switch.pl b/src/test/recovery/t/043_no_contrecord_switch.pl
new file mode 100644
index 0000000000..14c3ac3d0e
--- /dev/null
+++ b/src/test/recovery/t/043_no_contrecord_switch.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about empty page when trying to
+# assemble the record that spans over two pages, so wait for these in their
+# logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_catchup('standby2', 'replay', $standby1->lsn('flush'));
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
--
2.47.1