Hi,

The buildfarm identified an issue at [1] where a partial WAL record spanning two pages, left behind by an immediate shutdown, was not detected.

Consider a scenario where a WAL record is split across multiple WAL pages. If the server crashes before the entire record is written, the recovery process detects the incomplete (broken) record. In such cases, a special flag, XLP_FIRST_IS_OVERWRITE_CONTRECORD, is set in the page header to indicate this condition.

When we retrieve changes through a logical replication slot, the system reads WAL data based on the record header. In the failing instance, a total of 8133 bytes were required to reconstruct the WAL record, of which 277 bytes were expected to be on the subsequent WAL page. However, only 248 bytes were available there, because no new WAL records had been generated: no operations were performed on the system after the crash. As a result, pg_logical_slot_get_changes() keeps checking whether the current WAL flush position has reached 0x29004115 (the offset covering the required 277 bytes on the next page). Since no new WAL is being generated, the flush position stays at 0x290040F8 (the 248-byte offset), and the call waits indefinitely in the read_local_xlog_page_guts function.
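For context, the flag lives in the WAL page header (XLogPageHeaderData.xlp_info, declared in access/xlog_internal.h). A minimal sketch of the check, written for illustration only and assuming the buffer holds at least the short page header:

#include "postgres.h"

#include "access/xlog_internal.h"

/*
 * Illustration only: report whether the page starting at "page" was flagged
 * as overwriting an aborted partial continuation record.  When the flag is
 * set, the tail of the record that appeared to continue onto this page was
 * never written and will never arrive, so a reader must not wait for it.
 */
static bool
page_overwrites_contrecord(char *page)
{
	XLogPageHeader hdr = (XLogPageHeader) page;

	return (hdr->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD) != 0;
}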
Currently, the logic attempts to read the complete WAL record based on the size obtained before the crash, even though only a partial record was written. The page header is checked for the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag only after the complete record has been read in XLogDecodeNextRecord, but since that much WAL data was never written, we never get a chance to perform that check.

To address this, a more robust approach is to read the page header first, check whether the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to read the rest of the WAL record if it is not marked as an overwritten partial record. This prevents the system from waiting for WAL data that will never arrive. The attached partial_wal_record_fix.patch implements this.

I don't have a consistent way to reproduce this issue; currently it can be reproduced by running the 046_checkpoint_logical_slot test about 50 times. That test was reverted recently after it caused a few buildfarm failures, discussed at [2]; it is attached here as reverted_test_046_checkpoint_logical_slot.patch.

I have added the members who were involved in the discussion of this issue.

Thoughts?

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-06-19+23%3A47%3A08
[2] - https://www.postgresql.org/message-id/CAPpHfdsO9s5he3xHWNFtwvXtvsftu3nNz%3DPV9fdN32UOh-Z3tA%40mail.gmail.com

Regards,
Vignesh
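In plain terms, the decision the fix makes before requesting more data looks roughly like the following hypothetical helper, written for illustration only; it is not the patched code, which is in the xlogreader.c diff attached below (partial_wal_record_fix.patch), followed by the reverted test:

#include "postgres.h"

#include "access/xlog_internal.h"

/*
 * Illustration only: decide how many bytes of the next page to request when
 * reassembling a record that continues across a page boundary.  The page
 * header is inspected first; if XLP_FIRST_IS_OVERWRITE_CONTRECORD is set,
 * the missing tail of the record will never be written, so return 0 and let
 * the caller restart decoding at this page instead of waiting.
 */
static uint32
contrecord_bytes_to_request(char *page, uint32 total_len, uint32 gotlen)
{
	XLogPageHeader hdr = (XLogPageHeader) page;

	if (hdr->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
		return 0;

	/* Otherwise ask for the rest of the record, capped at one WAL page. */
	return Min(total_len - gotlen + SizeOfXLogShortPHD, XLOG_BLCKSZ);
}

The actual patch keeps the existing header checks in XLogDecodeNextRecord and simply defers the full-page read until after they have run.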
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2790ade1f91..54d6c67a713 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,14 @@ restart:
 			/* Calculate pointer to beginning of next page */
 			targetPagePtr += XLOG_BLCKSZ;
 
-			/* Wait for the next page to become available */
-			readOff = ReadPageInternal(state, targetPagePtr,
-									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
-
+			/*
+			 * Read the page header before processing the WAL record data. This
+			 * is necessary to correctly handle cases where the previous record
+			 * ended as a partial record. Attempting to read the full
+			 * record without checking the header may result in waiting
+			 * indefinitely for data that doesn't exist.
+			 */
+			readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
 			if (readOff == XLREAD_WOULDBLOCK)
 				return XLREAD_WOULDBLOCK;
 			else if (readOff < 0)
@@ -761,6 +764,16 @@ restart:
 				goto err;
 			}
 
+			/* Wait for the next page to become available */
+			readOff = ReadPageInternal(state, targetPagePtr,
+									   Min(total_len - gotlen + SizeOfXLogShortPHD,
+										   XLOG_BLCKSZ));
+
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
+				goto err;
+
 			/*
 			 * Cross-check that xlp_rem_len agrees with how much of the record
 			 * we expect there to be left.
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 00000000000..d67c5108d78
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,136 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existed WAL segment after immediate restart.
+#
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+	q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+	q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+	q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+	q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+	qr/run_xacts/,
+	q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point) so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+	q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+	qr/starting_checkpoint/,
+	q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+	q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+	qr/get_changes/,
+	q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+	'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+	q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Continue the checkpoint.
+$node->safe_psql('postgres',
+	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+
+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+	$node->safe_psql('postgres',
+		q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+	);
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();