Hello, hackers,

I believe there is a small problem in the 039_end_of_wal.pl's "xl_tot_len zero" test. It supposes that after immediate shutdown the server, upon startup recovery, should always produce a message matching "invalid record length at .*: wanted 24, got 0". However, if the circumstances are just right and we happened to hit exactly on the edge of WAL page, then the message on startup recovery would be "invalid magic number 0000 in log segment .*, offset .*". The test does not take that into account.

Now, reproducing this is somewhat tricky, because exact position in WAL at the test time depends on what exactly initdb did, and that not only differs in different major verisons, but also depends on e.g. username length, locales available, and, perhaps, more. Even though originally this problem was found "in the wild" on one particular system on one particular code branch, I've written small helper patch to make reproduction on master easier, see 0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.

This patch adds single emit_message of (hopefully) the right size to make sure we hit end of WAL block right by the time we call $node->stop('immediate') in "xl_tot_len zero" test. With this patch, "xl_tot_len zero" test fails every time because the server writes "invalid magic number 0000 in log segment" while the test still only expects "invalid record length at .*: wanted 24, got 0". If course, this 0001 patch is *not* meant to be committed, but only as an issue reproduction helper.

I can think of two possible fixes:

1. Update advance_out_of_record_splitting_zone to also avoid stopping at
   exactly the block end:

     my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+ while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || $page_offset <= $SizeOfXLogShortPHD)
     {
see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch

We need to compare with $SizeOfXLogShortPHD (and not with zero) because at that point, even though we didn't actually write out new WAL page
yet, it's header is already in place in memory and taken in account
for LSN reporting.

2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic
   number 0000 in WAL segment" message as well:

 $node->start;
 ok( $node->log_contains(
+        "invalid magic number 0000 in WAL segment|" .
"invalid record length at .*: expected at least 24, got 0", $log_size
     ),
see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

I think it makes sense to backport whatever the final change would be to all branches with 039_end_of_wal (REL_12+).

Any thoughts?

Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru
From 5f12139816f6c1bc7d625ba8007aedb8f5d04a71 Mon Sep 17 00:00:00 2001
From: Anton Voloshin <a.volos...@postgrespro.ru>
Date: Thu, 18 Jan 2024 12:45:12 +0300
Subject: [PATCH 1/3] repro for 039_end_of_wal's problem with page
 end

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d0..ecf9b6089de 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -36,6 +36,8 @@ my $WAL_SEGMENT_SIZE;
 my $WAL_BLOCK_SIZE;
 my $TLI;
 
+my $SizeOfXLogShortPHD = 24; # rounded up to 8 bytes
+
 # Build path of a WAL segment.
 sub wal_segment_path
 {
@@ -258,9 +260,27 @@ my $prev_lsn;
 note "Single-page end-of-WAL detection";
 ###########################################################################
 
+my $lsn = get_insert_lsn($node);
+my $lsn_offset = $lsn % $WAL_BLOCK_SIZE;
+my $empty_msg_size = ( ( ($ENV{EMPTY_MSG_SIZE} || 51) + 7) / 8) * 8;
+my $commit_msg_size = ( (34 + 7) / 8) * 8;
+# empty logical msg and commit message take this many bytes of WAL:
+my $empty_msg_overhead = $empty_msg_size + $commit_msg_size;
+# cound overhead twice to account for two emit_message calls below:
+# we want to hit the page edge after the second call.
+my $target_lsn_offset = $WAL_BLOCK_SIZE * 2 - $empty_msg_overhead * 2;
+my $msg_size = ($target_lsn_offset - $lsn_offset) % $WAL_BLOCK_SIZE;
+# If we happen to switch to the next WAL block mid-message, reduce the message
+# by $SizeOfXLogShortPHD (minimal page header) to hit the same target.
+if ($lsn_offset + $msg_size >= $WAL_BLOCK_SIZE) { $msg_size -= $SizeOfXLogShortPHD; }
+print "QWE0: $lsn WAL_BLOCK_SIZE='$WAL_BLOCK_SIZE', lsn_offset='$lsn_offset' target_lsn_offset='$target_lsn_offset' msg_size='$msg_size'\n";
+emit_message($node, $msg_size);
+printf "QWE1: %s - after corrective msg\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 # xl_tot_len is 0 (a common case, we hit trailing zeroes).
 emit_message($node, 0);
+printf "QWE2: %s - after zero-length message\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 $end_lsn = advance_out_of_record_splitting_zone($node);
+printf "QWE3: %s - after advance_out_of_record_splitting_szone()\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
-- 
2.43.0

From eebbc755c880071194bbd02825b4c47f11f99213 Mon Sep 17 00:00:00 2001
From: Anton Voloshin <a.volos...@postgrespro.ru>
Date: Thu, 18 Jan 2024 14:47:52 +0300
Subject: [PATCH 2/3] fix "xl_tot_len zero" test: amend
 advance_out_of_record_splitting_zone

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index ecf9b6089de..3168d991fee 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -173,6 +173,8 @@ sub build_page_header
 # a couple of small records.  This inserts a few records of a fixed size, until
 # the threshold gets close enough to the end of the WAL page inserting records
 # to.
+# Also avoid the very beginning of a page: that would break a test for a
+# zero-length logical message.
 sub advance_out_of_record_splitting_zone
 {
 	my $node = shift;
@@ -180,7 +182,7 @@ sub advance_out_of_record_splitting_zone
 	my $page_threshold = $WAL_BLOCK_SIZE / 4;
 	my $end_lsn = get_insert_lsn($node);
 	my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+	while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || $page_offset <= $SizeOfXLogShortPHD)
 	{
 		emit_message($node, $page_threshold);
 		$end_lsn = get_insert_lsn($node);
-- 
2.43.0

From 2419ec6780f1e6a003b553066f83d966fee0c15b Mon Sep 17 00:00:00 2001
From: Anton Voloshin <a.volos...@postgrespro.ru>
Date: Thu, 18 Jan 2024 15:03:11 +0300
Subject: [PATCH 3/3] alt.fix for "xl_tot_len zero" test: accept
 "invalid magic number 0000 in WAL segment" as well

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index 3168d991fee..f7a6a722a78 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -287,6 +287,7 @@ $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
+		"invalid magic number 0000 in WAL segment|" .
 		"invalid record length at .*: expected at least 24, got 0", $log_size
 	),
 	"xl_tot_len zero");
-- 
2.43.0

Reply via email to