Re: Fixing WAL instability in various TAP tests

Tom Lane Tue, 28 Sep 2021 12:00:37 -0700

Mark Dilger <[email protected]> writes:
> Perhaps having the bloom index messed up answers that, though.  I think it 
> should be easy enough to get the path to the heap main table fork and the 
> bloom main index fork for both the primary and standby and do a filesystem 
> comparison as part of the wal test.  That would tell us if they differ, and 
> also if the differences are limited to just one or the other.


I think that's probably overkill, and definitely out of scope for
contrib/bloom.  If we fear that WAL replay is not reproducing the data
accurately, we should be testing for that in some more centralized place.

Anyway, I confirmed my diagnosis by adding a delay in WAL apply
(0001 below); that makes this test fall over spectacularly.
And 0002 fixes it.  So I propose to push 0002 as soon as the
v14 release freeze ends.
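
To make the race concrete, here is a toy model of it in C. It is only a
sketch, not PostgreSQL code: the struct and both function names are
hypothetical, and LSNs are reduced to plain integers. It assumes the
old hand-written query's "pg_current_wal_lsn() <= write_lsn" test only
shows that the standby has received the WAL, while wait_for_catchup
waits on the replay position unless told otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two pg_stat_replication columns that
 * matter here.  Real LSNs are pg_lsn values, not bare integers.
 */
typedef struct
{
    uint64_t write_lsn;   /* WAL the standby has received and written out */
    uint64_t replay_lsn;  /* WAL the standby has actually applied */
} StandbyProgress;

/* Models the old test's condition: primary LSN <= write_lsn. */
bool
caught_up_by_write(uint64_t primary_lsn, const StandbyProgress *s)
{
    return primary_lsn <= s->write_lsn;
}

/* Models waiting on replay instead: primary LSN <= replay_lsn. */
bool
caught_up_by_replay(uint64_t primary_lsn, const StandbyProgress *s)
{
    return primary_lsn <= s->replay_lsn;
}
```

With a standby that has written everything but is slow to apply it
(say write_lsn = 1000 and replay_lsn = 400 against a primary at 1000),
the first check already reports success while the second is still
waiting.  A query run on the standby at that moment sees stale data,
which is the window that an artificial delay in WAL apply widens.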

Should we back-patch 0002?  I'm inclined to think so.  Should
we then also back-patch enablement of the bloom test?  Less
sure about that, but I'd lean to doing so.  A test that appears
to be there but isn't actually invoked is pretty misleading.

                        regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..eecbe57aee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7370,6 +7370,9 @@ StartupXLOG(void)
 			{
 				bool		switchedTLI = false;
 
+				if (random() < INT_MAX/100)
+					pg_usleep(100000);
+
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
 					(rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||

diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index 55ad35926f..be8916a8eb 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -16,12 +16,10 @@ sub test_index_replay
 {
 	my ($test_name) = @_;
 
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
 	# Wait for standby to catch up
-	my $applname = $node_standby->name;
-	my $caughtup_query =
-	  "SELECT pg_current_wal_lsn() <= write_lsn FROM pg_stat_replication WHERE application_name = '$applname';";
-	$node_primary->poll_query_until('postgres', $caughtup_query)
-	  or die "Timed out while waiting for standby 1 to catch up";
+	$node_primary->wait_for_catchup($node_standby);
 
 	my $queries = qq(SET enable_seqscan=off;
 SET enable_bitmapscan=on;
