Re: has_wal_read_bug

Noah Misch Sat, 29 Oct 2022 20:17:05 -0700

On Tue, May 17, 2022 at 12:15:35AM -0700, Noah Misch wrote:
> On Tue, May 17, 2022 at 11:50:51AM +1200, Thomas Munro wrote:
> > 027_stream_regress.pl has:
> > 
> > if (PostgreSQL::Test::Utils::has_wal_read_bug)
> > {
> >     # We'd prefer to use Test::More->builder->todo_start, but the bug causes
> >     # this test file to die(), not merely to fail.
> >     plan skip_all => 'filesystem bug';
> > }
> > 
> > Is the die() referenced there the one from the system_or_bail() call
> > that commit a096813b got rid of?
> 
> No, it was the 'croak "timed out waiting for catchup"',
> e.g. 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-01-25%2016%3A56%3A26
> 
> > Here's a failure in 031_recovery_conflict.pl that smells like
> > concurrent pread() corruption:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-16%2015%3A45%3A54
> > 
> > 2022-05-16 18:10:33.375 CEST [52106:1] LOG:  started streaming WAL
> > from primary at 0/3000000 on timeline 1
> > 2022-05-16 18:10:33.621 CEST [52105:5] LOG:  incorrect resource
> > manager data checksum in record at 0/338FDC8
> > 2022-05-16 18:10:33.622 CEST [52106:2] FATAL:  terminating walreceiver
> > process due to administrator command
> 
> Agreed.  Here, too:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2022-05-09%2015%3A46%3A03
> 
> > Presumably we also need the has_wal_read_bug kludge in all these new
> > tests that use replication.
> 
> That is an option.  One alternative is to reconfigure those three animals to
> remove --enable-tap-tests.  Another alternative is to make the build system
> skip all files of all TAP suites on affected systems.  Handling this on a
> file-by-file basis seemed reasonable to me when only two files had failed that
> way.  Now, five files have failed.  We have wait_for_catchup calls in
> fifty-one files, and I wouldn't have chosen the has_wal_read_bug approach if I
> had expected fifty-one files to eventually call it.  I could tolerate it,
> though.


Squashing another test that failed multiple times (commit a9f8ca6) led me to
think of another option, attached.  When wait_for_catchup() fails under
has_wal_read_bug(), end the suite with an abrupt success.  Thoughts?

Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    If wait_for_catchup fails under has_wal_read_bug, skip balance of test.
    
    Test files should now ignore has_wal_read_bug() so long as
    wait_for_catchup() is their only known way of reaching the bug.  That's
    at least five files today, a number expected to grow over time.  This
    commit removes skip logic from three.  By doing so, systems having the
    bug regain the ability to catch other kinds of defects via those three
    tests.  The other two, 002_databases.pl and 031_recovery_conflict.pl,
    have been unprotected.  Back-patch to v15, where done_testing() first
    became our standard.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/contrib/bloom/t/001_wal.pl b/contrib/bloom/t/001_wal.pl
index f7f16dc..9416a64 100644
--- a/contrib/bloom/t/001_wal.pl
+++ b/contrib/bloom/t/001_wal.pl
@@ -8,13 +8,6 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-if (PostgreSQL::Test::Utils::has_wal_read_bug)
-{
-       # We'd prefer to use Test::More->builder->todo_start, but the bug causes
-       # this test file to die(), not merely to fail.
-       plan skip_all => 'filesystem bug';
-}
-
 my $node_primary;
 my $node_standby;
 
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
b/src/test/perl/PostgreSQL/Test/Cluster.pm
index d80134b..d1017b7 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2606,8 +2606,23 @@ sub wait_for_catchup
        my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 
'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-       $self->poll_query_until('postgres', $query)
-         or croak "timed out waiting for catchup";
+       if (!$self->poll_query_until('postgres', $query))
+       {
+               if (PostgreSQL::Test::Utils::has_wal_read_bug)
+               {
+                       # Mimic having skipped the test file.  If >0 tests have 
run, the
+                       # harness won't accept a skip; otherwise, it won't 
accept
+                       # done_testing().  Force a nonzero count by running one 
test.
+                       ok(1, 'dummy test before skip for filesystem bug');
+                       carp "skip rest: timed out waiting for catchup & 
filesystem bug";
+                       done_testing();
+                       exit 0;
+               }
+               else
+               {
+                       croak "timed out waiting for catchup";
+               }
+       }
        print "done\n";
        return;
 }
diff --git a/src/test/recovery/t/027_stream_regress.pl 
b/src/test/recovery/t/027_stream_regress.pl
index 7982ac0..69d6ddf 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -6,13 +6,6 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use File::Basename;
 
-if (PostgreSQL::Test::Utils::has_wal_read_bug)
-{
-       # We'd prefer to use Test::More->builder->todo_start, but the bug causes
-       # this test file to die(), not merely to fail.
-       plan skip_all => 'filesystem bug';
-}
-
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
diff --git a/src/test/recovery/t/032_relfilenode_reuse.pl 
b/src/test/recovery/t/032_relfilenode_reuse.pl
index 499ec34..92ec510 100644
--- a/src/test/recovery/t/032_relfilenode_reuse.pl
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -5,13 +5,6 @@ use PostgreSQL::Test::Utils;
 use Test::More;
 use File::Basename;
 
-if (PostgreSQL::Test::Utils::has_wal_read_bug)
-{
-       # We'd prefer to use Test::More->builder->todo_start, but the bug causes
-       # this test file to die(), not merely to fail.
-       plan skip_all => 'filesystem bug';
-}
-
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);

Re: has_wal_read_bug

Reply via email to