On Thu, May 12, 2022 at 4:57 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Thu, May 12, 2022 at 3:13 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > error running SQL: 'psql:<stdin>:1: ERROR:  source database
> > "conflict_db_template" is being accessed by other users
> > DETAIL:  There is 1 other session using the database.'
>
> Oh, for this one I think it may just be that the autovacuum worker
> with PID 23757 took longer to exit than the 5 seconds
> CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.
In this test, autovacuum_naptime is set to 1s (per Andres, autovacuum was implicated when he first saw the problem with pg_upgrade, hence the desire to crank it up).  That's not actually necessary: commenting out the active line in ProcessBarrierSmgrRelease() (i.e. disabling the fix the test is there to exercise) shows that the test still reliably reproduces data corruption without it.  Let's just take that setting out.

As for skink failing, the timeout was hard-coded at 300s for the whole test, but apparently that wasn't enough under valgrind.  Let's use the standard PostgreSQL::Test::Utils::timeout_default (usually 180s), but reset it for each query we send.  See attached.
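To show the idea outside the test harness, here's a rough standalone sketch of the same per-query timer pattern.  It is not part of the patch: the 'cat' child, the literal 180, and the echo-matching loop are stand-ins (the real script drives psql, uses $PostgreSQL::Test::Utils::timeout_default, and does the pumping inside send_query_and_wait); the point is just that one IPC::Run timer is reset and restarted before each interaction, so the budget applies per query rather than to the whole run.

    use strict;
    use warnings;
    use IPC::Run qw(start timer);

    # One long-lived timer attached to the whole session, as in the test script.
    my $timeout = timer(180);    # stand-in for timeout_default

    my ($stdin, $stdout) = ('', '');
    my $h = start([ 'cat' ], '<', \$stdin, '>', \$stdout, $timeout);

    for my $query ("select 1;\n", "select 2;\n")
    {
        # Give each query its own full allowance instead of one global budget.
        $timeout->reset();
        $timeout->start();

        $stdin .= $query;
        while (1)
        {
            # 'cat' simply echoes the input back, standing in for psql output.
            last if $stdout =~ /\Q$query\E/;
            die "timed out waiting for query" if $timeout->is_expired;
            last unless $h->pump_nb();
        }
    }

    $h->finish;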
From ffcb61004fd06c9b2db56c7fe045c7c726d67a72 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 13 May 2022 13:40:03 +1200
Subject: [PATCH] Fix slow animal timeouts in 032_relfilenode_reuse.pl.

Per BF animal chipmunk: CREATE DATABASE could apparently fail due to an
AV process being in the template database and not quitting fast enough
for the 5 second timeout in CountOtherDBBackends().  The test script had
autovacuum_naptime=1s to encourage more activity that opens fds, but
that wasn't strictly necessary for this test.  Take it out.

Per BF animal skink: the test had a global 300s timeout, but apparently
that was not enough under valgrind.  Use the standard timeout
PostgreSQL::Test::Utils::timeout_default, but reset it for each query we
run.

Discussion: 
---
 src/test/recovery/t/032_relfilenode_reuse.pl | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/src/test/recovery/t/032_relfilenode_reuse.pl b/src/test/recovery/t/032_relfilenode_reuse.pl
index ac9340b7dd..5a6a759aa5 100644
--- a/src/test/recovery/t/032_relfilenode_reuse.pl
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -14,7 +14,6 @@ log_connections=on
 # to avoid "repairing" corruption
 full_page_writes=off
 log_min_messages=debug2
-autovacuum_naptime=1s
 shared_buffers=1MB
 ]);
 $node_primary->start;
@@ -28,11 +27,8 @@ $node_standby->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
 $node_standby->start;
 
-# To avoid hanging while expecting some specific input from a psql
-# instance being driven by us, add a timeout high enough that it
-# should never trigger even on very slow machines, unless something
-# is really wrong.
-my $psql_timeout = IPC::Run::timer(300);
+# We'll reset this timeout for each individual query we run.
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
 
 my %psql_primary = (stdin => '', stdout => '', stderr => '');
 $psql_primary{run} = IPC::Run::start(
@@ -202,6 +198,9 @@ sub send_query_and_wait
 	my ($psql, $query, $untl) = @_;
 	my $ret;
 
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
 	# send query
 	$$psql{stdin} .= $query;
 	$$psql{stdin} .= "\n";
-- 
2.36.0