At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier <mich...@paquier.xyz> wrote in
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it received, in late July.)
> 
> The interesting bits are in 019_replslot_limit_primary3.log.  In a
> failed run, I can see that we immediately get a process termination,
> as follows:
> 2021-09-07 07:52:53.402 JST [22890] LOG:  terminating process 23082 to 
> release replication slot "rep3"
> 2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL:  terminating connection 
> due to administrator command
> 2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT:  START_REPLICATION 
> SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG:  statement: 
> SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> 
> In a successful run, the pattern is different:
> 2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL:  terminating connection 
> due to administrator command
> 2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT:  START_REPLICATION 
> SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 09:27:39.832 JST [57092] LOG:  invalidating slot "rep3" because 
> its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoint complete: wrote
> 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled;
> write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
> 2021-09-07 09:27:39.833 JST [57092] LOG:  checkpoints are occurring too 
> frequently (0 seconds apart)
> 2021-09-07 09:27:39.833 JST [57092] HINT:  Consider increasing the 
> configuration parameter "max_wal_size".
> 2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG:  statement: 
> SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> 
> The slot invalidation is missed because the checkpoint that completes
> does not do the work it should, no?  There is a completed checkpoint
> before we query pg_replication_slots, and the buildfarm shows the
> same thing.
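
(For reference, the check in question is the polling loop that runs right
after advance_wal() and waits for the slot to be reported as "lost".  It
looks roughly like the following; this is a paraphrase of the shape of
019_replslot_limit.pl, not a verbatim excerpt:

	my $max_attempts = 180;
	my $invalidated  = 0;
	while ($max_attempts-- >= 0)
	{
		# Expect the invalidated slot to eventually be reported as "lost".
		if ($node_primary3->safe_psql('postgres',
				"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'"
			) eq 'lost')
		{
			$invalidated = 1;
			last;
		}
		sleep 1;
	}

If the checkpoint never invalidates the slot, that state never appears.)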

It seems like the "kill 'STOP'" in the script did not actually suspend the
processes before the test advanced the WAL.  The attached patch uses the
'ps' command to check for that, since I couldn't come up with a way to do
the same thing in Perl.

I'm still not sure it works as expected, though.  (I'm imagining the case
where the reported state changes before the process has actually
stopped..)
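
If parsing the ps output turns out to be fragile, another (Linux-only)
idea would be to read the state field from /proc/<pid>/stat directly.
For example, something like this sketch, which is not part of the
attached patch:

	# Sketch only: read the process state from /proc/<pid>/stat on Linux
	# instead of parsing ps output.  Returns true once the process shows
	# state 'T' (stopped).
	sub proc_is_stopped
	{
		my ($pid) = @_;

		open my $fh, '<', "/proc/$pid/stat" or return 0;
		my $stat = <$fh>;
		close $fh;

		# Format is "pid (comm) state ...".  comm may itself contain
		# spaces and parentheses, so take the field after the last ')'.
		my ($state) = $stat =~ /^\d+\s+\(.*\)\s+(\S)/;
		return (defined $state && $state eq 'T') ? 1 : 0;
	}

The loop in wait_to_stop could then call this for each PID instead of
spawning ps, with the same caveat as above about trusting what the kernel
reports at the moment we look.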

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index e065c5c008..7388900008 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -346,6 +346,8 @@ $logstart = get_log_size($node_primary3);
 # freeze walsender and walreceiver. Slot will still be active, but walreceiver
 # won't get anything anymore.
 kill 'STOP', $senderpid, $receiverpid;
+wait_to_stop($senderpid, $receiverpid);
+
 advance_wal($node_primary3, 2);
 
 my $max_attempts = 180;
@@ -426,3 +428,30 @@ sub find_in_log
 
 	return $log =~ m/$pat/;
 }
+
+sub wait_to_stop
+{
+	my (@pids) = @_;
+	my $max_attempts = 180;
+
+	# Haven't found the means to do the same on Windows
+	return if $TestLib::windows_os;
+
+	while ($max_attempts-- >= 0)
+	{
+		my ($all_stopped) = 1;
+
+		foreach my $pid (@pids)
+		{
+			if (`ps -p $pid -o pid,state | tail -1` !~ /^ *\d+ +T/)
+			{
+				$all_stopped = 0;
+				last;
+			}
+		}
+
+		last if ($all_stopped);
+
+		sleep 1;
+	}
+}
