Andrew Dunstan <and...@dunslane.net> writes:
> On 12/30/21 15:01, Thomas Munro wrote:
>> There's a wait for replay that is open coded (instead of using the
>> wait_for_catchup() routine), and sometimes the second of two such
>> waits at line 51 (in master) times out after 3 minutes with "standby
>> never caught up".  It's happening on three particular Windows boxes,
>> but once also happened on the AIX box "tern".
> FYI, drongo and fairywren are run on the same AWS/EC2 Windows Server
> 2019 instance. Nothing else runs on it.

I spent a little time looking into this just now.  There are similar
failures in both 002_standby.pl and 003_standby_2.pl, which is
unsurprising because there are essentially-identical test sequences
in both.  What I've realized is that the issue is triggered by this
sequence:

$standby->start;
...
$primary->restart;
$primary->safe_psql('postgres', 'checkpoint');
my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
	qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

(the failing poll_query_until is at line 51 in 002_standby.pl, or
line 37 in 003_standby_2.pl).  That is, we have forced a primary
restart since the standby first connected to the primary, and now we
have to wait for the standby to reconnect and catch up.  *These two
tests seem to be the only TAP tests that do that.*  So I think there's
not really anything specific to commit_ts testing involved; it's just
a dearth of primary restarts elsewhere.

Looking at the logs in the failing cases, there's no evidence that the
standby has even detected the primary's disconnection, which explains
why it hasn't attempted to reconnect.  For example, in the most recent
HEAD failure,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-03%2018%3A04%3A41

the standby reports successful connection:

2022-01-03 18:58:04.920 UTC [179700:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

(which we can also see in the primary's log), but after that there's
no log traffic at all except the test script's vain checks of
pg_last_wal_replay_lsn().  In the same animal's immediately preceding
successful run,

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2022-01-03%2015%3A04%3A41&stg=module-commit_ts-check

we see:

2022-01-03 15:59:24.186 UTC [176664:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2022-01-03 15:59:25.003 UTC [176664:2] LOG: replication terminated by primary server
2022-01-03 15:59:25.003 UTC [176664:3] DETAIL: End of WAL reached on timeline 1 at 0/3030CB8.
2022-01-03 15:59:25.003 UTC [176664:4] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
		This probably means the server terminated abnormally
		before or while processing the request.
	no COPY in progress
2022-01-03 15:59:25.005 UTC [177092:5] LOG: invalid record length at 0/3030CB8: wanted 24, got 0
...
2022-01-03 15:59:25.564 UTC [177580:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

So for some reason, on these machines, detection of walsender-initiated
connection close is unreliable ... or maybe the walsender didn't close
the connection, but is somehow still hanging around?  Don't have much
idea where to dig beyond that, but maybe someone else will.  I wonder
in particular if this could be related to our recent discussions about
whether to use shutdown(2) on Windows --- could we need to do the
equivalent of 6051857fc/ed52c3707 on walsender connections?

			regards, tom lane
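PS: for reference, a minimal sketch of the pattern I'm alluding to,
i.e. doing a graceful shutdown() before closing a Windows socket so
the peer reliably sees EOF.  This is illustrative only --- the helper
name is made up, and it is not the actual code of 6051857fc/ed52c3707
nor anything in the walsender; it just shows the general Winsock idiom:

#include <winsock2.h>		/* link with ws2_32 */

/*
 * Hypothetical helper: close a socket so that the peer's recv()
 * returns 0 (clean EOF) rather than possibly missing the disconnect
 * or seeing a reset.
 */
static void
close_socket_gracefully(SOCKET sock)
{
	char		discard[1024];

	/* Send a FIN; any queued outgoing data is still delivered first. */
	if (shutdown(sock, SD_SEND) == SOCKET_ERROR)
	{
		closesocket(sock);
		return;
	}

	/*
	 * Drain anything the peer still sends, so that closing with unread
	 * data in the buffer doesn't turn into an abortive reset.  A real
	 * implementation would bound this loop with a timeout.
	 */
	while (recv(sock, discard, sizeof(discard), 0) > 0)
		;

	closesocket(sock);
}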