Andrew Dunstan <and...@dunslane.net> writes:
> On 12/30/21 15:01, Thomas Munro wrote:
>> There's a wait for replay that is open coded (instead of using the
>> wait_for_catchup() routine), and sometimes the second of two such
>> waits at line 51 (in master) times out after 3 minutes with "standby
>> never caught up".  It's happening on three particular Windows boxes,
>> but once also happened on the AIX box "tern".
> FYI, drongo and fairywren are run on the same AWS/EC2 Windows Server
> 2019 instance. Nothing else runs on it.

I spent a little time looking into this just now.  There are similar
failures in both 002_standby.pl and 003_standby_2.pl, which is
unsurprising because there are essentially-identical test sequences
in both.  What I've realized is that the issue is triggered by this
sequence:

$standby->start;
...
$primary->restart;
$primary->safe_psql('postgres', 'checkpoint');
my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
	qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

(the failing poll_query_until is at line 51 in 002_standby.pl, or
line 37 in 003_standby_2.pl).  That is, we have forced a primary
restart since the standby first connected to the primary, and now we
have to wait for the standby to reconnect and catch up.  *These two
tests seem to be the only TAP tests that do that.*  So I think there's
not really anything specific to commit_ts testing involved; it's just
a dearth of primary restarts elsewhere.

Looking at the logs in the failing cases, there's no evidence that the
standby has even detected the primary's disconnection, which explains
why it hasn't attempted to reconnect.  For example, in the most recent
HEAD failure,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2022-01-03%2018%3A04%3A41

the standby reports successful connection:

2022-01-03 18:58:04.920 UTC [179700:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

(which we can also see in the primary's log), but after that there's
no log traffic at all except the test script's vain checks of
pg_last_wal_replay_lsn().  In the same animal's immediately preceding
successful run,

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2022-01-03%2015%3A04%3A41&stg=module-commit_ts-check

we see:

2022-01-03 15:59:24.186 UTC [176664:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2022-01-03 15:59:25.003 UTC [176664:2] LOG: replication terminated by primary server
2022-01-03 15:59:25.003 UTC [176664:3] DETAIL: End of WAL reached on timeline 1 at 0/3030CB8.
2022-01-03 15:59:25.003 UTC [176664:4] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
		This probably means the server terminated abnormally
		before or while processing the request.
	no COPY in progress
2022-01-03 15:59:25.005 UTC [177092:5] LOG: invalid record length at 0/3030CB8: wanted 24, got 0
...
2022-01-03 15:59:25.564 UTC [177580:1] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

So for some reason, on these machines, detection of walsender-initiated
connection close is unreliable ... or maybe the walsender didn't close
the connection, but is somehow still hanging around?  Don't have much
idea where to dig beyond that, but maybe someone else will.  I wonder
in particular if this could be related to our recent discussions about
whether to use shutdown(2) on Windows --- could we need to do the
equivalent of 6051857fc/ed52c3707 on walsender connections?

			regards, tom lane
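PS: for reference, a minimal sketch of the pattern I'm alluding to,
i.e. doing a graceful shutdown() before closing a Windows socket so
the peer reliably sees EOF.  This is illustrative only --- the helper
name is made up, and it is not the actual code of 6051857fc/ed52c3707
nor anything in the walsender; it just shows the general Winsock idiom:

#include <winsock2.h>		/* link with ws2_32 */

/*
 * Hypothetical helper: close a socket so that the peer's recv()
 * returns 0 (clean EOF) rather than possibly missing the disconnect
 * or seeing a reset.
 */
static void
close_socket_gracefully(SOCKET sock)
{
	char		discard[1024];

	/* Send a FIN; any queued outgoing data is still delivered first. */
	if (shutdown(sock, SD_SEND) == SOCKET_ERROR)
	{
		closesocket(sock);
		return;
	}

	/*
	 * Drain anything the peer still sends, so that closing with unread
	 * data in the buffer doesn't turn into an abortive reset.  A real
	 * implementation would bound this loop with a timeout.
	 */
	while (recv(sock, discard, sizeof(discard), 0) > 0)
		;

	closesocket(sock);
}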