Thomas Munro <thomas.mu...@gmail.com> writes:
> Hmm, one thing I'm still unclear on: did this problem really start
> with 6051857fc/ed52c3707?  My initial email in this thread lists
> similar failures going back further, doesn't it?  (And what's tern
> doing mixed up in this mess?)
Well, those earlier ones may be commit_ts failures, but a lot of them
contain different-looking symptoms, e.g. pg_ctl failures.  tern's
failure at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17+10%3A57%3A49

does look similar, but we can see in its log that the standby *did*
notice the primary disconnection and then reconnect:

2021-07-17 16:29:08.248 UTC [17498380:2] LOG:  replication terminated by primary server
2021-07-17 16:29:08.248 UTC [17498380:3] DETAIL:  End of WAL reached on timeline 1 at 0/30378F8.
2021-07-17 16:29:08.248 UTC [17498380:4] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2021-07-17 16:29:08.248 UTC [25166230:5] LOG:  invalid record length at 0/30378F8: wanted 24, got 0
2021-07-17 16:29:08.350 UTC [16318578:1] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
		This probably means the server terminated abnormally
		before or while processing the request.
2021-07-17 16:29:36.369 UTC [7077918:1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
2021-07-17 16:29:36.380 UTC [11338028:1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
...
2021-07-17 16:29:36.881 UTC [17367092:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

So I'm not sure what happened there, but it's not an instance of this
problem.  One thing that looks a bit suspicious is this in the
primary's log:

2021-07-17 16:26:47.832 UTC [12386550:1] LOG:  using stale statistics instead of current ones because stats collector is not responding

which makes me wonder if the timeout is down to out-of-date pg_stats
data.  The loop in 002_standby.pl doesn't appear to depend on the
stats collector:

my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
	qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

but maybe I'm missing the connection.

Apropos of that, it's worth noting that wait_for_catchup *is* dependent
on up-to-date stats, and here's a recent run where it sure looks like
the timeout cause is an AWOL stats collector:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-01-10%2004%3A51%3A34

I wonder if we should refactor wait_for_catchup to probe the standby
directly instead of relying on the upstream's view.

			regards, tom lane
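
PS: for illustration only, here is a rough, untested sketch of what
probing the standby directly might look like, lifted almost verbatim
from the 002_standby.pl loop quoted above.  The helper name and its
placement are made up; it only assumes the safe_psql and
poll_query_until methods already used there:

# Hypothetical helper: wait until the standby has replayed everything
# the primary has written so far, without consulting the primary's
# stats-dependent view of replication progress.
sub wait_for_replay_catchup
{
	my ($standby, $primary) = @_;

	# Ask the primary for its current WAL write position ...
	my $primary_lsn =
	  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');

	# ... then poll the standby until replay has reached that LSN.
	$standby->poll_query_until('postgres',
		qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
	  or die "standby never caught up";
}

Callers could then do something like wait_for_replay_catchup($standby,
$primary) in place of the upstream-side wait_for_catchup call, and not
care whether the stats collector is responding.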