Thomas Munro <thomas.mu...@gmail.com> writes:
> Hmm, one thing I'm still unclear on: did this problem really start
> with 6051857fc/ed52c3707?  My initial email in this thread lists
> similar failures going back further, doesn't it?  (And what's tern
> doing mixed up in this mess?)
Well, those earlier ones may be commit_ts failures, but a lot of them
contain different-looking symptoms, e.g. pg_ctl failures.  tern's
failure at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-17+10%3A57%3A49

does look similar, but we can see in its log that the standby *did*
notice the primary disconnection and then reconnect:

2021-07-17 16:29:08.248 UTC [17498380:2] LOG:  replication terminated by primary server
2021-07-17 16:29:08.248 UTC [17498380:3] DETAIL:  End of WAL reached on timeline 1 at 0/30378F8.
2021-07-17 16:29:08.248 UTC [17498380:4] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2021-07-17 16:29:08.248 UTC [25166230:5] LOG:  invalid record length at 0/30378F8: wanted 24, got 0
2021-07-17 16:29:08.350 UTC [16318578:1] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
		This probably means the server terminated abnormally
		before or while processing the request.
2021-07-17 16:29:36.369 UTC [7077918:1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
2021-07-17 16:29:36.380 UTC [11338028:1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
...
2021-07-17 16:29:36.881 UTC [17367092:1] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

So I'm not sure what happened there, but it's not an instance of this
problem.  One thing that looks a bit suspicious is this in the
primary's log:

2021-07-17 16:26:47.832 UTC [12386550:1] LOG:  using stale statistics instead of current ones because stats collector is not responding

which makes me wonder if the timeout is down to out-of-date pg_stats
data.  The loop in 002_standby.pl doesn't appear to depend on the
stats collector:

my $primary_lsn =
  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
$standby->poll_query_until('postgres',
	qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
  or die "standby never caught up";

but maybe I'm missing the connection.

Apropos of that, it's worth noting that wait_for_catchup *is* dependent
on up-to-date stats, and here's a recent run where it sure looks like
the timeout cause is an AWOL stats collector:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-01-10%2004%3A51%3A34

I wonder if we should refactor wait_for_catchup to probe the standby
directly instead of relying on the upstream's view.

			regards, tom lane
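
PS: for illustration only, here is a rough, untested sketch of what
probing the standby directly might look like, lifted almost verbatim
from the 002_standby.pl loop quoted above.  The helper name and its
placement are made up; it only assumes the safe_psql and
poll_query_until methods already used there:

# Hypothetical helper: wait until the standby has replayed everything
# the primary has written so far, without consulting the primary's
# stats-dependent view of replication progress.
sub wait_for_replay_catchup
{
	my ($standby, $primary) = @_;

	# Ask the primary for its current WAL write position ...
	my $primary_lsn =
	  $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');

	# ... then poll the standby until replay has reached that LSN.
	$standby->poll_query_until('postgres',
		qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
	  or die "standby never caught up";
}

Callers could then do something like wait_for_replay_catchup($standby,
$primary) in place of the upstream-side wait_for_catchup call, and not
care whether the stats collector is responding.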