While looking at recent failures in the new 028_pitr_timelines.pl
recovery test, I noticed that the buildfarm has also had a few
failures in the recoveryCheck phase that predate that test, in
019_replslot_limit.pl.
For example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00 usr 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
and:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'
In the latter case, it looks like two walsenders are active at once,
which confuses the test. I'm not sure what's happening in the first
case, but at a quick glance it looks like some kind of race condition.
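
To illustrate why a second walsender trips the pid check, here is a
minimal sketch. It is in Python rather than the test's Perl, the
function name is made up for illustration, and it assumes the query
output reaches the check as one string with rows separated by
newlines, as the log above suggests. The pattern is the one shown in
the failure message, without the Perl-specific (?^:...) wrapper.

```python
import re

# Digits-only pattern, as reported in the failure: '(?^:^[0-9]+$)'.
# No multiline flag, so ^ and $ anchor to the whole string.
PID_RE = re.compile(r"^[0-9]+$")


def looks_like_single_pid(query_output: str) -> bool:
    """Hypothetical helper: True if the output is exactly one run of digits."""
    return PID_RE.search(query_output) is not None


one_walsender = "3682154"
two_walsenders = "3682154\n3682136"  # the value from the serinus log

print(looks_like_single_pid(one_walsender))
print(looks_like_single_pid(two_walsenders))
```

With one row the check passes; with two rows the embedded newline
makes the match fail, which is consistent with the "doesn't match"
message above.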
Has anyone looked into these yet?
- Heikki