While looking at recent failures in the new 028_pitr_timelines.pl recovery test, I noticed that there have been a few failures in the buildfarm in the recoveryCheck phase even before that, in the 019_replslot_limit.pl test.

For example:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04

[07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00 usr 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved

and:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08

#   Failed test 'have walsender pid 3682154
# 3682136'
#   at t/019_replslot_limit.pl line 335.
#                   '3682154
# 3682136'
#     doesn't match '(?^:^[0-9]+$)'

The latter looks like there are two walsenders active, which confuses the test. I'm not sure what's happening in the first case, but at a quick glance it looks like some kind of race condition.
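
To illustrate why the second case trips the check, here is a minimal sketch in Python (not the Perl test itself). It assumes the value being matched is exactly the two-line string shown in the serinus log; the helper name looks_like_single_pid is made up for this example. The pattern is the one reported in the failure, without a multiline flag, so it only accepts a single run of digits:

```python
import re

# Pattern from the failure message: (?^:^[0-9]+$), no multiline flag.
PID_PATTERN = re.compile(r"^[0-9]+$")


def looks_like_single_pid(output: str) -> bool:
    """Return True only if the whole output is one run of digits.

    Hypothetical helper for illustration; the real test uses Test::More's like().
    """
    return PID_PATTERN.search(output) is not None


one_walsender = "3682154"
# Output copied from the serinus log: two pids, one per line.
two_walsenders = "3682154\n3682136"

print(looks_like_single_pid(one_walsender))   # True
print(looks_like_single_pid(two_walsenders))  # False
```

So any query that returns more than one walsender row would fail that check, regardless of which pid is the expected one.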

Has anyone looked into these yet?

- Heikki

