On 2024-07-25 Th 6:33 PM, Andrew Dunstan wrote:
On 2024-07-25 Th 5:14 PM, Tom Lane wrote:
I wrote:
I'm confused by crake's buildfarm logs. AFAICS it is not running
recovery-check at all in most of the runs; at least there is no
mention of that step, for example here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02
Oh, I see it: the log file that is called recovery-check in a
failing run is called misc-check if successful. That seems
mighty bizarre, and it's not how my own animals behave.
Something weird about the meson code path, perhaps?
Yes, it was discussed some time ago. As suggested by Andres, we run
the meson test suite more or less all together (in effect like "make
checkworld" but without the main regression suite, which is run on its
own first), rather than in the separate (and serialized) way we do
with the configure/make tests. That results in significant speedup. If
the tests fail we report the failure as happening at the first failure
we encounter, which is possibly less than ideal, but I haven't got a
better idea.
Anyway, in this successful run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check
here are some salient test timings:
1/297 postgresql:pg_upgrade / pg_upgrade/001_basic
OK 0.18s 9 subtests passed
2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots
OK 15.95s 12 subtests passed
3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription
OK 16.29s 14 subtests passed
17/297 postgresql:isolation / isolation/isolation
OK 71.60s 119 subtests passed
41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
OK 169.13s 18 subtests passed
140/297 postgresql:initdb / initdb/001_initdb
OK 41.34s 52 subtests passed
170/297 postgresql:recovery / recovery/027_stream_regress
OK 469.49s 9 subtests passed
while in the next, failing run
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check
the same tests took:
1/297 postgresql:pg_upgrade / pg_upgrade/001_basic
OK 0.22s 9 subtests passed
2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots
OK 56.62s 12 subtests passed
3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription
OK 71.92s 14 subtests passed
21/297 postgresql:isolation / isolation/isolation
OK 299.12s 119 subtests passed
31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
OK 344.42s 18 subtests passed
159/297 postgresql:initdb / initdb/001_initdb
OK 344.46s 52 subtests passed
162/297 postgresql:recovery / recovery/027_stream_regress
ERROR 840.84s exit status 29
Based on this, it seems fairly likely that crake is simply timing out
as a consequence of intermittent heavy background activity.
The latest failure is this:
Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on
primary
[16:40:29.481](208.545s) # poll_query_until timed out executing this query:
# SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at
/home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line
103.
Maybe it's a case where the system is overloaded, I dunno. I wouldn't bet my
house on it. Pretty much nothing else runs on this machine.
I have added a mild throttle to the buildfarm config so it doesn't try
to run every branch at once. Maybe I also need to bring down the
number or meson jobs too? But I suspect there's something deeper.
Prior to the failure of this test 10 days ago it hadn't failed in a
very long time. The OS was upgraded a month ago. Eight or so days ago
I changed PG_TEST_EXTRA. I can't think of anything else.
There seem to be a bunch of recent failures, and not just on crake, and
not just on HEAD:
<https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com