Hello Andres,
So it looks like the issue is resolved, but there is another, apparently
performance-related, issue: deadlock-parallel test failures.
I reduced test concurrency a bit; I hadn't quite realized how the buildfarm
config and meson test concurrency interact. There's still something off
with the frequency of fsyncs during replay, but perhaps that doesn't qualify
as a bug.
It looks like that set of animals is still suffering from extreme load.
Please take a look at today's failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-06-04%2002%3A44%3A19
1/1 postgresql:regress-running / regress-running/regress TIMEOUT 3000.06s killed by signal 15 SIGTERM
inst/logfile ends with:
2024-06-04 03:39:24.664 UTC [3905755][client backend][5/1787:16793] ERROR: column "c2" of relation "test_add_column" already exists
2024-06-04 03:39:24.664 UTC [3905755][client backend][5/1787:16793] STATEMENT:
ALTER TABLE test_add_column
ADD COLUMN c2 integer, -- fail because c2 already exists
ADD COLUMN c3 integer primary key;
2024-06-04 03:39:30.815 UTC [3905755][client backend][5/0:0] LOG: could not send data to client: Broken pipe
2024-06-04 03:39:30.816 UTC [3905755][client backend][5/0:0] FATAL: connection to client lost
"ALTER TABLE test_add_column" is from the alter_table test, which executed
in the group 21 out of 25.
Another similar failure:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-24%2002%3A22%3A26&stg=install-check-C
1/1 postgresql:regress-running / regress-running/regress TIMEOUT 3000.06s killed by signal 15 SIGTERM
inst/logfile ends with:
2024-05-24 03:18:51.469 UTC [998579][client backend][7/1792:16786] ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
2024-05-24 03:18:51.469 UTC [998579][client backend][7/1792:16786] STATEMENT:
ALTER TABLE logged1 SET UNLOGGED;
(This is the alter_table test again.)
I've analyzed the duration of the regress-running/regress test for the 167
most recent runs on skink and found that the average duration is 1595 seconds,
but there were some much longer runs:
2979.39:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-01%2004%3A15%3A29&stg=install-check-C
2932.86:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-04-28%2018%3A57%3A37&stg=install-check-C
2881.78:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2024-05-15%2020%3A53%3A30&stg=install-check-C
So it seems that the default timeout is not large enough for these
conditions. (I counted 10 such timeout failures out of those 167 test runs.)
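
In case someone wants to reproduce that calculation, here is a minimal
sketch of how the stage logs can be processed once they are downloaded
locally (the "skink-logs" directory name and the regex are my assumptions;
the result-line format is taken from the meson test output quoted above):

import re
import statistics
from pathlib import Path

# Matches the per-test result line emitted by "meson test", e.g.
#   1/1 postgresql:regress-running / regress-running/regress  TIMEOUT  3000.06s
RESULT_RE = re.compile(r"regress-running/regress\s+(OK|TIMEOUT|FAIL)\s+([0-9.]+)s")

durations = []
timeouts = 0
for log in sorted(Path("skink-logs").glob("*.log")):  # assumed local copies of the stage logs
    m = RESULT_RE.search(log.read_text(errors="replace"))
    if not m:
        continue
    durations.append(float(m.group(2)))
    if m.group(1) == "TIMEOUT":
        timeouts += 1

if durations:
    print(f"runs analyzed:    {len(durations)}")
    print(f"average duration: {statistics.mean(durations):.0f}s")
    print(f"longest runs:     {sorted(durations, reverse=True)[:3]}")
    print(f"timeouts:         {timeouts}")
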
Also, 027_stream_regress still fails for the same reason:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50
(It's remarkable that these two animals failed at the same time.)
Best regards,
Alexander