14.03.2024 23:56, Tom Lane wrote:
> Thomas Munro <thomas.mu...@gmail.com> writes:
>> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclus...@gmail.com> wrote:
>>> Could it be that the timeout (360 sec?) is just not enough for the test
>>> under the current (changed due to switch to meson) conditions?
>> But you're right that under meson the test takes a lot longer, I guess
>> due to increased concurrency:
> What it seems to be is highly variable.  Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
> else is happening on that machine.  Also, this is probably not
> helping anything:
>
>                     'extra_config' => {
>                                        ...
>                                        'fsync = on'
>
> I would suggest turning that off and raising wait_timeout a good
> deal, and then we'll see if calliphoridae gets any more stable.

I could reproduce similar failures with PG_TEST_EXTRA=wal_consistency_checking
only, running 5 tests in parallel on a slowed-down VM, so that the test
duration increased to ~1900 seconds, but perhaps that buildfarm machine
has a different bottleneck (I/O?), or its concurrent workload is not as
uniform as in my experiments.
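
A rough sketch of such a run, for reference (the build directory, suite
selection, and degree of parallelism here are assumptions, not necessarily
the exact setup used):

# run the recovery TAP tests with up to 5 of them in parallel
export PG_TEST_EXTRA=wal_consistency_checking
meson test -C build --suite recovery --num-processes 5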

Meanwhile, I've analyzed failed test logs from the buildfarm and calculated
the percentage of WAL replayed on the standby before the timeout.
For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2022%3A36%3A40
standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG:  recovery restart point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL:  Last completed transaction was at log time 2024-03-18 22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly

primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG:  statement: GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG:  statement: ALTER VIEW my_property_secure SET (security_barrier=false);
# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG:  statement: DROP TABLESPACE regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754'
\set t1 '22:44:13.226'
\set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) /
       extract(epoch from (:'t1'::time - :'t0'::time));
~52%
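
The same arithmetic can be wrapped into a small shell helper to process the
other failures below; the helper name is made up, and it only assumes a
running server that psql can connect to in order to evaluate the expression:

frac() {   # frac <first statement time> <last replayed transaction time> <last statement time>
    psql -X -At -c "select round(100 * extract(epoch from ('$2'::time - '$1'::time))
                                     / extract(epoch from ('$3'::time - '$1'::time))) || '%'"
}
frac '22:38:23.754' '22:41:26.647756' '22:44:13.226'   # prints 52%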

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2018%3A58%3A58
~48%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2016%3A41%3A13
~43%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2015%3A47%3A09
~36%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-15%2011%3A24%3A38
~87%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes, switching
fsync from off to on increases the overall test duration roughly 3x in my
environment.
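
For anyone who wants to measure that locally, one possible way, assuming the
TEMP_CONFIG mechanism (extra settings appended to each test cluster's
postgresql.conf by the TAP framework) and a meson build tree in ./build (the
exact test name may differ):

echo 'fsync = off' > /tmp/extra.conf
TEMP_CONFIG=/tmp/extra.conf meson test -C build recovery/027_stream_regress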

Best regards,
Alexander

