Hi,

On 2021-10-13 11:03:39 -0700, Andres Freund wrote:
> It seems that 026_overwrite_contrecord.pl test often fails under valgrind. I
> first thought that related failures on skink were due to me migrating the
> animal to a new host (and then running out of space due to a mistake in ccache
> config). But it happened again after I fixed those, and I just reproduced the
> issue locally.
>
> It's a bit odd that we didn't start to see these failures immediately, but
> only in the last few days. I'd blame skink being migrated to a new home,
> except that I can see the issue locally.
>
> FWIW, the way skink runs all postgres instances through valgrind is by
> replacing the postgres binary. Here's my local version of that:
>
> andres@awork3:~/build/postgres/dev-assert/vpath$ cat tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
> #!/bin/bash
>
> exec /usr/bin/valgrind \
>     --quiet \
>     --error-exitcode=128 \
>     --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
>     --trace-children=yes --track-origins=yes --read-var-info=no \
>     --leak-check=no \
>     --run-libc-freeres=no \
>     --vgdb=no \
>     --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
>     /home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig \
>     "$@"
>
> make -C src/test/recovery/ check PROVE_FLAGS='-v' PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
> ...
>
> not ok 1 - 000000010000000000000002 differs from 000000010000000000000002
>
> #   Failed test '000000010000000000000002 differs from 000000010000000000000002'
I added LSNs to the error message:

not ok 1 - 000000010000000000000002 (0/2002350) differs from 000000010000000000000002 (0/2099600)

It appears that the problem is that, in between the determination of
rows_walsize and the insertion, the insert LSN moved into the next segment
independently of the insertion itself, presumably due to autovacuum/analyze
or such.

<retries, with log_autovacuum_min_duration=0, log_min_duration_statement=0>

2021-10-13 11:23:25.659 PDT [1491455] 026_overwrite_contrecord.pl LOG:  statement: insert into filler select * from generate_series(1, 1000)
2021-10-13 11:23:26.467 PDT [1491455] 026_overwrite_contrecord.pl LOG:  duration: 861.112 ms
2021-10-13 11:23:27.055 PDT [1491458] 026_overwrite_contrecord.pl LOG:  statement: select pg_current_wal_insert_lsn() - '0/0'
2021-10-13 11:23:27.357 PDT [1491458] 026_overwrite_contrecord.pl LOG:  duration: 347.888 ms
2021-10-13 11:23:27.980 PDT [1491461] 026_overwrite_contrecord.pl LOG:  statement: WITH setting AS (
            SELECT setting::int AS wal_segsize
              FROM pg_settings WHERE name = 'wal_segment_size'
        )
        INSERT INTO filler
        SELECT g FROM setting,
          generate_series(1, 1000 * (wal_segsize - ((pg_current_wal_insert_lsn() - '0/0') % wal_segsize)) / 64232) g
2021-10-13 11:24:25.173 PDT [1491550] LOG:  automatic analyze of table "postgres.public.filler"
        avg read rate: 3.185 MB/s, avg write rate: 0.039 MB/s
        buffer usage: 96 hits, 731 misses, 9 dirtied
        system usage: CPU: user: 1.79 s, system: 0.00 s, elapsed: 1.81 s
2021-10-13 11:24:26.255 PDT [1491461] 026_overwrite_contrecord.pl LOG:  duration: 58360.811 ms
2021-10-13 11:24:26.857 PDT [1491557] 026_overwrite_contrecord.pl LOG:  statement: SELECT pg_current_wal_insert_lsn()
2021-10-13 11:24:27.120 PDT [1491557] 026_overwrite_contrecord.pl LOG:  duration: 300.562 ms

Hm. I guess we could disable autovacuum. But that's not a great solution;
there might be extra WAL due to catalog access etc. too.

Seems like it might be worth doing the "filling" of the segment with a loop
in a DO block instead, where the end condition is being within some distance
of the end of the segment? With plenty of headroom?

Another thing: filling a segment by inserting lots of very tiny rows is
pretty expensive. Can't we use something a bit wider? Perhaps even
pg_logical_emit_message()?

Greetings,

Andres Freund
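
PS: Something along these lines is roughly what I have in mind - entirely
untested, and the 1kB message size, the 'filler' prefix and the 100kB of
headroom are just placeholder values:

do $$
declare
    -- wal_segment_size is reported in bytes in pg_settings
    segsize int := (select setting::int from pg_settings
                    where name = 'wal_segment_size');
begin
    -- keep emitting ~1kB logical messages until we are within 100kB of
    -- the end of the current WAL segment (both sizes are placeholders)
    while segsize - ((pg_current_wal_insert_lsn() - '0/0') % segsize) > 100 * 1024
    loop
        perform pg_logical_emit_message(false, 'filler', repeat('x', 1024));
    end loop;
end
$$;

That would also avoid relying on a hardcoded estimate of the WAL volume
generated per inserted row.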