Hi, Alexander! On Fri, Sep 13, 2024 at 3:00 PM Alexander Lakhin <exclus...@gmail.com> wrote: > > 10.08.2024 20:18, Alexander Korotkov wrote: > > On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorot...@gmail.com> > > wrote: > >> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorot...@gmail.com> > >> wrote: > >> ... > >> Here is a revised version of the patchset. I've fixed some typos, > >> identation, etc. I'm going to push this once it passes cfbot. > > The next revison of the patchset fixes uninitialized variable usage > > spotted by cfbot. > > When running check-world on a rather slow armv7 device, I came across the > 043_wal_replay_wait.pl test failure: > t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was > declared and done_testing() was not seen. > # Looks like your test exited with 29 just after 8. > > regress_log_043_wal_replay_wait contains: > ... > 01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent > data > ### Promoting node "standby" > # Running: pg_ctl -D > .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l > .../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote > waiting for server to promote.... done > server promoted > [21:58:56.637](0.268s) ok 8 - got error after standby promote > error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress > HINT: Waiting for LSN can only be executed during recovery.' > while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht > dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL > pg_wal_replay_wait('0/300D0E8');' at > .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line > 2140. > > 043_wal_replay_wait_standby.log contains: > 2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG: connection received: > host=[local] > 2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG: connection > authenticated: user="android" method=trust > (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117) > 2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG: connection authorized: > user=android database=postgres > application_name=043_wal_replay_wait.pl > 2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: > CALL pg_wal_replay_wait('2/570CB4E8'); > 2024-09-12 21:58:56.535 UTC [15123:7] LOG: received promote request > 2024-09-12 21:58:56.535 UTC [15124:2] FATAL: terminating walreceiver process > due to administrator command > 2024-09-12 21:58:56.537 UTC [15123:8] LOG: invalid record length at > 0/300D0B0: expected at least 24, got 0 > 2024-09-12 21:58:56.537 UTC [15123:9] LOG: redo done at 0/300D088 system > usage: CPU: user: 0.01 s, system: 0.00 s, > elapsed: 14.23 s > 2024-09-12 21:58:56.537 UTC [15123:10] LOG: last completed transaction was > at log time 2024-09-12 21:58:55.322831+00 > 2024-09-12 21:58:56.540 UTC [15123:11] LOG: selected new timeline ID: 2 > 2024-09-12 21:58:56.589 UTC [15123:12] LOG: archive recovery complete > 2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery > is not in progress > 2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL: > Recovery ended before replaying target LSN > 2/570CB4E8; last replay LSN 0/300D0B0. > 2024-09-12 21:58:56.591 UTC [15121:1] LOG: checkpoint starting: force > 2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: > disconnection: session time: 0:00:00.075 user=android > database=postgres host=[local] > 2024-09-12 21:58:56.595 UTC [15120:4] LOG: database system is ready to > accept connections > 2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG: connection received: > host=[local] > 2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG: connection > authenticated: user="android" method=trust > (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117) > 2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG: connection authorized: > user=android database=postgres > application_name=043_wal_replay_wait.pl > 2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: > CALL pg_wal_replay_wait('0/300D0E8'); > 2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery > is not in progress > 2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting > for LSN can only be executed during recovery. > 2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: > disconnection: session time: 0:00:00.015 user=android > database=postgres host=[local] > > Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait > call wants to wait for 300D0E8. > > pg_waldump -p > src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ > 000000010000000000000003 > shows: > rmgr: Heap len (rec/tot): 59/ 59, tx: 748, lsn: > 0/0300D048, prev 0/0300D020, desc: INSERT off: 35, > flags: 0x00, blkref #0: rel 1663/5/16384 blk 0 > rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: > 0/0300D088, prev 0/0300D048, desc: COMMIT > 2024-09-12 21:58:55.322831 UTC > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS > nextXid 749 latestCompletedXid 748 oldestRunningXid 749 > > I could reproduce this failure on my workstation with bgworker modified > as below: > --- a/src/backend/postmaster/bgwriter.c > +++ b/src/backend/postmaster/bgwriter.c > @@ -69 +69 @@ int BgWriterDelay = 200; > -#define LOG_SNAPSHOT_INTERVAL_MS 15000 > +#define LOG_SNAPSHOT_INTERVAL_MS 15 > @@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t > startup_data_len) > - BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN); > + 1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN); > > > When looking at the test, I noticed probably a typo in the test message: > wait for already replayed LSN exists immediately ... > shouldn't it be "exits" there (though maybe the whole phrase could be > improved)? > > I also suspect that "achieve" is not suitable word in the context of LSNs > and timeouts. Maybe you would find it appropriate to replace it with > "reach"?
Thank you for your report! Please find two patches attached. The first one does minor cleanup including misuse of words you've pointed. The second one adds missing wait_for_catchup(). That should fix the test failure you've spotted. Please, check if it fixes an issue for you. ------ Regards, Alexander Korotkov Supabase
v1-0002-Add-missing-wait_for_catchup-call-in-043_wal_repl.patch
Description: Binary data
v1-0001-Minor-cleanup-related-to-pg_wal_replay_wait-proce.patch
Description: Binary data