Re: [HACKERS] make async slave to wait for lsn to be replayed

Alexander Korotkov Mon, 16 Sep 2024 11:56:22 -0700

Hi, Alexander!

On Fri, Sep 13, 2024 at 3:00 PM Alexander Lakhin <[email protected]> wrote:
>
> 10.08.2024 20:18, Alexander Korotkov wrote:
> > On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <[email protected]> 
> > wrote:
> >> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <[email protected]> 
> >> wrote:
> >> ...
> >> Here is a revised version of the patchset.  I've fixed some typos,
> >> identation, etc.  I'm going to push this once it passes cfbot.
> > The next revison of the patchset fixes uninitialized variable usage
> > spotted by cfbot.
>
> When running check-world on a rather slow armv7 device, I came across the
> 043_wal_replay_wait.pl test failure:
> t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was 
> declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 8.
>
> regress_log_043_wal_replay_wait contains:
> ...
> 01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent 
> data
> ### Promoting node "standby"
> # Running: pg_ctl -D 
> .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l
> .../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
> waiting for server to promote.... done
> server promoted
> [21:58:56.637](0.268s) ok 8 - got error after standby promote
> error running SQL: 'psql:<stdin>:1: ERROR:  recovery is not in progress
> HINT:  Waiting for LSN can only be executed during recovery.'
> while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht 
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
> pg_wal_replay_wait('0/300D0E8');' at 
> .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 
> 2140.
>
> 043_wal_replay_wait_standby.log contains:
> 2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG:  connection received: 
> host=[local]
> 2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG:  connection 
> authenticated: user="android" method=trust
> (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
> 2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG:  connection authorized: 
> user=android database=postgres
> application_name=043_wal_replay_wait.pl
> 2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: 
> CALL pg_wal_replay_wait('2/570CB4E8');
> 2024-09-12 21:58:56.535 UTC [15123:7] LOG:  received promote request
> 2024-09-12 21:58:56.535 UTC [15124:2] FATAL:  terminating walreceiver process 
> due to administrator command
> 2024-09-12 21:58:56.537 UTC [15123:8] LOG:  invalid record length at 
> 0/300D0B0: expected at least 24, got 0
> 2024-09-12 21:58:56.537 UTC [15123:9] LOG:  redo done at 0/300D088 system 
> usage: CPU: user: 0.01 s, system: 0.00 s,
> elapsed: 14.23 s
> 2024-09-12 21:58:56.537 UTC [15123:10] LOG:  last completed transaction was 
> at log time 2024-09-12 21:58:55.322831+00
> 2024-09-12 21:58:56.540 UTC [15123:11] LOG:  selected new timeline ID: 2
> 2024-09-12 21:58:56.589 UTC [15123:12] LOG:  archive recovery complete
> 2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery 
> is not in progress
> 2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL:  
> Recovery ended before replaying target LSN
> 2/570CB4E8; last replay LSN 0/300D0B0.
> 2024-09-12 21:58:56.591 UTC [15121:1] LOG:  checkpoint starting: force
> 2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: 
> disconnection: session time: 0:00:00.075 user=android
> database=postgres host=[local]
> 2024-09-12 21:58:56.595 UTC [15120:4] LOG:  database system is ready to 
> accept connections
> 2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG:  connection received: 
> host=[local]
> 2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG:  connection 
> authenticated: user="android" method=trust
> (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
> 2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG:  connection authorized: 
> user=android database=postgres
> application_name=043_wal_replay_wait.pl
> 2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: 
> CALL pg_wal_replay_wait('0/300D0E8');
> 2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery 
> is not in progress
> 2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting 
> for LSN can only be executed during recovery.
> 2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: 
> disconnection: session time: 0:00:00.015 user=android
> database=postgres host=[local]
>
> Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait
> call wants to wait for 300D0E8.
>
> pg_waldump -p 
> src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 
> 000000010000000000000003
> shows:
> rmgr: Heap        len (rec/tot):     59/    59, tx:        748, lsn: 
> 0/0300D048, prev 0/0300D020, desc: INSERT off: 35,
> flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
> rmgr: Transaction len (rec/tot):     34/    34, tx:        748, lsn: 
> 0/0300D088, prev 0/0300D048, desc: COMMIT
> 2024-09-12 21:58:55.322831 UTC
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 
> 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS
> nextXid 749 latestCompletedXid 748 oldestRunningXid 749
>
> I could reproduce this failure on my workstation with bgworker modified
> as below:
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -69 +69 @@ int            BgWriterDelay = 200;
> -#define LOG_SNAPSHOT_INTERVAL_MS 15000
> +#define LOG_SNAPSHOT_INTERVAL_MS 15
> @@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t 
> startup_data_len)
> -                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
> +                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
>
>
> When looking at the test, I noticed probably a typo in the test message:
> wait for already replayed LSN exists immediately ...
> shouldn't it be "exits" there (though maybe the whole phrase could be
> improved)?
>
> I also suspect that "achieve" is not suitable word in the context of LSNs
> and timeouts. Maybe you would find it appropriate to replace it with
> "reach"?


Thank you for your report!

Please find two patches attached.  The first one does minor cleanup
including misuse of words you've pointed.  The second one adds missing
wait_for_catchup().  That should fix the test failure you've spotted.
Please, check if it fixes an issue for you.

------
Regards,
Alexander Korotkov
Supabase

v1-0002-Add-missing-wait_for_catchup-call-in-043_wal_repl.patch
Description: Binary data

v1-0001-Minor-cleanup-related-to-pg_wal_replay_wait-proce.patch
Description: Binary data

Re: [HACKERS] make async slave to wait for lsn to be replayed

Reply via email to