On 7/1/19 12:55 PM, Jamison, Kirk wrote:
> On Wednesday, June 26, 2019 6:10 PM (GMT+9), Adrien Nayrat wrote:
>> As far as I remember, you should see "relation" wait events (type lock) on
>> the standby server. This is due to the startup process acquiring an
>> AccessExclusiveLock for the truncation and other backends waiting to acquire
>> a lock to read the table.
>
> Hi Adrien, thank you for taking the time to reply.
>
> I understand that RelationTruncate() can block read-only queries on the
> standby during redo. However, it's difficult for me to reproduce the test
> case where I need to catch that wait for a relation lock, because one has to
> execute the SELECT within the few milliseconds of redoing the truncation of
> one table.
Yes, that is why your test measuring VACUUM execution time is better, as it is
more reproducible.
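For the record, if you do want to catch it, the blocked backend should show a
"relation" wait event of type "Lock" on the standby while the startup process
redoes the truncation. Something along these lines, run on the standby during
redo, should show it (the wait_event columns exist since 9.6):

    SELECT pid, wait_event_type, wait_event, state, query
      FROM pg_stat_activity
     WHERE wait_event_type = 'Lock'
       AND wait_event = 'relation';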
>
> Instead, I just measured the whole recovery time, which includes smgr_redo(),
> to show the recovery improvement compared to head. Please refer below.
>
> [Recovery Test]
> I used the same stored functions and configuration as in the previous email,
> and created a "test" database.
>
> $ createdb test
> $ psql -d test
>
> 1. [Primary] Create 10,000 relations.
> test=# SELECT create_tables(10000);
>
> 2. [P] Insert one row in each table.
> test=# SELECT insert_tables(10000);
>
> 3. [P] Delete row of each table.
> test=# SELECT delfrom_tables(10000);
>
> 4. [Standby] WAL application is paused on the standby server.
> test=# SELECT pg_wal_replay_pause();
>
> 5. [P] VACUUM is executed on the primary side, and its execution time is measured.
>
> test=# \timing on
> test=# VACUUM;
>
> Alternatively, you may use:
> $ time psql -d test -c 'VACUUM;'
> (Note: WAL has not been replayed on the standby because replay is paused.)
>
> 6. [P] Wait until VACUUM has finished execution. Then, stop the primary server.
> $ pg_ctl stop -w
>
> 7. [S] Resume WAL replay, then promote standby (failover).
> I used a shell script to resume WAL replay and promote the standby server,
> because it's difficult to measure the recovery time by hand. Please refer to
> the script below.
> - "SELECT pg_wal_replay_resume();" is executed and the WAL application is
> resumed.
> - "pg_ctl promote" to promote standby.
> - The time difference of "select pg_is_in_recovery();" from "t" to "f" is
> measured.
>
> shell script:
>
> PGDT=/path_to_storage_directory/
>
> if [ "$1" = "resume" ]; then
> psql -c "SELECT pg_wal_replay_resume();" test
> date +%Y/%m/%d_%H:%M:%S.%3N
> pg_ctl promote -D ${PGDT}
> set +x
> date +%Y/%m/%d_%H:%M:%S.%3N
> while [ 1 ]
> do
> RS=`psql -Atc "select pg_is_in_recovery();" test`
> if [ ${RS} = "f" ]; then
> break
> fi
> done
> date +%Y/%m/%d_%H:%M:%S.%3N
> set -x
> exit 0
> fi
>
>
> [Test Results]
> shared_buffers = 24GB
>
> 1. HEAD
> (wal replay resumed)
> 2019/07/01_08:48:50.326
> server promoted
> 2019/07/01_08:49:50.482
> 2019/07/01_09:02:41.051
>
> Recovery Time:
> 13 min 50.725 s -> Time difference from WAL replay to complete recovery
> 12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
>
> 2. PATCH
> (wal replay resumed)
> 2019/07/01_07:34:26.766
> server promoted
> 2019/07/01_07:34:57.790
> 2019/07/01_07:34:57.809
>
> Recovery Time:
> 31.043 s -> Time difference from WAL replay to complete recovery
> 00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
>
> [Conclusion]
> The recovery time improved significantly compared to head,
> from about 13 minutes to about 30 seconds.
>
> Any thoughts?
> I'd really appreciate your comments/feedback about the patch and/or test.
>
>
Thanks for the time you spent on this test, it is a huge win!
Although creating 10k tables and deleting tuples is not a common use case, it is
still good to know how your patch performs.
I will try to look deeper into your patch, but my knowledge of Postgres internals
is limited :)
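
For readers who don't want to dig up the previous mail, the helper functions
are roughly of this shape (my own sketch; the originals from Kirk's earlier
message may differ in details):

    CREATE OR REPLACE FUNCTION create_tables(n int) RETURNS void AS $$
    BEGIN
        FOR i IN 1..n LOOP
            EXECUTE format('CREATE TABLE t%s (id int)', i);
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    CREATE OR REPLACE FUNCTION insert_tables(n int) RETURNS void AS $$
    BEGIN
        FOR i IN 1..n LOOP
            EXECUTE format('INSERT INTO t%s VALUES (1)', i);
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    CREATE OR REPLACE FUNCTION delfrom_tables(n int) RETURNS void AS $$
    BEGIN
        FOR i IN 1..n LOOP
            EXECUTE format('DELETE FROM t%s', i);
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

Each table ends up with a single dead row, so the subsequent VACUUM empties and
truncates every relation, generating the truncation WAL records the test relies on.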
--
Adrien