Dear hackers,

I did some benchmarks with the patch. In more detail, a pub-sub replication system was built and TPS was measured on the subscriber. The results show that performance can degrade when wal_receiver_status_interval is long. This is expected behavior, because the patch retains more dead tuples on the subscriber side. We also considered a new mechanism that dynamically tunes the period between status requests, and confirmed that it reduces the regression. This is described in the latter part.
The detailed report follows below.

## Motivation - why the benchmark is needed

The v15 patch set introduces a new replication slot on the subscriber side to retain the tuples needed for update_deleted detection. However, this may affect the performance of query execution on the subscriber, because 1) the number of tuples to be scanned increases, and 2) HOT updates may no longer be possible. The second issue comes from the fact that a HOT update works only when both tuple versions can be located on the same page. For these reasons I ran benchmark tests on the subscriber. The variable of the measurement is wal_receiver_status_interval, which controls the interval between status requests.

## Used source code

HEAD was 962da900, with the v15 patch set applied on top.

## Environment

A RHEL 7 machine with 755GB of memory, 4 physical CPUs, and 120 logical processors.

## Workload

1. Constructed a pub-sub replication system. Parameters for both instances were:

   shared_buffers = 30GB
   min_wal_size = 10GB
   max_wal_size = 20GB
   autovacuum = false
   track_commit_timestamp = on (only for the subscriber)

2. Ran pgbench in initialize mode. The scale factor was set to 100.
3. Ran pgbench against the subscriber with 60 clients. The duration was 120s, and the measurement was repeated 5 times.

The attached script automates the above steps. You can specify the source type in measure.sh and run it.

## Comparison

The performance test was done for HEAD and for the patched source code. In the patched case, the "detect_update_deleted" parameter was set to on. Also, wal_receiver_status_interval was varied among 1s, 10s, and 100s to check its effect.

The table in appendix [1] shows the results. The regression grows with wal_receiver_status_interval: the TPS regression is roughly 5% (interval=1s) -> 25% (interval=10s) -> 55% (interval=100s). The attached png file visualizes the results; each bar shows the median.
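For reference, the workload steps above can be sketched as a short shell script. This is a minimal sketch only: the ports and the target instance of each step are my assumptions, and the attached measure.sh/setup.sh remain the authoritative automation. The sketch prints the commands instead of executing them, so it can be inspected without a running cluster; drop the echo to actually run them.

```shell
# Minimal sketch of the benchmark driver (illustrative only; see the
# attached measure.sh/setup.sh for the real automation).
SCALE=100        # pgbench scale factor
CLIENTS=60       # concurrent clients
DURATION=120     # seconds per run
RUNS=5           # repetitions per configuration
PUB_PORT=5432    # publisher port (assumption)
SUB_PORT=5433    # subscriber port (assumption)

# Step 2: initialize the pgbench tables.
echo "pgbench -i -s $SCALE -p $PUB_PORT postgres"

# Step 3: run the benchmark against the subscriber, repeated 5 times.
count=0
for run in $(seq 1 "$RUNS"); do
    echo "pgbench -c $CLIENTS -T $DURATION -p $SUB_PORT postgres"
    count=$((count + 1))
done
```

The TPS numbers in the appendix would then come from the "tps = ..." line that pgbench prints at the end of each run.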
## Analysis

I attached perf to the backend and found that heapam_index_fetch_tuple() consumed much CPU time only in the patched case [2]. I also checked the pg_stat_all_tables view and found that HOT updates rarely happened only in the patched case [3]. This means that whether the backend could do a HOT update is the dominant factor.

When detect_update_deleted = on, the additional slot is defined on the subscriber side and is advanced based on the activity; the frequency is determined by wal_receiver_status_interval. In the interval=100s case the interval is relatively long for this workload, so dead tuples remained, which made query processing slower. This result means that users may have to tune the interval based on their workload. However, it is difficult to predict an appropriate value.

## Experiment - dynamic period tuning

Based on the above, Hou and I discussed off-list and implemented a new mechanism that tunes the duration between status requests dynamically. The basic idea is similar to what the slotsync worker does: the request interval is initially 100ms and doubles if there has been no XID assignment since the last advancement. The maximum value is either wal_receiver_status_interval or 3min. Benchmark results with this mechanism are shown in [4]. Here wal_receiver_status_interval is unchanged, so we can compare with the HEAD, interval=10s case in [1]: 59536 vs 59597. The regression is less than 1%. The idea has already been included in v16-0002; please refer to it.

## Experiment - shorter interval

Just in case, I also tested an extreme case in which wal_receiver_status_interval is quite short: 10ms. To make the interval shorter I applied the attached patch in both cases. Results are shown in [5]. No regression is observed; the patched case is even slightly better (I think this is caused by randomness). This experiment also supports the conclusion that the regression is caused by the retained dead tuples.
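To make the doubling scheme concrete, here is a small shell sketch of the backoff logic. The function name and the reset-to-minimum-on-activity behavior are my assumptions for illustration; the real implementation is in v16-0002.

```shell
# Sketch of the dynamic status-request interval (illustrative; the
# actual logic lives in v16-0002).
MIN_INTERVAL_MS=100                     # initial interval: 100ms
WAL_RECEIVER_STATUS_INTERVAL_MS=10000   # 10s, as in the benchmark
HARD_CAP_MS=180000                      # 3min upper bound

interval_ms=$MIN_INTERVAL_MS

# next_interval <xid_assigned: 0|1> -- updates interval_ms.
next_interval() {
    if [ "$1" -eq 0 ]; then
        # No XID assigned since the last advancement: back off.
        interval_ms=$((interval_ms * 2))
        cap=$WAL_RECEIVER_STATUS_INTERVAL_MS
        if [ "$cap" -gt "$HARD_CAP_MS" ]; then
            cap=$HARD_CAP_MS
        fi
        if [ "$interval_ms" -gt "$cap" ]; then
            interval_ms=$cap
        fi
    else
        # Activity observed: return to the minimum (assumed behavior).
        interval_ms=$MIN_INTERVAL_MS
    fi
}
```

On an idle system the interval quickly grows to the cap, so almost no status requests are sent, while a busy workload keeps the interval at 100ms so dead tuples are cleaned up promptly.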
## Appendix [1] - result table

Each cell shows the transactions per second of the run.

patched
# run   interval=1s   interval=10s   interval=100s
1       55876         45288          26956
2       56086         45336          26799
3       56121         45129          26753
4       56310         45169          26542
5       55389         45071          26735
median  56086         45169          26753

HEAD
# run   interval=1s   interval=10s   interval=100s
1       59096         59343          59341
2       59671         59413          59281
3       59131         59597          58966
4       59239         59693          59518
5       59165         59631          59487
median  59165         59597          59341

## Appendix [2] - perf analysis

patched:
```
- 58.29% heapam_index_fetch_tuple
   + 38.28% heap_hot_search_buffer
   + 13.88% ReleaseAndReadBuffer
     5.34% heap_page_prune_opt
+ 13.88% ReleaseAndReadBuffer
```

head:
```
- 2.13% heapam_index_fetch_tuple
   1.06% heap_hot_search_buffer
   0.62% heap_page_prune_opt
```

## Appendix [3] - pg_stat

patched
```
postgres=# SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd,
                  n_tup_upd - n_tup_hot_upd AS n_tup_non_hot
           FROM pg_stat_all_tables WHERE relname LIKE 'pgbench%';
     relname      | n_tup_upd | n_tup_hot_upd | n_tup_newpage_upd | n_tup_non_hot
------------------+-----------+---------------+-------------------+---------------
 pgbench_history  |         0 |             0 |                 0 |             0
 pgbench_tellers  |    453161 |         37996 |            415165 |        415165
 pgbench_accounts |    453161 |             0 |            453161 |        453161
 pgbench_branches |    453161 |        272853 |            180308 |        180308
(4 rows)
```

head
```
postgres=# SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd,
                  n_tup_upd - n_tup_hot_upd AS n_tup_non_hot
           FROM pg_stat_all_tables WHERE relname LIKE 'pgbench%';
     relname      | n_tup_upd | n_tup_hot_upd | n_tup_newpage_upd | n_tup_non_hot
------------------+-----------+---------------+-------------------+---------------
 pgbench_history  |         0 |             0 |                 0 |             0
 pgbench_tellers  |   2078197 |       2077583 |               614 |           614
 pgbench_accounts |   2078197 |       1911535 |            166662 |        166662
 pgbench_branches |   2078197 |       2078197 |                 0 |             0
(4 rows)
```

## Appendix [4] - dynamic status request

# run   dynamic (v15 + PoC)
1       59627
2       59536
3       59359
4       59443
5       59541
median  59536

## Appendix [5] - shorter wal_receiver_status_interval

patched
# run   interval=10ms
1       58081
2       57876
3       58083
4       57915
5       57933
median  57933

head
# run   interval=10ms
1       57595
2       57322
3       57271
4       57421
5       57590
median  57421

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments: change_to_ms.diffs, measure.sh, setup.sh