Hi,

One customer complained that a physical standby had significant replay lag (a few hours) on modern, powerful bare-metal servers. According to the output of "top", the startup process and a few kernel workers were saturating CPU, mostly in kernel space.

The incident was resolved by disabling recovery_prefetch after a quick analysis of perf and strace output.

Problem description
---------------------

The current XLog prefetching code doesn't consider that a shared buffer may be populated from a WAL full-page image (FPI) rather than read from disk, and issues fadvise calls in such cases. This results in a large number of useless fadvise calls and slows down the WAL apply process.

full_page_writes is enabled by default, so after a checkpoint the first modification of each page is logged into WAL together with a full page image. The startup process reads full page images from the WAL stream and loads them into shared buffers via XLogReadBufferForRedoExtended.

It seems there is no point in prefetching a block from disk if its FPI was logged recently.

Test case
----------

To reproduce the useless prefetches, create a master instance and a physical standby with the default shared_buffers (128MiB), then insert test data slightly larger than shared buffers (32818 pages):

        drop table if exists t1;
        create table t1 (id int, k int, val text ) with (fillfactor = 10, autovacuum_enabled=off);
        insert into t1 (id,k,val)
        select i, 0, repeat('x', 800) from generate_series(1,32*1024 + 50) as i;
        checkpoint;

Reset the stats on the standby server:

        select pg_stat_reset_shared('io');
        select pg_stat_reset_shared('recovery_prefetch');

Update the table on the master and check the stats on the standby:

        checkpoint;
        update t1 set k = k + 1;

        select object, context, reads, stats_reset from pg_stat_io where backend_type = 'startup';
        select * from pg_stat_recovery_prefetch;

Here is the output:

        postgres=# select object, context, reads, stats_reset from pg_stat_io where backend_type = 'startup';
          object  |  context  | reads |          stats_reset
        ----------+-----------+-------+-------------------------------
         relation | bulkread  |     0 | 2025-03-29 16:38:45.664648+03
         relation | bulkwrite |     0 | 2025-03-29 16:38:45.664648+03
         relation | init      |     0 | 2025-03-29 16:38:45.664648+03
         relation | normal    |     9 | 2025-03-29 16:38:45.664648+03
         relation | vacuum    |     0 | 2025-03-29 16:38:45.664648+03
         wal      | init      |       | 2025-03-29 16:38:45.664648+03
         wal      | normal    |  8100 | 2025-03-29 16:38:45.664648+03
        (7 rows)

        postgres=# select * from pg_stat_recovery_prefetch;
                  stats_reset          | prefetch | hit | skip_init | skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
        -------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
         2025-03-29 16:38:45.665322+03 |    32383 | 435 |         0 |        0 |    32818 |    32818 |            0 |              0 |        0
        (1 row)

The startup process called fadvise 32383 times and skipped 32818 FPW blocks, but performed only 9 relation reads.

Idea how to fix it
-------------------
Currently, XLogPrefetcher skips FPW blocks without any further tracking.
It seems more correct to record such blocks so that subsequent prefetches for them can be avoided.

Please find attached a patch implementing this idea. It adds the block to recent_block and recent_rlocator.

With the patch applied, the stats are as follows:

        postgres=# select object, context, reads, stats_reset from pg_stat_io where backend_type = 'startup';
          object  |  context  | reads |          stats_reset
        ----------+-----------+-------+-------------------------------
         relation | bulkread  |     0 | 2025-03-30 01:27:45.981169+03
         relation | bulkwrite |     0 | 2025-03-30 01:27:45.981169+03
         relation | init      |     0 | 2025-03-30 01:27:45.981169+03
         relation | normal    |     9 | 2025-03-30 01:27:45.981169+03
         relation | vacuum    |     0 | 2025-03-30 01:27:45.981169+03
         wal      | init      |       | 2025-03-30 01:27:45.981169+03
         wal      | normal    |  7895 | 2025-03-30 01:27:45.981169+03
        (7 rows)

        postgres=# select * from pg_stat_recovery_prefetch;
                  stats_reset          | prefetch | hit | skip_init | skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
        -------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
         2025-03-30 01:27:46.971602+03 |        0 |   0 |         0 |        0 |    32818 |    65636 |            0 |              0 |        0
        (1 row)

i.e. there are no more useless fadvise() calls, and the number of skipped repeated blocks has increased by the number of modified blocks.

Please feel free to ask any questions!

Thank you!

--
Michael Zhilin
Postgres Professional
https://www.postgrespro.ru
From 7367cce118e5196b7e916c1d8e8403198c5334a5 Mon Sep 17 00:00:00 2001
From: Michael Zhilin <miz...@freebsd.org>
Date: Sun, 30 Mar 2025 01:30:37 +0300
Subject: [PATCH v1] Avoid useless prefetches in case of recent FPI WAL records

If a block is logged into WAL with a full page image, the startup process
reads the image from the WAL stream and loads it into shared buffers via
XLogReadBufferForRedoExtended. Any further WAL records related to that
block should be skipped by XLogPrefetcher.
---
 src/backend/access/transam/xlogprefetcher.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 7735562db01..3378b72fd96 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -51,7 +51,7 @@
 
 /*
  * To detect repeated access to the same block and skip useless extra system
- * calls, we remember a small window of recently prefetched blocks.
+ * calls, we remember a small window of recently prefetched / FPW blocks.
  */
 #define XLOGPREFETCHER_SEQ_WINDOW_SIZE 4
 
@@ -674,6 +674,25 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 			if (block->has_image)
 			{
 				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				/*
+				 * A full page image is copied into a shared buffer during
+				 * replay, so subsequent WAL records related to this block
+				 * don't require prefetching.
+				 */
+				for (int i = 0; i < XLOGPREFETCHER_SEQ_WINDOW_SIZE; ++i)
+				{
+					if (block->blkno == prefetcher->recent_block[i] &&
+						RelFileLocatorEquals(block->rlocator, prefetcher->recent_rlocator[i]))
+					{
+						return LRQ_NEXT_NO_IO;
+					}
+				}
+
+				prefetcher->recent_rlocator[prefetcher->recent_idx] = block->rlocator;
+				prefetcher->recent_block[prefetcher->recent_idx] = block->blkno;
+				prefetcher->recent_idx =
+					(prefetcher->recent_idx + 1) % XLOGPREFETCHER_SEQ_WINDOW_SIZE;
+
 				return LRQ_NEXT_NO_IO;
 			}
 
-- 
2.49.0
