Hi,

One customer complained that a physical standby had significant replay lag (a few hours) on modern, powerful bare-metal servers. According to the output of "top", the startup process and a few kernel workers were saturating CPU, mostly in kernel space.

The incident was resolved by disabling recovery_prefetch after a quick analysis of perf and strace output.

Problem description
---------------------

The current XLog prefetching code doesn't consider that a shared buffer may be populated from a WAL full-page image (FPI) rather than read from disk, and issues fadvise calls in such cases. This results in a large number of useless fadvise calls and slows down the WAL apply process.

full_page_writes is enabled by default, so after a checkpoint the first modification of each page is logged into WAL together with a full page image. The startup process reads full page images from the WAL stream and loads them into shared buffers via XLogReadBufferForRedoExtended.

It seems there is no point in prefetching a block from disk if its FPI was logged recently.

Test case
----------

To reproduce the useless prefetches, create a master instance and a physical standby with the default shared_buffers (128MiB), then insert test data slightly larger than shared buffers (32818 pages):

        drop table if exists t1;
        create table t1 (id int, k int, val text ) with (fillfactor = 10, autovacuum_enabled=off);
        insert into t1 (id,k,val)
        select i, 0, repeat('x', 800) from generate_series(1,32*1024 + 50) as i;
        checkpoint;

Reset the stats on the standby server:

        select pg_stat_reset_shared('io');
        select pg_stat_reset_shared('recovery_prefetch');

Update the table on the master and check the stats on the standby:

        checkpoint;
        update t1 set k = k + 1;

        select object, context, reads, stats_reset from pg_stat_io where backend_type = 'startup';
        select * from pg_stat_recovery_prefetch;

Here is the output:

        postgres=# select object, context, reads, stats_reset from pg_stat_io where backend_type = 'startup';
          object  |  context  | reads |          stats_reset
        ----------+-----------+-------+-------------------------------
         relation | bulkread  |     0 | 2025-03-29 16:38:45.664648+03
         relation | bulkwrite |     0 | 2025-03-29 16:38:45.664648+03
         relation | init      |     0 | 2025-03-29 16:38:45.664648+03
         relation | normal    |     9 | 2025-03-29 16:38:45.664648+03
         relation | vacuum    |     0 | 2025-03-29 16:38:45.664648+03
         wal      | init      |       | 2025-03-29 16:38:45.664648+03
         wal      | normal    |  8100 | 2025-03-29 16:38:45.664648+03
        (7 rows)

        postgres=# select * from pg_stat_recovery_prefetch;
                  stats_reset          | prefetch | hit | skip_init | skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
        -------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
         2025-03-29 16:38:45.665322+03 |    32383 | 435 |         0 |        0 |    32818 |    32818 |            0 |              0 |        0
        (1 row)

The startup process called fadvise 32383 times and skipped 32818 FPW blocks, but performed only 9 relation reads.

Idea how to fix it
-------------------
Currently, XLogPrefetcher skips FPW blocks without any further tracking.
It seems more correct to record such blocks so that subsequent prefetches for them can be avoided.

Please find attached a patch implementing this idea. It adds the block to recent_block and recent_rlocator.

With the patch applied, the stats are as follows:

        postgres=# select object, context, reads, stats_reset from pg_stat_io where backend_type = 'startup';
          object  |  context  | reads |          stats_reset
        ----------+-----------+-------+-------------------------------
         relation | bulkread  |     0 | 2025-03-30 01:27:45.981169+03
         relation | bulkwrite |     0 | 2025-03-30 01:27:45.981169+03
         relation | init      |     0 | 2025-03-30 01:27:45.981169+03
         relation | normal    |     9 | 2025-03-30 01:27:45.981169+03
         relation | vacuum    |     0 | 2025-03-30 01:27:45.981169+03
         wal      | init      |       | 2025-03-30 01:27:45.981169+03
         wal      | normal    |  7895 | 2025-03-30 01:27:45.981169+03
        (7 rows)

        postgres=# select * from pg_stat_recovery_prefetch;
                  stats_reset          | prefetch | hit | skip_init | skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
        -------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
         2025-03-30 01:27:46.971602+03 |        0 |   0 |         0 |        0 |    32818 |    65636 |            0 |              0 |        0
        (1 row)

i.e. there are no more useless fadvise() calls, and the number of skipped repeated blocks has increased by the number of modified blocks.

Please feel free to ask any questions!

Thank you!

--
Michael Zhilin
Postgres Professional
https://www.postgrespro.ru
From 7367cce118e5196b7e916c1d8e8403198c5334a5 Mon Sep 17 00:00:00 2001
From: Michael Zhilin <miz...@freebsd.org>
Date: Sun, 30 Mar 2025 01:30:37 +0300
Subject: [PATCH v1] Avoid useless prefetches in case of recent FPI WAL records

If a block is logged into WAL with a full page image, the startup process
reads the image from the WAL stream and loads it into shared buffers via
XLogReadBufferForRedoExtended. Any further WAL records related to that
block should be skipped by XLogPrefetcher.
---
 src/backend/access/transam/xlogprefetcher.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 7735562db01..3378b72fd96 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -51,7 +51,7 @@
 
 /*
  * To detect repeated access to the same block and skip useless extra system
- * calls, we remember a small window of recently prefetched blocks.
+ * calls, we remember a small window of recently prefetched / FPW blocks.
  */
 #define XLOGPREFETCHER_SEQ_WINDOW_SIZE 4
 
@@ -674,6 +674,25 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 			if (block->has_image)
 			{
 				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				/*
+				 * A full page image is copied into a shared buffer during
+				 * replay, so subsequent WAL records related to this block
+				 * don't require prefetching.
+				 */
+				for (int i = 0; i < XLOGPREFETCHER_SEQ_WINDOW_SIZE; ++i)
+				{
+					if (block->blkno == prefetcher->recent_block[i] &&
+						RelFileLocatorEquals(block->rlocator, prefetcher->recent_rlocator[i]))
+					{
+						return LRQ_NEXT_NO_IO;
+					}
+				}
+
+				prefetcher->recent_rlocator[prefetcher->recent_idx] = block->rlocator;
+				prefetcher->recent_block[prefetcher->recent_idx] = block->blkno;
+				prefetcher->recent_idx =
+					(prefetcher->recent_idx + 1) % XLOGPREFETCHER_SEQ_WINDOW_SIZE;
+
 				return LRQ_NEXT_NO_IO;
 			}
 
-- 
2.49.0
