On 7/23/25 02:59, Andres Freund wrote:
> Hi,
> 
> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
>> But I don't see why would this have any effect on the prefetch distance,
>> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
>> that. I'd have expected exactly the opposite behavior.
>>
>> Could be bug, of course. But it'd be helpful to see the dataset/query.
> 
> Pgbench scale 500, with the simpler query from my message.
> 

I tried to reproduce this, but I'm not seeing behavior. I'm not sure how
you monitor the queue depth (presumably iostat?), but I added a basic
prefetch info to explain (see the attached WIP patch), reporting the
average prefetch distance, number of stalls (with distance=0) and stream
resets (after filling INDEX_SCAN_MAX_BATCHES).

And I see this (there's a complete explain output attached) for the two
queries from your message [1]. The

simple query:

SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid
LIMIT 10000000);

complex query:

SELECT max(abalance), min(abalance), sum(abalance::numeric),
avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric) FROM
(SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);

The stats actually look *exactly* the same, which makes sense because
it's reading the same index.


   max_batches      distance      stalls      resets      stalls/reset
  --------------------------------------------------------------------
            64           272           3           3                 1
            32            59      122939         653               188
            16            36      108101         1190               90
             8            21       98775         2104               46
             4            11       95627         4556               20

I think this behavior mostly matches my expectations, although it's
interesting the stalls jump so much between 64 and 32 batches.

I did test both with buffered I/O (io_method=sync) and direct I/O
(io_method=worker), and the results are exactly the same for me. Not the
timings, of course, but the prefetch stats.

Of course, maybe there's something wrong in how the stats are collected.
I wonder if maybe we should update the distance in get_block() and not
in next_buffer().

Or maybe there's some interference from having to read the leaf pages
sooner. But I don't see why that would affect the queue depth, fewer
reset should keep the queues fuller I think.


I'll think about adding some sort of distance histogram to the stats.
Maybe something like tinyhist [2] would work here.



[1]
https://www.postgresql.org/message-id/h2n7d7zb2lbkdcemopvrgmteo35zzi5ljl2jmk32vz5f4pziql%407ppr6r6yfv4z

[2] https://github.com/tvondra/tinyhist


regards

-- 
Tomas Vondra
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4835c48b448..5e6c3208955 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -407,12 +407,6 @@ index_beginscan_internal(Relation indexRelation,
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
-	/*
-	 * No batching by default, so set it to NULL. Will be initialized later if
-	 * batching is requested and AM supports it.
-	 */
-	scan->xs_batches = NULL;
-
 	return scan;
 }
 
@@ -463,6 +457,17 @@ index_rescan(IndexScanDesc scan,
 											orderbys, norderbys);
 }
 
+void
+index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets)
+{
+	/* ugly */
+	if (scan->xs_heapfetch->rs != NULL)
+	{
+		read_stream_prefetch_stats(scan->xs_heapfetch->rs,
+					   accum, count, stalls, resets);
+	}
+}
+
 /* ----------------
  *		index_endscan - end a scan
  * ----------------
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e2792ead71..d92f68d0533 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -136,6 +136,7 @@ static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
 static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexprefetch_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1966,6 +1967,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			show_indexsearches_info(planstate, es);
+			show_indexprefetch_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1983,6 +1985,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				ExplainPropertyFloat("Heap Fetches", NULL,
 									 planstate->instrument->ntuples2, 0, es);
 			show_indexsearches_info(planstate, es);
+			show_indexprefetch_info(planstate, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -3889,6 +3892,45 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
 	ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
 }
 
+static void
+show_indexprefetch_info(PlanState *planstate, ExplainState *es)
+{
+	Plan       *plan = planstate->plan;
+
+	int	count = 0,
+		accum = 0,
+		stalls = 0,
+		resets = 0;
+
+	if (!es->analyze)
+		return;
+
+	/* Initialize counters with stats from the local process first */
+	switch (nodeTag(plan))
+	{
+		case T_IndexScan:
+			{
+				IndexScanState *indexstate = ((IndexScanState *) planstate);
+
+				count = indexstate->iss_PrefetchCount;
+				accum = indexstate->iss_PrefetchAccum;
+				stalls = indexstate->iss_PrefetchStalls;
+				resets = indexstate->iss_ResetCount;
+				break;
+			}
+		default:
+			break;
+	}
+
+	if (count > 0)
+	{
+		ExplainPropertyFloat("Prefetch Distance", NULL, (accum * 1.0 / count), 3, es);
+		ExplainPropertyUInteger("Prefetch Stalls", NULL, stalls, es);
+		ExplainPropertyUInteger("Prefetch Resets", NULL, resets, es);
+	}
+}
+
+
 /*
  * Show exact/lossy pages for a BitmapHeapScan node
  */
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 7fcaa37fe62..5511732aad2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -125,6 +125,12 @@ IndexNext(IndexScanState *node)
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 	}
 
+	index_get_prefetch_stats(scandesc,
+				 &node->iss_PrefetchAccum,
+				 &node->iss_PrefetchCount,
+				 &node->iss_PrefetchStalls,
+				 &node->iss_ResetCount);
+
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
@@ -1088,6 +1094,11 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 		indexstate->iss_RuntimeContext = NULL;
 	}
 
+	indexstate->iss_PrefetchAccum = 0;
+	indexstate->iss_PrefetchCount = 0;
+	indexstate->iss_PrefetchStalls = 0;
+	indexstate->iss_ResetCount = 0;
+
 	/*
 	 * all done.
 	 */
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 0e7f5557f5c..529cb5dcbee 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -106,6 +106,11 @@ struct ReadStream
 	bool		advice_enabled;
 	bool		temporary;
 
+	int		distance_accum;
+	int		distance_count;
+	int		distance_stalls;
+	int		reset_count;
+
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
 	 * control problems when I/Os are split.
@@ -681,6 +686,11 @@ read_stream_begin_impl(int flags,
 	stream->seq_until_processed = InvalidBlockNumber;
 	stream->temporary = SmgrIsTemp(smgr);
 
+	stream->distance_accum = 0;
+	stream->distance_count = 0;
+	stream->distance_stalls = 0;
+	stream->reset_count = 0;
+
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
 	 * reading the whole relation.  This way we start out assuming we'll be
@@ -772,6 +782,16 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	Buffer		buffer;
 	int16		oldest_buffer_index;
 
+	if (stream->distance > 0)
+	{
+		stream->distance_accum += stream->distance;
+		stream->distance_count += 1;
+	}
+	else
+	{
+		stream->distance_stalls += 1;
+	}
+
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 
 	/*
@@ -1046,6 +1066,8 @@ read_stream_reset(ReadStream *stream)
 
 	/* Start off assuming data is cached. */
 	stream->distance = 1;
+
+	stream->reset_count += 1;
 }
 
 /*
@@ -1057,3 +1079,12 @@ read_stream_end(ReadStream *stream)
 	read_stream_reset(stream);
 	pfree(stream);
 }
+
+void
+read_stream_prefetch_stats(ReadStream *stream, int *accum, int *count, int *stalls, int *resets)
+{
+	*accum = stream->distance_accum;
+	*count = stream->distance_count;
+	*stalls = stream->distance_stalls;
+	*resets = stream->reset_count;
+}
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 3a3a44be3a5..51c85414b0a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -235,6 +235,7 @@ extern bytea *index_opclass_options(Relation indrel, AttrNumber attnum,
 									Datum attoptions, bool validate);
 extern IndexScanBatch index_batch_alloc(int maxitems, bool want_itup);
 
+extern void index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets);
 
 /*
  * index access method support routines (in genam.c)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f81..ae18c6a5125 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1722,6 +1722,11 @@ typedef struct IndexScanState
 	IndexScanInstrumentation iss_Instrument;
 	SharedIndexScanInstrumentation *iss_SharedInfo;
 
+	int		iss_PrefetchAccum;
+	int		iss_PrefetchCount;
+	int		iss_PrefetchStalls;
+	int		iss_ResetCount;
+
 	/* These are needed for re-checking ORDER BY expr ordering */
 	pairingheap *iss_ReorderQueue;
 	bool		iss_ReachedEnd;
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 9b0d65161d0..53c754a3b3d 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -102,4 +102,6 @@ extern ReadStream *read_stream_begin_smgr_relation(int flags,
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
 
+extern void read_stream_prefetch_stats(ReadStream *stream, int *accum, int *count, int *stalls, int *resets);
+
 #endif							/* READ_STREAM_H */
INDEX_SCAN_MAX_BATCHES 64

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191305
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191305
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 271.915
               Prefetch Stalls: 3
               Prefetch Resets: 3
               Buffers: shared read=191305
 Planning:
   Buffers: shared hit=46 read=22
 Planning Time: 3.996 ms
 Execution Time: 3593.458 ms
(14 rows)

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191305
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191305
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 271.915
               Prefetch Stalls: 3
               Prefetch Resets: 3
               Buffers: shared read=191305
 Planning:
   Buffers: shared hit=61 read=25
 Planning Time: 4.211 ms
 Execution Time: 5163.463 ms
(14 rows)


INDEX_SCAN_MAX_BATCHES 32

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191290
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191290
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 58.670
               Prefetch Stalls: 122939
               Prefetch Resets: 653
               Buffers: shared read=191290
 Planning:
   Buffers: shared hit=46 read=22
 Planning Time: 3.797 ms
 Execution Time: 3511.968 ms
(14 rows)

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191290
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191290
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 58.670
               Prefetch Stalls: 122939
               Prefetch Resets: 653
               Buffers: shared read=191290
 Planning:
   Buffers: shared hit=61 read=25
 Planning Time: 4.108 ms
 Execution Time: 5150.282 ms
(14 rows)


INDEX_SCAN_MAX_BATCHES 16
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191271
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191271
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 35.562
               Prefetch Stalls: 108101
               Prefetch Resets: 1190
               Buffers: shared read=191271
 Planning:
   Buffers: shared hit=46 read=22
 Planning Time: 3.965 ms
 Execution Time: 3568.817 ms
(14 rows)

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191271
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191271
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 35.562
               Prefetch Stalls: 108101
               Prefetch Resets: 1190
               Buffers: shared read=191271
 Planning:
   Buffers: shared hit=61 read=25
 Planning Time: 4.371 ms
 Execution Time: 5108.845 ms
(14 rows)


INDEX_SCAN_MAX_BATCHES 8
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191270
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191270
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 21.344
               Prefetch Stalls: 98775
               Prefetch Resets: 2104
               Buffers: shared read=191270
 Planning:
   Buffers: shared hit=46 read=22
 Planning Time: 3.706 ms
 Execution Time: 3646.034 ms
(14 rows)

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191270
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191270
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 21.344
               Prefetch Stalls: 98775
               Prefetch Resets: 2104
               Buffers: shared read=191270
 Planning:
   Buffers: shared hit=61 read=25
 Planning Time: 4.280 ms
 Execution Time: 5198.426 ms
(14 rows)


INDEX_SCAN_MAX_BATCHES 4
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191266
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191266
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 10.688
               Prefetch Stalls: 95627
               Prefetch Resets: 4556
               Buffers: shared read=191266
 Planning:
   Buffers: shared hit=46 read=22
 Planning Time: 3.799 ms
 Execution Time: 3784.910 ms
(14 rows)

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared read=191266
   ->  Limit (actual rows=10000000.00 loops=1)
         Buffers: shared read=191266
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 10.688
               Prefetch Stalls: 95627
               Prefetch Resets: 4556
               Buffers: shared read=191266
 Planning:
   Buffers: shared hit=61 read=25
 Planning Time: 4.330 ms
 Execution Time: 5195.308 ms
(14 rows)

Reply via email to