On 7/23/25 02:59, Andres Freund wrote: > Hi, > > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >> But I don't see why would this have any effect on the prefetch distance, >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve >> that. I'd have expected exactly the opposite behavior. >> >> Could be bug, of course. But it'd be helpful to see the dataset/query. > > Pgbench scale 500, with the simpler query from my message. >
I tried to reproduce this, but I'm not seeing behavior. I'm not sure how you monitor the queue depth (presumably iostat?), but I added a basic prefetch info to explain (see the attached WIP patch), reporting the average prefetch distance, number of stalls (with distance=0) and stream resets (after filling INDEX_SCAN_MAX_BATCHES). And I see this (there's a complete explain output attached) for the two queries from your message [1]. The simple query: SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); complex query: SELECT max(abalance), min(abalance), sum(abalance::numeric), avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); The stats actually look *exactly* the same, which makes sense because it's reading the same index. max_batches distance stalls resets stalls/reset -------------------------------------------------------------------- 64 272 3 3 1 32 59 122939 653 188 16 36 108101 1190 90 8 21 98775 2104 46 4 11 95627 4556 20 I think this behavior mostly matches my expectations, although it's interesting the stalls jump so much between 64 and 32 batches. I did test both with buffered I/O (io_method=sync) and direct I/O (io_method=worker), and the results are exactly the same for me. Not the timings, of course, but the prefetch stats. Of course, maybe there's something wrong in how the stats are collected. I wonder if maybe we should update the distance in get_block() and not in next_buffer(). Or maybe there's some interference from having to read the leaf pages sooner. But I don't see why that would affect the queue depth, fewer reset should keep the queues fuller I think. I'll think about adding some sort of distance histogram to the stats. Maybe something like tinyhist [2] would work here. [1] https://www.postgresql.org/message-id/h2n7d7zb2lbkdcemopvrgmteo35zzi5ljl2jmk32vz5f4pziql%407ppr6r6yfv4z [2] https://github.com/tvondra/tinyhist regards -- Tomas Vondra
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 4835c48b448..5e6c3208955 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -407,12 +407,6 @@ index_beginscan_internal(Relation indexRelation, scan->parallel_scan = pscan; scan->xs_temp_snap = temp_snap; - /* - * No batching by default, so set it to NULL. Will be initialized later if - * batching is requested and AM supports it. - */ - scan->xs_batches = NULL; - return scan; } @@ -463,6 +457,17 @@ index_rescan(IndexScanDesc scan, orderbys, norderbys); } +void +index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets) +{ + /* ugly */ + if (scan->xs_heapfetch->rs != NULL) + { + read_stream_prefetch_stats(scan->xs_heapfetch->rs, + accum, count, stalls, resets); + } +} + /* ---------------- * index_endscan - end a scan * ---------------- diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 7e2792ead71..d92f68d0533 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -136,6 +136,7 @@ static void show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es); static void show_hashagg_info(AggState *aggstate, ExplainState *es); static void show_indexsearches_info(PlanState *planstate, ExplainState *es); +static void show_indexprefetch_info(PlanState *planstate, ExplainState *es); static void show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es); static void show_instrumentation_count(const char *qlabel, int which, @@ -1966,6 +1967,7 @@ ExplainNode(PlanState *planstate, List *ancestors, show_instrumentation_count("Rows Removed by Filter", 1, planstate, es); show_indexsearches_info(planstate, es); + show_indexprefetch_info(planstate, es); break; case T_IndexOnlyScan: show_scan_qual(((IndexOnlyScan *) plan)->indexqual, @@ -1983,6 +1985,7 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Heap Fetches", NULL, planstate->instrument->ntuples2, 0, es); show_indexsearches_info(planstate, es); + show_indexprefetch_info(planstate, es); break; case T_BitmapIndexScan: show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig, @@ -3889,6 +3892,45 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es) ExplainPropertyUInteger("Index Searches", NULL, nsearches, es); } +static void +show_indexprefetch_info(PlanState *planstate, ExplainState *es) +{ + Plan *plan = planstate->plan; + + int count = 0, + accum = 0, + stalls = 0, + resets = 0; + + if (!es->analyze) + return; + + /* Initialize counters with stats from the local process first */ + switch (nodeTag(plan)) + { + case T_IndexScan: + { + IndexScanState *indexstate = ((IndexScanState *) planstate); + + count = indexstate->iss_PrefetchCount; + accum = indexstate->iss_PrefetchAccum; + stalls = indexstate->iss_PrefetchStalls; + resets = indexstate->iss_ResetCount; + break; + } + default: + break; + } + + if (count > 0) + { + ExplainPropertyFloat("Prefetch Distance", NULL, (accum * 1.0 / count), 3, es); + ExplainPropertyUInteger("Prefetch Stalls", NULL, stalls, es); + ExplainPropertyUInteger("Prefetch Resets", NULL, resets, es); + } +} + + /* * Show exact/lossy pages for a BitmapHeapScan node */ diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index 7fcaa37fe62..5511732aad2 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -125,6 +125,12 @@ IndexNext(IndexScanState *node) node->iss_OrderByKeys, node->iss_NumOrderByKeys); } + index_get_prefetch_stats(scandesc, + &node->iss_PrefetchAccum, + &node->iss_PrefetchCount, + &node->iss_PrefetchStalls, + &node->iss_ResetCount); + /* * ok, now that we have what we need, fetch the next tuple. */ @@ -1088,6 +1094,11 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags) indexstate->iss_RuntimeContext = NULL; } + indexstate->iss_PrefetchAccum = 0; + indexstate->iss_PrefetchCount = 0; + indexstate->iss_PrefetchStalls = 0; + indexstate->iss_ResetCount = 0; + /* * all done. */ diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 0e7f5557f5c..529cb5dcbee 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -106,6 +106,11 @@ struct ReadStream bool advice_enabled; bool temporary; + int distance_accum; + int distance_count; + int distance_stalls; + int reset_count; + /* * One-block buffer to support 'ungetting' a block number, to resolve flow * control problems when I/Os are split. @@ -681,6 +686,11 @@ read_stream_begin_impl(int flags, stream->seq_until_processed = InvalidBlockNumber; stream->temporary = SmgrIsTemp(smgr); + stream->distance_accum = 0; + stream->distance_count = 0; + stream->distance_stalls = 0; + stream->reset_count = 0; + /* * Skip the initial ramp-up phase if the caller says we're going to be * reading the whole relation. This way we start out assuming we'll be @@ -772,6 +782,16 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) Buffer buffer; int16 oldest_buffer_index; + if (stream->distance > 0) + { + stream->distance_accum += stream->distance; + stream->distance_count += 1; + } + else + { + stream->distance_stalls += 1; + } + #ifndef READ_STREAM_DISABLE_FAST_PATH /* @@ -1046,6 +1066,8 @@ read_stream_reset(ReadStream *stream) /* Start off assuming data is cached. */ stream->distance = 1; + + stream->reset_count += 1; } /* @@ -1057,3 +1079,12 @@ read_stream_end(ReadStream *stream) read_stream_reset(stream); pfree(stream); } + +void +read_stream_prefetch_stats(ReadStream *stream, int *accum, int *count, int *stalls, int *resets) +{ + *accum = stream->distance_accum; + *count = stream->distance_count; + *stalls = stream->distance_stalls; + *resets = stream->reset_count; +} diff --git a/src/include/access/genam.h b/src/include/access/genam.h index 3a3a44be3a5..51c85414b0a 100644 --- a/src/include/access/genam.h +++ b/src/include/access/genam.h @@ -235,6 +235,7 @@ extern bytea *index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions, bool validate); extern IndexScanBatch index_batch_alloc(int maxitems, bool want_itup); +extern void index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets); /* * index access method support routines (in genam.c) diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index e107d6e5f81..ae18c6a5125 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -1722,6 +1722,11 @@ typedef struct IndexScanState IndexScanInstrumentation iss_Instrument; SharedIndexScanInstrumentation *iss_SharedInfo; + int iss_PrefetchAccum; + int iss_PrefetchCount; + int iss_PrefetchStalls; + int iss_ResetCount; + /* These are needed for re-checking ORDER BY expr ordering */ pairingheap *iss_ReorderQueue; bool iss_ReachedEnd; diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h index 9b0d65161d0..53c754a3b3d 100644 --- a/src/include/storage/read_stream.h +++ b/src/include/storage/read_stream.h @@ -102,4 +102,6 @@ extern ReadStream *read_stream_begin_smgr_relation(int flags, extern void read_stream_reset(ReadStream *stream); extern void read_stream_end(ReadStream *stream); +extern void read_stream_prefetch_stats(ReadStream *stream, int *accum, int *count, int *stalls, int *resets); + #endif /* READ_STREAM_H */
INDEX_SCAN_MAX_BATCHES 64 QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191305 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191305 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 271.915 Prefetch Stalls: 3 Prefetch Resets: 3 Buffers: shared read=191305 Planning: Buffers: shared hit=46 read=22 Planning Time: 3.996 ms Execution Time: 3593.458 ms (14 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191305 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191305 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 271.915 Prefetch Stalls: 3 Prefetch Resets: 3 Buffers: shared read=191305 Planning: Buffers: shared hit=61 read=25 Planning Time: 4.211 ms Execution Time: 5163.463 ms (14 rows) INDEX_SCAN_MAX_BATCHES 32 QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191290 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191290 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 58.670 Prefetch Stalls: 122939 Prefetch Resets: 653 Buffers: shared read=191290 Planning: Buffers: shared hit=46 read=22 Planning Time: 3.797 ms Execution Time: 3511.968 ms (14 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191290 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191290 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 58.670 Prefetch Stalls: 122939 Prefetch Resets: 653 Buffers: shared read=191290 Planning: Buffers: shared hit=61 read=25 Planning Time: 4.108 ms Execution Time: 5150.282 ms (14 rows) INDEX_SCAN_MAX_BATCHES 16 QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191271 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191271 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 35.562 Prefetch Stalls: 108101 Prefetch Resets: 1190 Buffers: shared read=191271 Planning: Buffers: shared hit=46 read=22 Planning Time: 3.965 ms Execution Time: 3568.817 ms (14 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191271 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191271 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 35.562 Prefetch Stalls: 108101 Prefetch Resets: 1190 Buffers: shared read=191271 Planning: Buffers: shared hit=61 read=25 Planning Time: 4.371 ms Execution Time: 5108.845 ms (14 rows) INDEX_SCAN_MAX_BATCHES 8 QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191270 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191270 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 21.344 Prefetch Stalls: 98775 Prefetch Resets: 2104 Buffers: shared read=191270 Planning: Buffers: shared hit=46 read=22 Planning Time: 3.706 ms Execution Time: 3646.034 ms (14 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191270 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191270 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 21.344 Prefetch Stalls: 98775 Prefetch Resets: 2104 Buffers: shared read=191270 Planning: Buffers: shared hit=61 read=25 Planning Time: 4.280 ms Execution Time: 5198.426 ms (14 rows) INDEX_SCAN_MAX_BATCHES 4 QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191266 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191266 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 10.688 Prefetch Stalls: 95627 Prefetch Resets: 4556 Buffers: shared read=191266 Planning: Buffers: shared hit=46 read=22 Planning Time: 3.799 ms Execution Time: 3784.910 ms (14 rows) QUERY PLAN ---------------------------------------------------------------------------------------------------------- Aggregate (actual rows=1.00 loops=1) Buffers: shared read=191266 -> Limit (actual rows=10000000.00 loops=1) Buffers: shared read=191266 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) Index Searches: 1 Prefetch Distance: 10.688 Prefetch Stalls: 95627 Prefetch Resets: 4556 Buffers: shared read=191266 Planning: Buffers: shared hit=61 read=25 Planning Time: 4.330 ms Execution Time: 5195.308 ms (14 rows)