Hi,

On 2025-07-16 14:30:05 -0400, Peter Geoghegan wrote:
> On Wed, Jul 16, 2025 at 2:27 PM Andres Freund <and...@anarazel.de> wrote:
> > Could you share the current version of the complex patch (happy with a git
> > tree)? Afaict it hasn't been posted, which makes this pretty hard follow 
> > along
> > / provide feedback on, for others.
>
> Sure:
>
> https://github.com/petergeoghegan/postgres/tree/index-prefetch-2025-pg-revisions-v0.11
>
> I think that the version that Tomas must have used is a few days old,
> and might be a tiny bit different. But I don't think that that's
> likely to matter, especially not if you just want to get the general
> idea.

As a first thing I just wanted to get a feel for the improvements we can get.
I had a scale 5 tpch already loaded, so I ran a bogus query on that to see.

The improvement with either of the patchsets with a quick trial query is
rather impressive when using direct IO (presumably also with an empty cache,
but DIO is more predictable).

As Peter's branch doesn't seem to have an enable_* GUC, I used
SET effective_io_concurrency=0 to test the non-prefetching results (and
verified with master that the results are similar).

Test:

Peter's:

Without prefetching:

SET effective_io_concurrency=0;SELECT 
pg_buffercache_evict_relation('lineitem');EXPLAIN ANALYZE SELECT * FROM 
lineitem ORDER BY l_shipdate LIMIT 10000;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                     QUERY 
PLAN                                                                      │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.611..957.874 
rows=10000.00 loops=1)                                                 │
│   Buffers: shared hit=1213 read=8626                                          
                                                                      │
│   I/O Timings: shared read=943.344                                            
                                                                      │
│   ->  Index Scan using i_l_shipdate on lineitem  (cost=0.44..6994824.33 
rows=29999796 width=106) (actual time=0.611..956.593 rows=10000.00 loops=1) │
│         Index Searches: 1                                                     
                                                                      │
│         Buffers: shared hit=1213 read=8626                                    
                                                                      │
│         I/O Timings: shared read=943.344                                      
                                                                      │
│ Planning Time: 0.083 ms                                                       
                                                                      │
│ Execution Time: 958.508 ms                                                    
                                                                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)


With prefetching:

SET effective_io_concurrency=64;SELECT 
pg_buffercache_evict_relation('lineitem');EXPLAIN ANALYZE SELECT * FROM 
lineitem ORDER BY l_shipdate LIMIT 10000;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                     QUERY 
PLAN                                                                     │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.497..67.737 
rows=10000.00 loops=1)                                                 │
│   Buffers: shared hit=1227 read=8667                                          
                                                                     │
│   I/O Timings: shared read=48.473                                             
                                                                     │
│   ->  Index Scan using i_l_shipdate on lineitem  (cost=0.44..6994824.33 
rows=29999796 width=106) (actual time=0.496..66.471 rows=10000.00 loops=1) │
│         Index Searches: 1                                                     
                                                                     │
│         Buffers: shared hit=1227 read=8667                                    
                                                                     │
│         I/O Timings: shared read=48.473                                       
                                                                     │
│ Planning Time: 0.090 ms                                                       
                                                                     │
│ Execution Time: 68.965 ms                                                     
                                                                     │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Tomas':

With prefetching:

SET effective_io_concurrency=64;SELECT 
pg_buffercache_evict_relation('lineitem');EXPLAIN ANALYZE SELECT * FROM 
lineitem ORDER BY l_shipdate LIMIT 10000;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                     QUERY 
PLAN                                                                     │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.278..70.609 
rows=10000.00 loops=1)                                                 │
│   Buffers: shared hit=1227 read=8668                                          
                                                                     │
│   I/O Timings: shared read=52.578                                             
                                                                     │
│   ->  Index Scan using i_l_shipdate on lineitem  (cost=0.44..6994824.33 
rows=29999796 width=106) (actual time=0.277..69.304 rows=10000.00 loops=1) │
│         Index Searches: 1                                                     
                                                                     │
│         Buffers: shared hit=1227 read=8668                                    
                                                                     │
│         I/O Timings: shared read=52.578                                       
                                                                     │
│ Planning Time: 0.072 ms                                                       
                                                                     │
│ Execution Time: 71.549 ms                                                     
                                                                     │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

The wins are similar without DIO and a cold OS cache, but i don't like
emptying out the entire OS cache all the time...


I call that a hell of an impressive improvement with either patch - it's
really really hard to find order of magnitude improvements in anything close
to realistic cases.

And that's on a local reasonably fast NVMe - with networked storage we'll see
much bigger wins.

This also doesn't just repro with toy queries, e.g. TPCH Q02 shows a 2X
improvement too (with either patch) - the only reason it's not bigger is that
all the remaining IO time is on the inner side of a nestloop that isn't
currently prefetchable.


Peter, it'd be rather useful if your patch also had an enable/disable GUC,
otherwise it's more work to study the performance effects. The
effective_io_concurrency approach isn't great, because it also affects
bitmap scans, seqscans etc.


Just playing around, there are many cases where there is effectively no
difference between the two approaches, from a runtime perspective.  There,
unsurprisingly, are some where the complex approach clearly wins, mostly
around IN(list-of-constants) so far.


Looking at the actual patches now.


Greetings,

Andres Freund


Reply via email to