Hi Hackers,

Do you think that MERGE-SCAN was a terrible name? I wanted a name that
wouldn't require much explanation. I named it this way because it relies
on a k-way merge to combine several segments of an index into one result.
But we already have a MERGE statement, and even in the example plans
quoted below we can see an external merge sort that has nothing to do
with the new feature. Now that I am working on joins, I started
implementing this in the NestedLoop, trying to follow the same conditions
that lead to a Memoize. But I have added so many fields to the NestedLoop
state that I think it is better to have a separate structure, and maybe a
separate node, and MergeScan of course is taken hehe. I was thinking of
IndexPrefixMerge. We could use Ants' nickname, TimeLineScan, but of
course it is not limited to timelines (even though realistically, that
will probably be the most common use of this). Another name I considered
was TransposedIndexScan, because it orders output on (suffix, prefix)
instead of (prefix, suffix).
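To make the naming rationale concrete, here is a toy sketch of the k-way
merge the scan performs (Python, purely illustrative, nothing like the
actual executor code): one sorted stream per prefix value, merged into a
single stream ordered on the suffix. The (suffix, prefix) tuples at the
end are the "transposed" order behind the TransposedIndexScan idea.

```python
import heapq

# Toy model: each prefix value (think "a" in WHERE a IN (...)) yields index
# tuples already sorted on the suffix (think "b" in ORDER BY b). These names
# are hypothetical; they just mirror the example query in this thread.
def prefix_stream(prefix, suffixes):
    # One btree descent per prefix: tuples come back sorted on the suffix.
    return ((suffix, prefix) for suffix in sorted(suffixes))

streams = [
    prefix_stream(2, [5, 1, 9]),
    prefix_stream(3, [4, 2]),
    prefix_stream(5, [3]),
]

# k-way merge: output is ordered on (suffix, prefix), the transposed order,
# so an ORDER BY suffix LIMIT n can stop early without a sort node.
merged = heapq.merge(*streams)
first_three = [next(merged) for _ in range(3)]
print(first_three)  # [(1, 2), (2, 3), (3, 5)]
```

Only three tuples are pulled from the merge to satisfy a LIMIT 3, which is
the whole point of avoiding the top-N sort.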




On Fri, Feb 6, 2026 at 10:52 AM Alexandre Felipe <
[email protected]> wrote:

> Hello again hackers!
>
> [email protected] <[email protected]>: that seems to be the person who is
> probably the most familiar with the index scan code (based on the commits).
> [email protected] <[email protected]>, [email protected]
> <[email protected]>, [email protected] <[email protected]> are the
> top 3 committers to nbtree over the last ~6 months.
>
> I have made substantial progress on adding a few features. I have
> questions, but I will let you go first :)
>
> Motivation:
> *In technical terms:* this proposal takes advantage of a btree index
> when the query is filtered by a few distinct prefixes, ordered by a
> suffix, and has a limit.
> *In non-technical terms:* this could help efficiently render a social
> network feed, where each user selects a list of users whose posts they
> want to see, and the posts must be ordered from newest to oldest.
>
>
> *Performance Comparison*
> I did a test with a toy table; please find more details below.
>
> With limit 100
>
> | Method     | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge      |         13 |         119 |     13 ms |
> | Sequential |     15,308 |     525,310 |  3,409 ms |
>
> With limit 1,000,000
>
> | Method     | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
> |------------|-----------:|------------:|----------:|-------------:|----------:|
> | Merge      |    980,318 |      19,721 |         0 |            0 |  2,128 ms |
> | Sequential |     15,208 |     525,410 |    20,207 |       35,384 |  3,762 ms |
> | Bitmap     |        629 |     113,759 |    20,207 |       35,385 |  5,487 ms |
> | IndexScan  |  7,880,619 |     126,706 |    20,945 |       35,386 |  5,874 ms |
>
> Sequential scans and bitmap scans significantly reduce the number of
> accessed buffers in this case, because the table has only four integer
> columns and these methods can read all the rows on a given page at once.
>
> However, that comes at the cost of resorting to an on-disk sort method.
> For the query with limit 100 we get no temp files, as we are using a
> top-100 sort.
>
> `make check` passes.
>
>
> *Experiment details*
>
> Consider a 100M-row table with (a, b, c, d) \in {1..100}^4:
>
>
> ```sql
> CREATE TABLE grid AS (
>     SELECT a, b, c, d FROM
>         generate_series(1, 100) AS a,
>         generate_series(1, 100) AS b,
>         generate_series(1, 100) AS c,
>         generate_series(1, 100) AS d
> );
>
> CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
> ANALYSE grid;
> ```
>
> Now let's say that we need to find a certain number of rows filtered by a
> and ordered by b:
> ```sql
> PREPARE grid_query(int) AS
> SELECT sum(d) FROM (
>     SELECT * FROM grid
>     WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
>     ORDER BY b
>     LIMIT $1) t;
> ```
>
> ---
>
>
> Now with limit 100, using the index merge scan (notice Index Prefixes in
> the plan):
>
> ```sql
> SET enable_indexmergescan = on;
> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
> ```
>
> ```text
>    Buffers: shared hit=13 read=119
>    ->  Limit  (cost=0.57..87.29 rows=100 width=16) (actual
> time=5.528..12.999 rows=100.00 loops=1)
>          Buffers: shared hit=13 read=119
>          ->  Index Scan using grid_a_b_c_idx on grid  (cost=0.57..93.36
> rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
>                Index Cond: (b >= 0)
>                *Index Prefixes: *(a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[]))
>                Index Searches: 8
>                Buffers: shared hit=13 read=119
>  Planning:
>    Buffers: shared hit=59 read=23
>  Planning Time: 4.619 ms
>  Execution Time: 13.055 ms
>  ```
>
>
> ```sql
> SET enable_indexmergescan = off;
> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
> ```
>
> ```text
>  Aggregate  (cost=1603588.06..1603588.07 rows=1 width=8) (actual
> time=3406.624..3408.710 rows=1.00 loops=1)
>    Buffers: shared hit=15308 read=525310
>    ->  Limit  (cost=1603575.17..1603586.81 rows=100 width=16) (actual
> time=3406.601..3408.702 rows=100.00 loops=1)
>          Buffers: shared hit=15308 read=525310
>          ->  Gather Merge  (cost=1603575.17..2514342.92 rows=7819999
> width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=15308 read=525310
>                ->  Sort  (cost=1602575.14..1610720.98 rows=3258333
> width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
>                      Sort Key: grid.b
>                      Sort Method: top-N heapsort  Memory: 32kB
>                      Buffers: shared hit=15308 read=525310
>                      Worker 0:  Sort Method: top-N heapsort  Memory: 32kB
>                      Worker 1:  Sort Method: top-N heapsort  Memory: 32kB
>                      ->  *Parallel Seq Scan* on grid
>  (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896
> rows=2666666.67 loops=3)
>                            Filter: ((b >= 0) AND (a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])))
>                            Rows Removed by Filter: 30666667
>                            Buffers: shared hit=15234 read=525310
>  Planning Time: 0.370 ms
>  Execution Time: 3409.134 ms
>  ```
>
> Now queries with limit 1,000,000
>
> ```sql
> SET enable_indexmergescan = on;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
>
> The query executed with the proposed access method; notice the Index
> Prefixes and Index Cond lines in the plan.
> ```text
>    Buffers: shared hit=980318 read=19721
>    ->  Limit  (cost=0.57..867259.84 rows=1000000 width=16) (actual
> time=2.854..2103.438 rows=1000000.00 loops=1)
>          Buffers: shared hit=980318 read=19721
>          ->  Index Scan using grid_a_b_c_idx on grid
>  (cost=0.57..867265.91 rows=1000007 width=16) (actual time=2.852..2066.205
> rows=1000000.00 loops=1)
>                Index Cond: (b >= 0)
>                *Index Prefixes:* (a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[]))
>                Index Searches: 8
>                Buffers: shared hit=980318 read=19721
>  Planning Time: 0.328 ms
>  Execution Time: 2127.811 ms
>  ```
>
> If we disable indexmergescan, we naturally fall back to a sequential scan.
>
> ```sql
> SET enable_indexmergescan = off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
>    Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
>    ->  Limit  (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual
> time=3467.012..3712.044 rows=1000000.00 loops=1)
>          Buffers: shared hit=15208 read=525410, temp read=20207
> written=35384
>          ->  Gather Merge  (cost=1942895.64..2853663.39 rows=7819999
> width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=15208 read=525410, temp read=20207
> written=35384
>                ->  Sort  (cost=1941895.62..1950041.45 rows=3258333
> width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
>                      Sort Key: grid.b
>                      Sort Method: *external merge  Disk: 47016kB*
>                      Buffers: shared hit=15208 read=525410, temp
> read=20207 written=35384
>                      Worker 0:  Sort Method: external merge  Disk: 46976kB
>                      Worker 1:  Sort Method: external merge  Disk: 47000kB
>                      ->  *Parallel Seq Scan* on grid
>  (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483
> rows=2666666.67 loops=3)
>                            Filter: ((b >= 0) AND (a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])))
>                            Rows Removed by Filter: 30666667
>                            Buffers: shared hit=15134 read=525410
>  Planning Time: 0.332 ms
>  Execution Time: 3761.866 ms
> ```
>
> If we disable sequential scans, then we get a bitmap scan:
>
> ```sql
> SET enable_seqscan = off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
>    Buffers: shared hit=629 read=113759 written=2, temp read=20207
> written=35385
>    ->  Limit  (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual
> time=5170.456..5453.433 rows=1000000.00 loops=1)
>          Buffers: shared hit=629 read=113759 written=2, temp read=20207
> written=35385
>          ->  Gather Merge  (cost=1998199.78..2908967.53 rows=7819999
> width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=629 read=113759 written=2, temp
> read=20207 written=35385
>                ->  Sort  (cost=1997199.75..2005345.59 rows=3258333
> width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
>                      Sort Key: grid.b
>                      Sort Method: external merge  Disk: 47032kB
>                      Buffers: shared hit=629 read=113759 written=2, temp
> read=20207 written=35385
>                      Worker 0:  Sort Method: external merge  Disk: 47280kB
>                      Worker 1:  Sort Method: external merge  Disk: 46680kB
>                      ->  Parallel Bitmap Heap Scan on grid
>  (cost=107691.54..1533348.13 rows=3258333 width=16) (actual
> time=299.891..4489.787 rows=2666666.67 loops=3)
>                            Recheck Cond: ((a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>                            Rows *Removed by Index Recheck*: 2410242
>                            Heap Blocks: exact=13100 lossy=22639
>                            Buffers: shared hit=615 read=113759 written=2
>                            Worker 0:  Heap Blocks: exact=13077 lossy=22755
>                            Worker 1:  Heap Blocks: exact=13036 lossy=22421
>                            ->  *Bitmap Index Scan* on grid_a_b_c_idx
>  (cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651
> rows=8000000.00 loops=1)
>                                  Index Cond: ((a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>                                  Index Searches: 7
>                                  Buffers: shared hit=13 read=7293 written=2
>  Planning Time: 0.165 ms
>  Execution Time: 5487.213 ms
> ```
>
> If we disable bitmap scans, we finally get an index scan:
>
> ```sql
> SET enable_bitmapscan = off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
>    Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
>    ->  Limit  (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual
> time=4414.478..4674.400 rows=1000000.00 loops=1)
>          Buffers: shared hit=7883221 read=124111, temp read=20699
> written=35385
>          ->  Gather Merge  (cost=7201203.08..8111970.83 rows=7819999
> width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=7883221 read=124111, temp read=20699
> written=35385
>                ->  Sort  (cost=7200203.05..7208348.88 rows=3258333
> width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
>                      Sort Key: grid.b
>                      Sort Method: *external merge  Disk: 47304kB*
>                      Buffers: shared hit=7883221 read=124111, temp
> read=20699 written=35385
>                      Worker 0:  Sort Method: external merge  Disk: 47304kB
>                      Worker 1:  Sort Method: external merge  Disk: 46384kB
>                      ->  *Parallel Index Scan* using grid_a_b_c_idx on
> grid  (cost=0.57..6736351.43 rows=3258333 width=16) (actual
> time=46.925..3796.915 rows=2666666.67 loops=3)
>                            Index Cond: ((a = ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>                            Index Searches: 7
>                            Buffers: shared hit=7883208 read=124110
>  Planning Time: 0.385 ms
>  Execution Time: 4713.325 ms
>  ```
>
>
>
>
>
>
> On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
> [email protected]> wrote:
>
>> Thank you for looking into this.
>>
>> Now we can execute a, still narrow, family of queries!
>>
>> Maybe it helps to see this as a *social network feed*. Imagine a social
>> network: you have a few friends, or follow a few people, and you want to
>> see their updates ordered by date. For each user we have a different
>> combination of users whose posts we have to display. But even with
>> hundreds of followed users, we may only show the first 10.
>>
>> There is a low-hanging fruit in the skip scan: if we need M rows and one
>> group has already produced M rows, we can stop scanning that group.
>> Let Nx be the number of friends, M the number of posts to show (the
>> limit), and N the number of posts per friend.
>> With that cutoff, the skip scan reads Nx * M rows followed by an Nx * M
>> sort, instead of Nx * N rows followed by an Nx * N sort.
>> With M = 10 and N = 1000 this is a significant improvement.
>> But the merge scan runs with only M + Nx row accesses and M + Nx heap
>> operations, so it still wins when M ~ N.
>> If everything were on the same page, the skip scan would win.
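>> To make that arithmetic concrete, a back-of-envelope sketch (Python used
>> purely for illustration; the numbers are hypothetical):
>>
>> ```python
>> # Rough row-access counts for the strategies above. Illustrative
>> # arithmetic only; real costs depend on heap locality and page layout.
>> def skip_scan(n_friends, rows_per_friend):
>>     # naive: read every row of every followed account, then top-M sort
>>     return n_friends * rows_per_friend
>>
>> def skip_scan_cutoff(n_friends, limit):
>>     # with the early stop: no group needs more than `limit` rows
>>     return n_friends * limit
>>
>> def merge_scan(n_friends, limit):
>>     # one btree descent per friend, plus the rows actually returned
>>     return n_friends + limit
>>
>> Nx, M, N = 100, 10, 1000   # friends, LIMIT, posts per friend
>> print(skip_scan(Nx, N))        # 100000 rows
>> print(skip_scan_cutoff(Nx, M)) # 1000 rows
>> print(merge_scan(Nx, M))       # 110 rows
>> ```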
>>
>> The cost estimation is probably far off.
>> I am also not considering the filters applied after this operator, and I
>> don't know if the planner infrastructure is able to adjust for that by
>> itself. This is where I would like reviewers' feedback. I think that the
>> planner costs are something to be determined experimentally.
>>
>> Next I will make it slightly more general, handling:
>> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
>> ORDER BY b LIMIT N (ignoring s...)
>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
>> * Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
>> (a, b, c)
>>
>> ---
>> Kind Regards,
>> Alexandre
>>
>> On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <[email protected]> wrote:
>>
>>>
>>>
>>> On 3 Feb 2026, at 22:42, Ants Aasma <[email protected]> wrote:
>>>
>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <[email protected]> wrote:
>>>
>>> I'm also wondering how common is the targeted query pattern? How common
>>> it is to have an IN condition on the leading column in an index, and
>>> ORDER BY on the second one?
>>>
>>>
>>> I have seen this pattern multiple times. My nickname for it is the
>>> timeline view. Think of the social media timeline, showing posts from
>>> all followed accounts in timestamp order, returned in reasonably sized
>>> batches. The naive SQL query will have to scan all posts from all
>>> followed accounts and pass them through a top-N sort. When the total
>>> number of posts is much larger than the batch size this is much slower
>>> than what is proposed here (assuming I understand it correctly) -
>>> effectively equivalent to running N index scans through Merge Append.
>>>
>>>
>>> The workarounds I have proposed to users have been either to rewrite the
>>> query as a UNION ALL of a set of single-value prefix queries, each
>>> wrapped in an ORDER BY ... LIMIT. This gives the exact merge-append plan
>>> shape that is needed. But repeating the query N times can get unwieldy
>>> when the number of values grows, so the fallback is:
>>>
>>> SELECT * FROM unnest(:friends) id, LATERAL (
>>>    SELECT * FROM posts
>>>    WHERE user_id = id
>>>    ORDER BY tstamp DESC LIMIT 100) p
>>> ORDER BY tstamp DESC LIMIT 100;
>>>
>>> The downside of this formulation is that we still have to fetch a
>>> batch worth of items from scans where we otherwise would have only had
>>> to look at one index tuple.
>>>
>>>
>>> GiST can be used to handle this kind of query, as it supports multiple
>>> sort orders.
>>> The only problem is that GiST does not support ORDER BY column.
>>> One possible workaround is [1], but as described there it does not play
>>> well with partitioning.
>>> I’ve started drafting support for ORDER BY column in GiST - see [2].
>>> I think it would be easier to implement and maintain than a new IAM (but
>>> I don’t have enough knowledge and experience to implement it myself).
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
>>> [2]
>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>>>
>>> —
>>> Kind regards,
>>> Michal
>>>
>>
