The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

A fairly common planning problem for us is what we call "most recent first" 
queries; i.e., "the 50 most recent <table> rows for a <foreign key>".

Here's a basic setup:

-- created_at has very high cardinality
create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);

-- note: as generated, the created_at values are all unique,
-- but the data could easily be modified to include a few
-- random duplicates to prove the point
insert into foo(owner_fk, created_at)
  select i % 100, now() - (i::text || ' minutes')::interval
  from generate_series(1, 1000000) t(i);


And here's the naive query to get the results we want:

select *
from foo
where owner_fk = 23
-- pk is only here to disambiguate/guarantee a stable sort
-- in the rare case that there are collisions in the other
-- sort field(s)
order by created_at desc, pk desc
limit 50;


On stock Postgres this ends up being pretty terrible for cases where the fk 
filter represents a large number of rows, because the planner generates a sort 
node under the limit node and therefore fetches all matches, sorts them, and 
then applies the limit. Here's the plan:

 Limit  (cost=61386.12..61391.95 rows=50 width=16) (actual time=187.814..191.653 rows=50 loops=1)
   ->  Gather Merge  (cost=61386.12..70979.59 rows=82224 width=16) (actual time=187.813..191.647 rows=50 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=60386.10..60488.88 rows=41112 width=16) (actual time=185.639..185.642 rows=42 loops=3)
               Sort Key: created_at DESC, pk DESC
               Sort Method: top-N heapsort  Memory: 27kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 27kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 27kB
               ->  Parallel Bitmap Heap Scan on foo  (cost=3345.24..59020.38 rows=41112 width=16) (actual time=25.150..181.804 rows=33333 loops=3)
                     Recheck Cond: (owner_fk = 23)
                     Heap Blocks: exact=18014
                     ->  Bitmap Index Scan on idx_foo_on_owner_and_created_at  (cost=0.00..3320.57 rows=98668 width=0) (actual time=16.992..16.992 rows=100000 loops=1)
                           Index Cond: (owner_fk = 23)
 Planning Time: 0.384 ms
 Execution Time: 191.704 ms


I have a recursive CTE that implements the following algorithm:
- Find the first n+1 results.
- If the created_at value of result n+1 differs from that of result n, return the first n results.
- If they are equal, keep fetching results until a new created_at value is encountered.
- Sort all fetched results by created_at and a tie-breaker (e.g., pk) and return the first n.

But nobody wants to write or use that by hand; it's quite complex.
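The review doesn't include the CTE itself. For illustration only, here is a minimal sketch (not the reviewer's query) of the same guarantee expressed without recursion: over-fetch the first n+1 created_at values via the index, use the smallest as a boundary that safely covers any ties, then sort only that small candidate set:

```sql
-- Sketch of the workaround, assuming the foo table and index
-- defined above; the CTE name "boundary" is illustrative.
with boundary as (
  -- the 51st-largest created_at for this owner; the index on
  -- (owner_fk, created_at) satisfies this without a full sort
  select min(created_at) as cutoff
  from (select created_at
        from foo
        where owner_fk = 23
        order by created_at desc
        limit 51) t   -- n + 1
)
select f.*
from foo f, boundary b
where f.owner_fk = 23
  and f.created_at >= b.cutoff  -- >= keeps any ties at the cutoff
order by f.created_at desc, f.pk desc
limit 50;
```

Every row in the true top 50 under (created_at desc, pk desc) has a created_at among the 50 largest values, so the >= cutoff candidate set necessarily contains all of them; sorting that handful of rows is cheap. It still takes two index probes and is far less readable than letting the planner handle it, which is the point of the patch.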

This patch solves the problem presented; here's the plan:

 Limit  (cost=2.70..2.76 rows=50 width=16) (actual time=0.233..0.367 rows=50 loops=1)
   ->  Incremental Sort  (cost=2.70..111.72 rows=98668 width=16) (actual time=0.232..0.362 rows=50 loops=1)
         Sort Key: created_at DESC, pk DESC
         Presorted Key: created_at
         Sort Method: quicksort  Memory: 26kB
         Sort Groups: 2
         ->  Index Scan Backward using idx_foo_on_owner_and_created_at on foo  (cost=0.56..210640.79 rows=98668 width=16) (actual time=0.054..0.299 rows=65 loops=1)
               Index Cond: (owner_fk = 23)
 Planning Time: 0.428 ms
 Execution Time: 0.393 ms


While installcheck-world fails, the only failure appears to be a plan output change in test/isolation/expected/drop-index-concurrently-1.out that just needs to be updated (incremental sort is now used in that plan); I don't see any functionality breakage.
