The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested
A fairly common planning problem for us is what we call "most recent first" queries; i.e., "the 50 most recent <table> rows for a <foreign key>". Here's a basic setup:

-- created_at has very high cardinality
create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
-- technically this data guarantees unique created_at values,
-- but there's no reason it couldn't be modified to have a few
-- random non-unique values to prove the point
insert into foo(owner_fk, created_at)
select i % 100, now() - (i::text || ' minutes')::interval
from generate_series(1, 1000000) t(i);

And here's the naive query to get the results we want:

select *
from foo
where owner_fk = 23
-- pk is only here to disambiguate/guarantee a stable sort
-- in the rare case that there are collisions in the other
-- sort field(s)
order by created_at desc, pk desc
limit 50;

On stock Postgres this ends up being pretty terrible for cases where the fk filter matches a large number of rows, because the planner generates a sort node under the limit node and therefore fetches all matches, sorts them, and only then applies the limit.
Here's the plan:

Limit  (cost=61386.12..61391.95 rows=50 width=16) (actual time=187.814..191.653 rows=50 loops=1)
  ->  Gather Merge  (cost=61386.12..70979.59 rows=82224 width=16) (actual time=187.813..191.647 rows=50 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=60386.10..60488.88 rows=41112 width=16) (actual time=185.639..185.642 rows=42 loops=3)
              Sort Key: created_at DESC, pk DESC
              Sort Method: top-N heapsort  Memory: 27kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 27kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 27kB
              ->  Parallel Bitmap Heap Scan on foo  (cost=3345.24..59020.38 rows=41112 width=16) (actual time=25.150..181.804 rows=33333 loops=3)
                    Recheck Cond: (owner_fk = 23)
                    Heap Blocks: exact=18014
                    ->  Bitmap Index Scan on idx_foo_on_owner_and_created_at  (cost=0.00..3320.57 rows=98668 width=0) (actual time=16.992..16.992 rows=100000 loops=1)
                          Index Cond: (owner_fk = 23)
Planning Time: 0.384 ms
Execution Time: 191.704 ms

I have a recursive CTE that implements this algorithm:

- Find the first n+1 results.
- If the created_at value of result n+1 differs from that of result n, return the first n values.
- If they are equal, gather more results until a new created_at value is encountered.
- Sort all results by created_at and a tie-breaker (e.g., pk) and return the first n values.

But nobody wants to write/use that normally (it's quite complex).
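For the archives, a rough sketch of the kind of recursive CTE this takes is below. This is not the exact query from our codebase; it's a simplified illustration against the foo table from the setup above, walking distinct created_at values backward and accumulating a running row count until at least 50 rows are covered, then letting a final sort and limit trim the last (possibly tied) group:

```sql
-- Sketch only: walk distinct created_at values for one owner, newest
-- first, carrying a running count (n) of rows in all groups already
-- passed. Recursion stops once the passed groups cover >= 50 rows, so
-- the last visited group is guaranteed to contain row 50 and any ties.
with recursive walk(created_at, n) as (
  select max(created_at), 0
  from foo
  where owner_fk = 23
  union all
  select (select max(f.created_at) from foo f
          where f.owner_fk = 23 and f.created_at < w.created_at),
         w.n + (select count(*) from foo f
                where f.owner_fk = 23 and f.created_at = w.created_at)
  from walk w
  where w.created_at is not null and w.n < 50
)
-- Re-fetch the rows in the visited groups and apply the real sort/limit.
select f.*
from foo f
join walk w on f.created_at = w.created_at
where f.owner_fk = 23
order by f.created_at desc, f.pk desc
limit 50;
```

Workable, but clearly not something an application developer should have to hand-roll for every "most recent first" query.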
This patch solves the problem presented; here's the plan:

Limit  (cost=2.70..2.76 rows=50 width=16) (actual time=0.233..0.367 rows=50 loops=1)
  ->  Incremental Sort  (cost=2.70..111.72 rows=98668 width=16) (actual time=0.232..0.362 rows=50 loops=1)
        Sort Key: created_at DESC, pk DESC
        Presorted Key: created_at
        Sort Method: quicksort  Memory: 26kB
        Sort Groups: 2
        ->  Index Scan Backward using idx_foo_on_owner_and_created_at on foo  (cost=0.56..210640.79 rows=98668 width=16) (actual time=0.054..0.299 rows=65 loops=1)
              Index Cond: (owner_fk = 23)
Planning Time: 0.428 ms
Execution Time: 0.393 ms

While installcheck-world fails, the only failure appears to be a plan output change in test/isolation/expected/drop-index-concurrently-1.out that just needs to be updated (incremental sort is now used in that plan); I don't see any functional breakage.