Hi hackers,

The cost of an Aggregate node seems to be the same regardless of the input being pre-sorted or not. If however the input is not sorted, the Aggregate node must additionally perform a sort which can impact runtime significantly. Here is an example:

CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000), 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual time=1684.705..1684.809 rows=1 loops=1)    ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71) (actual time=0.018..343.280 rows=100000 loops=1)
 Planning Time: 0.317 ms
 Execution Time: 1685.543 ms


-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual time=819.028..819.041 rows=1 loops=1)    ->  Index Only Scan using foo_idx on foo (cost=6506.42..6506.42 rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
         Heap Fetches: 100000
         Heap Prefetches: 1334
 Planning Time: 0.438 ms
 Execution Time: 819.515 ms

The cost of the Aggregate node is in both cases the same (250.0) while its runtime is 1.3s in the unsorted case and 0.4s in the pre-sorted case.

Also, why does the Aggregate node sort itself? Why don't we instead emit an explicit Sort node in the plan so that it's visible to the user what is happening? As soon as there's also a GROUP BY in the query, a Sort node occurs in the plan. This seems inconsistent.

--
David Geier
(ServiceNow)



Reply via email to