Hi,

When looking at [1], I realized we may have a side effect when removing
redundant columns in the GROUP BY clause. Suppose we have a query with
ORDER BY 'b', and meanwhile column 'b' is also a group key. If we decide
that 'b' is redundant due to being functionally dependent on other GROUP
BY columns, we would remove it from group keys. This will make us lose
the opportunity to leverage the index on 'b'.

Here is an example for illustration.

# create table t (a int primary key, b int);
# insert into t select i, i%1000 from generate_series(1,1000000)i;
# create index on t(b);

By default, we remove 'b' from group keys and generate a plan as below:

# explain (costs off) select b from t group by a, b order by b limit 10;
                   QUERY PLAN
------------------------------------------------
 Limit
   ->  Sort
         Sort Key: b
         ->  Group
               Group Key: a
               ->  Index Scan using t_pkey on t
(6 rows)

The index on 'b' is not being used and we'll have to retrieve all the
data underneath to perform the sort work.

On the other hand, if we keep 'b' as a group column, we can get such a
plan as:

# explain (costs off) select b from t group by a, b order by b limit 10;
                   QUERY PLAN
-------------------------------------------------
 Limit
   ->  Group
         Group Key: b, a
         ->  Incremental Sort
               Sort Key: b, a
               Presorted Key: b
               ->  Index Scan using t_b_idx on t
(7 rows)

With the help of 't_b_idx', we can avoid the full scan on 't' and it
would run much faster.

Any thoughts?

[1]
https://www.postgresql.org/message-id/flat/16869-26346b77d6ccaeec%40postgresql.org

Thanks
Richard

Reply via email to