On Wed, 27 Nov 2024 at 19:51, jian he <jian.universal...@gmail.com> wrote:
> v3 attached. in v3:

1. I find the following quite strange:

postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate  (cost=37.75..56.25 rows=1850 width=8)
   Group Key: b, c
   ->  Seq Scan on abcd  (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate  (cost=33.12..51.62 rows=1850 width=8)
   Group Key: b
   ->  Seq Scan on abcd  (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key? I don't really see why the
primary key should be treated specially. Why not just unify the logic
and collect up all unique, non-expression, non-partial indexes where
all columns are NOT NULL? You could maybe just add a special case to
skip the NOT NULL checking for indexes with indisprimary set. (There's
a rough sketch of what I mean at the end of this mail.)

2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call. (See the second sketch below.)

3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much simpler to find the
matching unique index with the fewest columns and use that one. Just
keep a counter recording the bms_num_members() of the columns of the
best match so far and replace the best match whenever you find an
index with fewer columns. Please see
adjust_group_pathkeys_for_groupagg() for inspiration, and the final
sketch below.

We also need to think about whether the unique index with the fewest
columns is always the best one to use. Perhaps the index with the
fewest columns is on a varlena column and there's some other index on,
say, two INT4 columns. It would be much cheaper to hash a couple of
INT4s than one long varlena value. Maybe it's ok just to leave an XXX
comment explaining that we might want to think about doing that one
day. Alternatively, summing up the pg_statistic.stawidth values of
each index's columns could work. Likely it's better to make the patch
work correctly before thinking about that part.

The patch could also use a spell check:

"a input" -> "an input"
"soure" -> "source"?
"GROUOP BY" -> "GROUP BY"

David
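
P.S. For #1, here's roughly the shape I had in mind. It's a completely
untested sketch -- it just borrows the patch's
get_unique_not_null_attnos() name, and the locking and validity checks
are only a guess at what's needed:

static List *
get_unique_not_null_attnos(Oid relid)
{
    Relation    rel = table_open(relid, AccessShareLock);
    List       *result = NIL;
    ListCell   *lc;

    foreach(lc, RelationGetIndexList(rel))
    {
        Relation    indexRel = index_open(lfirst_oid(lc), AccessShareLock);
        Form_pg_index index = indexRel->rd_index;
        Bitmapset  *attnos = NULL;
        bool        usable;

        /* valid, immediate unique index without expressions or predicate */
        usable = index->indisunique &&
                 index->indimmediate &&
                 index->indisvalid &&
                 RelationGetIndexExpressions(indexRel) == NIL &&
                 RelationGetIndexPredicate(indexRel) == NIL;

        for (int i = 0; usable && i < index->indnkeyatts; i++)
        {
            AttrNumber  attnum = index->indkey.values[i];
            Form_pg_attribute att = TupleDescAttr(RelationGetDescr(rel),
                                                  attnum - 1);

            /* special case: skip the NOT NULL check for primary keys */
            if (!index->indisprimary && !att->attnotnull)
                usable = false;
            else
                attnos = bms_add_member(attnos, attnum);
        }

        if (usable)
            result = lappend(result, attnos);
        else
            bms_free(attnos);

        index_close(indexRel, AccessShareLock);
    }

    table_close(rel, AccessShareLock);
    return result;
}

That returns one Bitmapset of key attnums per usable index, primary
key or not, so the rest of the code doesn't need a separate PK path.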
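
For #2, the usual idiom looks something like this (attnums can be
negative for system columns, hence the offset):

    /* build the set once, offsetting so negative attnums fit */
    not_null_attnos = bms_add_member(not_null_attnos,
                                     attnum - FirstLowInvalidHeapAttributeNumber);

    /* ... each probe is then a cheap bitmap test, not a List scan */
    if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
                      not_null_attnos))
        /* column is known NOT NULL */ ;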
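
And for #3, the selection loop can be about as simple as this, where
unique_attnos is the List of per-index Bitmapsets from the first
sketch and groupby_attnos is the set of attnums mentioned in the GROUP
BY clause (both names invented here):

    Bitmapset  *best_attnos = NULL;
    int         best_num_members = INT_MAX;
    ListCell   *lc;

    foreach(lc, unique_attnos)
    {
        Bitmapset  *attnos = (Bitmapset *) lfirst(lc);
        int         num_members;

        /* only usable if the GROUP BY contains all the index columns */
        if (!bms_is_subset(attnos, groupby_attnos))
            continue;

        num_members = bms_num_members(attnos);
        if (num_members < best_num_members)
        {
            best_attnos = attnos;
            best_num_members = num_members;
        }
    }

If best_attnos ends up non-NULL, any GROUP BY column not in it can be
removed.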