Maciek Sakrejda <mac...@pganalyze.com> writes:
> In a blog post [1], Bruce Momjian notes that expression indexes can
> help with planning even if they're not used directly. But the examples
> cited in that post are vague (i.e., they improve stats, but it's not
> clear how they could change plans), and Bruce's answer to a comment
> [2] suggests that this is not documented.

> Is there any more info on this mechanism? Specifically, if one has
> unused expression indexes (according to pg_stat_user_indexes), is it
> safe to drop them? Or could they be providing statistics that
> materially affect query planning even though the indexes themselves
> are unused?

Expression indexes definitely can affect planning, because ANALYZE
collects stats on the values of those expressions.  As a trivial
example,

regression=# create table foo (x1 float8);
CREATE TABLE
regression=# insert into foo select 10 * random() from generate_series(1,10000);
INSERT 0 10000
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..195.00 rows=3333 width=8) (actual time=0.009..0.546 rows=1028 loops=1)
   Filter: (sqrt(x1) < '1'::double precision)
   Rows Removed by Filter: 8972
 Planning Time: 0.065 ms
 Execution Time: 0.572 ms
(5 rows)

The planner has no info about the values of sqrt(x1), so you get a
default estimate (one-third) of the selectivity of the WHERE clause.
But watch this:

regression=# create index on foo (sqrt(x1));
CREATE INDEX
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1)
   Recheck Cond: (sqrt(x1) < '1'::double precision)
   Heap Blocks: exact=45
   ->  Bitmap Index Scan on foo_sqrt_idx  (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1)
         Index Cond: (sqrt(x1) < '1'::double precision)
 Planning Time: 0.113 ms
 Execution Time: 0.259 ms
(7 rows)

Now there are stats about the values of sqrt(x1), allowing a far more
accurate selectivity estimate to be made.  In this particular example
there's no change of plan (it would have used the index anyway), but
certainly a different rowcount estimate can make a big difference.
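
(The expression's stats end up stored under the index's name, so a
quick way to confirm that ANALYZE gathered them is to look at the
pg_stats view for the index.  This should return one row for the
sqrt(x1) expression; the exact values will of course vary with the
random data:

regression=# select attname, n_distinct, histogram_bounds is not null as has_histogram
regression-# from pg_stats where tablename = 'foo_sqrt_idx';
)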

This mechanism is quite ancient, and in principle it's now superseded
by extended statistics.  For example, I can drop this index and
instead do

regression=# drop index foo_sqrt_idx;
DROP INDEX
regression=# create statistics foostats on sqrt(x1) from foo;
CREATE STATISTICS
regression=# analyze foo;
ANALYZE
regression=# explain analyze select * from foo where sqrt(x1) < 1;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1)
   Filter: (sqrt(x1) < '1'::double precision)
   Rows Removed by Filter: 8972
 Planning Time: 0.079 ms
 Execution Time: 0.503 ms
(5 rows)

So the accurate rowcount estimate is still obtained in this example;
and we're not incurring any index maintenance costs, only ANALYZE
costs that are going to be roughly the same either way.

However, I am not certain that extended statistics are plugged into
all the places where the older mechanism applies.  Tomas Vondra might
have a better idea than I of where gaps remain in that.

			regards, tom lane