=?iso-8859-1?Q?Andr=E9_H=E4nsel?= <an...@webkr.de> writes:
> I have a case where Postgres chooses the wrong index and I'm not sure what
> to do about it:

The core problem here seems to be a poor estimate for the selectivity
of "WHERE cropped AND NOT resized":

regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized ;
...
   ->  Bitmap Heap Scan on t  (cost=35.26..6352.26 rows=91100 width=0) (actual 
time=0.121..0.190 rows=1000 loops=1)
         Recheck Cond: (cropped AND (NOT resized))
...

I think this is because the planner expects those two columns to be
independent, which they are completely not in your test data.  Perhaps
that assumption is more true in your real-world data, but since you're
here complaining, I suppose not :-(.  What you can do about that, in
recent Postgres versions, is to create extended statistics on the
combination of the columns:

regression=# create statistics t_stats on cropped, resized from t;
CREATE STATISTICS
regression=# analyze t;
ANALYZE
regression=# EXPLAIN ANALYZE                                      
SELECT count(*) FROM t
WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
                                                          QUERY PLAN            
                                              
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3145.15..3145.16 rows=1 width=8) (actual time=9.765..9.766 
rows=1 loops=1)
   ->  Index Scan using idx_resized on t  (cost=0.29..3142.65 rows=1000 
width=0) (actual time=9.608..9.735 rows=1000 loops=1)
         Filter: (cropped AND (create_date < CURRENT_DATE))
         Rows Removed by Filter: 100000
 Planning Time: 0.115 ms
 Execution Time: 9.779 ms

Better estimate, but it's still using the wrong index :-(.  If we force
use of the other one:

regression=# drop index idx_resized;
DROP INDEX
regression=# EXPLAIN ANALYZE
regression-# SELECT count(*) FROM t
regression-# WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
                                                          QUERY PLAN            
                                               
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6795.38..6795.39 rows=1 width=8) (actual time=0.189..0.191 
rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=13.40..6792.88 rows=1000 width=0) (actual 
time=0.047..0.147 rows=1000 loops=1)
         Recheck Cond: (cropped AND (NOT resized))
         Filter: (create_date < CURRENT_DATE)
         Heap Blocks: exact=6
         ->  Bitmap Index Scan on specific  (cost=0.00..13.15 rows=91565 
width=0) (actual time=0.035..0.035 rows=1000 loops=1)
                                                              ^^^^^^^^^^
 Planning Time: 0.154 ms
 Execution Time: 0.241 ms

it looks like the problem is that the extended stats haven't been used
while forming the estimate of the number of index entries retrieved,
so we overestimate the cost of using this index.

That seems like a bug.  Tomas?

In the meantime, maybe you could dodge the problem by combining
"cropped" and "resized" into one multivalued column, so that there's
not a need to depend on extended stats to arrive at a decent estimate.

                        regards, tom lane


Reply via email to