Many thanks Justin & Jeff for your replies. Presumbly the conditions are partially redundant, so loc_id => user_id
Yes you're right. I had overlooked this. I've done some further testing and this confirms what you say: if the WHERE columns are independent, then the Planner makes a reasonable estimate of the number of rows. irrespective of whether it uses a single index or a "Bitmap And" of two indexes. I've also tested "create statistics" on Postgres 12: - gives good estimate with WHERE user_id = 'USER123' and loc_id = 'LOC12345678' - but Plan Rows = 5 with WHERE user_id = 'USER123' and loc_id = ANY('{ LOC12345678 }'::text[]) - Note: if I omit the user_id condition then it gives a good estimate, i.e. with WHERE loc_id = ANY('{ LOC12345678 }'::text[]) So statistics objects don't seem to be able to handle the combination of dependencies and arrays (at least in 12.2). Steve On Wed, 6 May 2020 at 19:25, Jeff Janes <jeff.ja...@gmail.com> wrote: > On Wed, May 6, 2020 at 12:20 PM Steve Pritchard <steve.pritch...@bto.org> > wrote: > >> Version: Postgres 9.6.3 production system (but also tested on Postgres 12) >> >> For my query the Planner is sometimes choosing an execution plan that >> uses "Bitmap And" (depending on the parameters): >> >> -> Bitmap Heap Scan on observation (cost=484.92..488.93 rows=1 >> width=203) (actual time=233.129..330.886 rows=15636 loops=1) >> Recheck Cond: (((user_id)::text = 'USER123'::text) AND ((loc_id)::text >> = ANY ('{LOC12345678}'::text[]))) >> > > If you change " = ANY(array_of_one)" to " = scalar", does that change > anything? You might be able to fix this (in v12) using CREATE STATISTICS, > but I don't know if that mechanism can see through the ANY(array_of_one) > wrapper. > > >> Note that in cases where the Planner selects a single Index Scan for this >> query (with different parameters), the Planner makes an accurate estimate >> of the number of rows and then makes sensible selections of joins (i.e. >> quick). >> i.e. the issue seems to be with the "Bitmap And". >> > > > I don't know if this nitpick matters, but I don't think that that is how > the planner works. The row estimates work from the top down, not the > bottom up. The row estimate of 1 is based on what conditions the bitmap > heap scan implements, it is not arrived at by combining the estimates from > the index scans below it. If it were to change to a different type of node > but implemented the same conditions, I think it would have the same row > estimate. > > >> >> I don't have an index with both user_id & loc_id, as this is one of >> several different combinations that can arise (it would require quite a few >> indexes to cover all the possible combinations). >> > > Are you actually experiencing problems with those other combinations as > well? If not, I wouldn't worry about solving hypothetical problems. If > those other combinations are actually problems and you go with CREATE > STATISTICS, then you would have to be creating a lot of different > statistics. That would still be ugly, but at least the overhead for > statistics is lower than for indexes. > > >> However if I did have such an index, the planner would presumably be able >> to use the statistics for user_id and loc_id to estimate the number of rows. >> > > Indexes on physical columns do not have statistics, so making that index > would not help with the estimation. (Expressional indexes do have > statistics, but I don't see that helping you here). So while this node > would execute faster with that index, it would still be kicking the unshown > nested loop left join 15,636 times when it thinks it will be doing it > once, and so would still be slow. The most robust solution might be to > make the outer part of that nested loop left join faster, so that your > system would be more tolerant of statistics problems. > > >> >> So why can't it make an accurate estimate of the rows with a "Bitmap And" >> & " Bitmap Heap Scan"? (as above) >> > > In the absence of custom statistics, it assumes the selectivities of user_id > = 'USER123', of loc_id = ANY ('{LOC12345678}'::text[]), and of taxa = > 'Birds' are all independent of each other and can be multiplied to arrive > at the overall selectivity. But clearly that is not the case. Bird > watchers mostly watch near where they live, not in random other places. > > Cheers, > > Jeff > -- Steve Pritchard Database Developer British Trust for Ornithology, The Nunnery, Thetford, Norfolk IP24 2PU, UK Tel: +44 (0)1842 750050, fax: +44 (0)1842 750030 Registered Charity No 216652 (England & Wales) No SC039193 (Scotland) Company Limited by Guarantee No 357284 (England & Wales)