> Somebody wrote to me a few days ago that the BRIN cost estimation is > rather poor. One immediately obvious issue which I think is easily > fixed is the correlation estimate, which is currently hardcoded to 1. > > Since a BRIN scan always entails a scan of the relation in physical > order, it's simple to produce a better estimate for that, per the > attached patch. (Note: trying to run this on expression indexes will > cause a crash. I need to complete that part ...) > > There are other improvements we've been discussing, but I'm unclear that > I will have time to study them to get a useful patch quickly enough. If > somebody else wants to mess with this, please get in touch. > > Thoughts?
As we have discusses, the correlation is not used for bitmap index scan cost estimation. I think the best we can do in here is to somehow increase the selectivity. I suggest something like this: /* * Index selectivity is important for the planner to calculate the cost * of the bitmap heap scan. Unfortunately, we don't have a robust way to * estimate selectivity of BRIN. It can dependent on many things. This * is a long rationale about the incomplete calculation we have at the * moment. * * Our starting point is that BRIN selectivity has to be less than the * selectivity of the btree. We are using a product of logical and * physical selectivities to achieve this. The equality of * * (1 + logical_selectivity) * (1 + physical_selectivity) - 1 * * is used to make sure the result is not less than any of the values. * * The logical selectivity is calculated using the indexable expressions * of the WHERE clause. The physical selectivity is calculated using * the block proportion and the maximum correlation. The block * proportion is a comparable value with selectivity. It is the * selectivity of the smallest unit of the index. The final selectivity * can never be less than that. * * Using the contrary of the correlation by subtracting it from 1 is not * not really a comparable value with the selectivity. It is just a * value between 0 and 1. On the other hand, it is the only value * related to the BRIN quality, we have available right now. We are * using the arithmetic of it with the block proportion to normalise it. * This part of the physical selectivity is likely to be more effective * than the block proportion in many circumstances as there would be * many blocks on big tables. * * Using the contrary of the correlation of a column as selectivity of * the index is wrong in many ways. First of all, it cannot be applied * to all BRIN operator classes. It makes sense for the main built-in * operator class "minmax", and makes a little sense for the other one * "inclusion". It wouldn't probably make any sense for a bloom filter * implementation, if there would be any. Maybe, we should push down * this function to the operator class, but there is not enough reason * to do it right now. * * Second, correlation is not dependent to any indexed expression. It * probably doesn't make any sense for the complicated operators. It * would probably effect basic comparison operators differently than * equality operator. The effect would even differ by count of those * expressions. For example, x IN (10, 20, 30) would be effected from * correlation more than x = 15, even when their selectivities are the * same. * * Last but not least, the correlation is a single value for the whole * range. The indexed table can partly be very well correlated, but * the correlation value can still be very low. For example, if a * perfectly correlated table is copied 4 times, the correlation would * be 0.25, although the index would be almost as good as the version on * the initial table. Or the expression can match the better correlated * part of the table. It is not hard to imagine more scenarios where * the correlation is a bad value to use as the selectivity. We should * probably improve this by collecting more statistics, one day. * * Another problem in here is that the caller assumes the selectivity * by tuples. It might have been better, if we had a way to return it * as some number of pages. On the other hand, even though we know about * the index, it is not too much easier for us to estimate the number of * matching pages then it is for the caller. We are likely to make too * much mistake by relying on the correlation, anyway. We are at least * not making it worse in here. */ blockSelectivity = (blockProportion + 1 - *indexCorrelation) / 2; selec = (1 + qualSelectivity) * (1 + blockSelectivity) - 1; CLAMP_PROBABILITY(selec); The patch is attached. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers