On 08/03/2018 04:24 PM, Dean Rasheed wrote:
On 17 July 2018 at 14:03, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
For equalities it's going to be hard. The only thing I can think of at the
moment is checking if there are any matching buckets at all, and using that
to decide whether to extrapolate the MCV selectivity to the non-MCV part or
not (or perhaps to what part of the non-MCV part).
So I decided to play a little more with this, experimenting with a
much simpler approach -- this is for MCV's only at the moment, see the
attached (very much WIP) patch (no doc or test updates, and lots of
areas for improvement).
The basic idea when building the MCV stats is to not just record the
frequency of each combination of values, but also what I'm calling the
"base frequency" -- that is the frequency that that combination of
values would have if the columns were independent (i.e., the product
of each value's individual frequency).
The reasoning then, is that if we find an MCV entry matching the query
clauses, the difference (frequency - base_frequency) can be viewed as
a correction to be applied to the selectivity returned by
clauselist_selectivity_simple(). If all possible values were covered
by matching MCV entries, the sum of the base frequencies of the
matching MCV entries would approximately cancel out with the simple
selectivity, and only the MCV frequencies would be left (ignoring
second order effects arising from the fact that
clauselist_selectivity_simple() doesn't just sum up disjoint
possibilities). For partial matches, it will use what multivariate
stats are available to improve upon the simple selectivity.
I wondered about just storing the difference (frequency -
base_frequency) in the stats, but it's actually useful to have both
values, because then the total of all the MCV frequencies can be used
to set an upper bound on the non-MCV part.
The advantage of this approach is that it is very simple, and in
theory ought to be reasonably applicable to arbitrary combinations of
clauses. Also, it naturally falls back to the univariate-based
estimate when there are no matching MCV entries. In fact, even when
there are no matching MCV entries, it can still improve upon the
univariate estimate by capping it to 1-total_mcv_sel.
I tested it with the same data posted previously and a few simple
queries, and the initial results are quite encouraging. Where the
previous patch sometimes gave noticeable over- or under-estimates,
this patch generally did better:
Query Actual rows Est (HEAD) Est (24 Jun patch) Est (new patch)
Q1 50000 12625 48631 49308
Q2 40000 9375 40739 38710
Q3 90000 21644 172688 88018
Q4 140000 52048 267528 138228
Q5 140000 52978 267528 138228
Q6 140000 52050 267528 138228
Q7 829942 777806 149886 822788
Q8 749942 748302 692686 747922
Q9 15000 40989 27595 14131
Q10 15997 49853 27595 23121
Q1: a=1 and b=1
Q2: a=1 and b=2
Q3: a=1 and (b=1 or b=2)
Q4: (a=1 or a=2) and (b=1 or b=2)
Q5: (a=1 or a=2) and (b<=2)
Q6: (a=1 or a=2 or a=4) and (b=1 or b=2)
Q7: (a=1 or a=2) and not (b=2)
Q8: (a=1 or a=2) and not (b=1 or b=2)
Q9: a=3 and b>0 and b<3
Q10: a=3 and b>0 and b<1000
Interesting idea, and the improvements certainly seem encouraging.
I wonder what a counter-example would look like - I think the MCV and
non-MCV parts would have to behave very differently (one perfectly
dependent, the other perfectly independent). But that does seem very
likely, and even if it was there's not much we can do about such cases.
I've not tried anything with histograms. Possibly the histograms could
be used as-is, to replace the non-MCV part (other_sel). Or, a similar
approach could be used, recording the base frequency of each histogram
bucket, and then using that to refine the other_sel estimate. Either
way, I think it would be necessary to exclude equality clauses from
the histograms, otherwise MCVs might end up being double-counted.
I do have an idea about histograms. I didn't have time to hack on it
yet, but I think it could work in combination with your MCV algorithm.
Essentially there are two related issues with histograms:
1) equality conditions
Histograms work nicely with inequalities, not that well for equalities.
For equality clauses, we can estimate the selectivity as 1/ndistinct,
similarly to what we do in 1-D cases (we can use ndistinct coefficients
if we have them, and MCV tracking the common combinations).
If there are both equalities and inequalities, we can then use the
equality clauses merely as condition (to limit the set of buckets), and
evaluate the inequalities for those buckets only. Essentially compute
P(equals + inequals) = P(equals) * P(inequals | equals)
IMHO that should help with estimating selectivity of equality clauses.
2) estimating bucket selectivity
The other question is how to combine selectivities of multiple clauses
for a single bucket. I think the linear approximation (convert_scalar or
something like that) and computing geometric mean (as you proposed) is a
solid plan.
I do have this on my TODO list for this week, unless something urgent
comes up.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services