Hi,

On 09/04/2018 04:16 PM, Dean Rasheed wrote:
> On 3 September 2018 at 00:17, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>> Attached is an updated version of the patch series, adopting a
>> couple of improvements - both for MCV lists and histograms.
>>
>>
>> MCV
>> ---
>>
>> For the MCV list part, I've adopted the approach proposed by Dean,
>> using base selectivity and using it to correct the non-MCV part. I
>> agree the simplicity of the approach is a nice feature, and it seems
>> to produce better estimates. I'm not sure I understand the approach
>> perfectly, but I've tried to add comments explaining how it works
>> etc.
>>
>
> Cool. Looking at this afresh after some time away, it still looks
> like a reasonable approach, and the test results are encouraging.
>
> In mcv_clauselist_selectivity(), you raise the following question:
>
>     if (matches[i] != STATS_MATCH_NONE)
>     {
>         /* XXX Shouldn't the basesel be outside the if condition? */
>         *basesel += mcv->items[i]->base_frequency;
>         s += mcv->items[i]->frequency;
>     }
>
> The reason it needs to be inside the "if" block is that what it's
> computing is the base (independent) selectivity of the clauses found
> to match the MCV stats, so that then in
> statext_clauselist_selectivity() it can be used in the following
> computation:
>
>     /* Estimated selectivity of values not covered by MCV matches */
>     other_sel = simple_sel - mcv_basesel;
>
> to give an estimate for the other clauses that aren't covered by the
> MCV stats. So I think the code is correct as it stands, but if you
> think it isn't clear enough, maybe a comment update is in order.
>
> The assumption being made is that mcv_basesel will cancel out the
> part of simple_sel that is due to clauses matching the MCV stats, so
> that we can then just add on the MCV selectivity. Of course that's
> only an approximation, and it won't be exact -- partly due to the
> fact that simple_sel and mcv_basesel are potentially computed using
> different sample rows, and so will differ in the MCV region, and
> partly because of second-order effects arising from the way that
> selectivities are combined in clauselist_selectivity_simple(). Maybe
> that's something that can be improved upon, but it doesn't seem like
> a bad initial approximation.
>

Thanks for the clarification. It's one of the comments I added while
reworking the patch, when my understanding of the approach was still
very limited. I'll replace it with a comment explaining the reasoning
in the next version.
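To make sure I understand the correction, here is how the numbers would
work out in a completely made-up example. Say the clauses match MCV
items whose observed (sample) frequencies sum to mcv_sel = 0.30, while
the base frequencies of those same items (products of the per-column
frequencies) sum to mcv_basesel = 0.10, and the plain per-column
estimate for the whole clause list is simple_sel = 0.12. Then

    other_sel = simple_sel - mcv_basesel = 0.12 - 0.10 = 0.02
    sel       = mcv_sel + other_sel      = 0.30 + 0.02 = 0.32

i.e. the independence assumption is only retained for the small part of
the estimate attributed to values outside the MCV list, while the MCV
region uses the observed frequencies instead of the naive 0.10.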
>> I've also changed how we build the MCV lists, particularly how we
>> decide how many / which items to store in the MCV list. In the
>> previous version I adopted the same algorithm we use for per-column
>> MCV lists, but in certain cases that turned out to be too
>> restrictive.
>>
>> Consider for example a table with multiple perfectly correlated
>> columns, with very few combinations. That is, something like this:
>>
>>     CREATE TABLE t (a int, b int);
>>
>>     INSERT INTO t SELECT mod(i,50), mod(i,50)
>>       FROM generate_series(1,1e6) s(i);
>>
>>     CREATE STATISTICS s (mcv) ON a,b FROM t;
>>
>> Now, the data distribution is very simple - uniform, with 50
>> distinct combinations, each representing 2% of data (and the random
>> sample should be pretty close to that).
>>
>> In these cases, analyze_mcv_list decides it does not need any MCV
>> list, because the frequency of each value is pretty much
>> 1/ndistinct. For a single column that's reasonable, but for multiple
>> correlated columns it's rather problematic. We might use the same
>> ndistinct approach (assuming we have the ndistinct coefficients),
>> but that still does not allow us to decide which combinations are
>> "valid" with respect to the data. For example, we can't decide that
>> (1,10) does not appear in the data.
>>
>> So I'm not entirely sure adopting the same analyze_mcv_list
>> algorithm for both single-column and multi-column stats is the
>> right thing to do. It may make sense to keep more items in the
>> multi-column case, for reasons that are not really valid for a
>> single column.
>>
>> For now I've added a trivial condition to simply keep all the groups
>> when possible. This probably needs more thought.
>>
>
> Ah, this is a good point. I think I see the problem here.
>
> analyze_mcv_list() works by keeping those MCV entries that are
> statistically significantly more frequent than the selectivity that
> would have otherwise been assigned to the values, which is based on
> ndistinct and nullfrac. That's not really right for multivariate
> stats though, because the selectivity that would be assigned to a
> multi-column value if it weren't in the multivariate MCV list is
> actually calculated using the product of individual column
> selectivities. Fortunately we now calculate this (base_frequency),
> so actually I think what's needed is a custom version of
> analyze_mcv_list() that keeps MCV entries if the observed frequency
> is statistically significantly larger than the base frequency, not
> the ndistinct-based frequency.
>

That's probably a good idea, and should be an improvement in some
cases. But I think it kinda misses the second part, when a combination
is much less common than the simple product of selectivities (i.e.
values that are somehow "incompatible"). In the example above there
are only (a,b) combinations where (a == b), so there's nothing like
that, but in practice there will be some sort of noise, and you may
observe combinations where the actual frequency is much lower than the
base frequency. I suppose "significantly larger than the base
frequency" would not catch that, so we also need to consider cases
where it's significantly lower.
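Roughly something like the following sketch, perhaps. All the names
are made up, the 2-sigma threshold is arbitrary, and the real thing
would need the more careful error estimate analyze_mcv_list() uses -
it's only meant to illustrate the two-sided test:

    #include <math.h>

    /* hypothetical item layout, for illustration only */
    typedef struct McvItem
    {
        double      frequency;      /* frequency observed in sample */
        double      base_frequency; /* product of per-column freqs */
    } McvItem;

    /*
     * Keep an item if its observed frequency differs from the base
     * (independence-derived) frequency by more than ~2 standard
     * errors of a sample proportion, in either direction -- so both
     * "surprisingly common" and "surprisingly rare" combinations
     * survive the filter. Kept items are moved to the front of the
     * array; returns how many were kept.
     */
    static int
    filter_mcv_items(McvItem *items, int nitems, int samplerows)
    {
        int         nkeep = 0;
        int         i;

        for (i = 0; i < nitems; i++)
        {
            double      freq = items[i].frequency;
            double      base = items[i].base_frequency;

            /* standard error of a proportion from samplerows rows */
            double      se = sqrt(base * (1.0 - base) / samplerows);

            if (fabs(freq - base) > 2.0 * se)
                items[nkeep++] = items[i];
        }

        return nkeep;
    }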
I also wonder if we could/should somehow consider the multivariate
ndistinct estimate here. For the univariate case we use ndistinct to
estimate the average selectivity of non-MCV items, and I think it'd be
a mistake not to leverage this knowledge (when ndistinct coefficients
are available), even if it only helps with simple equality clauses.

> It might also be worthwhile doing a little more work to make the
> base_frequency values more consistent with the way individual column
> selectivities are actually calculated -- currently the patch always
> uses the observed single-column frequencies to calculate the base
> frequencies, but actually the univariate stats would only do that
> for a subset of the single-column values, and the rest would get
> assigned a fixed share of the remaining selectivity-space. Factoring
> that into the base frequency calculation ought to give a better base
> frequency estimate (for use in mcv_clauselist_selectivity() and
> statext_clauselist_selectivity()), as well as give a more principled
> cutoff threshold for deciding which multivariate MCV values to keep.
> It may be possible to reuse some of the existing code for that.
>

I agree, but I think we can leave this for later.

> The initial goal of the base frequency calculation was to replicate
> the univariate stats computations, so that it can be used to give
> the right correction to be applied to the simple_sel value. If it
> can also be used to determine how many MCV entries to keep, that's
> an added bonus.
>

Yep.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services