Hi,

On 09/04/2018 04:16 PM, Dean Rasheed wrote:
> On 3 September 2018 at 00:17, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>> Attached is an updated version of the patch series, adopting a
>> couple of improvements - both for MCV lists and histograms.
>>
>>
>> MCV
>> ---
>>
>> For the MCV list part, I've adopted the approach proposed by Dean,
>> using base selectivity and using it to correct the non-MCV part. I
>> agree the simplicity of the approach is a nice feature, and it seems
>> to produce better estimates. I'm not sure I understand the approach
>> perfectly, but I've tried to add comments explaining how it works
>> etc.
>>
>
> Cool. Looking at this afresh after some time away, it still looks
> like a reasonable approach, and the test results are encouraging.
>
> In mcv_clauselist_selectivity(), you raise the following question:
>
>     if (matches[i] != STATS_MATCH_NONE)
>     {
>         /* XXX Shouldn't the basesel be outside the if condition? */
>         *basesel += mcv->items[i]->base_frequency;
>         s += mcv->items[i]->frequency;
>     }
>
> The reason it needs to be inside the "if" block is that what it's
> computing is the base (independent) selectivity of the clauses found
> to match the MCV stats, so that then in
> statext_clauselist_selectivity() it can be used in the following
> computation:
>
>     /* Estimated selectivity of values not covered by MCV matches */
>     other_sel = simple_sel - mcv_basesel;
>
> to give an estimate for the other clauses that aren't covered by the
> MCV stats. So I think the code is correct as it stands, but if you
> think it isn't clear enough, maybe a comment update is in order.
>
> The assumption being made is that mcv_basesel will cancel out the
> part of simple_sel that is due to clauses matching the MCV stats, so
> that we can then just add on the MCV selectivity. Of course that's
> only an approximation, and it won't be exact -- partly due to the
> fact that simple_sel and mcv_basesel are potentially computed using
> different sample rows, and so will differ in the MCV region, and
> partly because of second-order effects arising from the way that
> selectivities are combined in clauselist_selectivity_simple(). Maybe
> that's something that can be improved upon, but it doesn't seem like
> a bad initial approximation.
>

Thanks for the clarification. It's one of the comments I added while
reworking the patch, when my understanding of the approach was still
very limited. I'll replace it with a comment explaining the reasoning
in the next version.
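To make sure I understand the correction, here is how the numbers would
work out in a completely made-up example. Say the clauses match MCV
items whose observed (sample) frequencies sum to mcv_sel = 0.30, while
the base frequencies of those same items (products of the per-column
frequencies) sum to mcv_basesel = 0.10, and the plain per-column
estimate for the whole clause list is simple_sel = 0.12. Then

    other_sel = simple_sel - mcv_basesel = 0.12 - 0.10 = 0.02
    sel       = mcv_sel + other_sel      = 0.30 + 0.02 = 0.32

i.e. the independence assumption is only retained for the small part of
the estimate attributed to values outside the MCV list, while the MCV
region uses the observed frequencies instead of the naive 0.10.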
>> I've also changed how we build the MCV lists, particularly how we
>> decide how many / which items to store in the MCV list. In the
>> previous version I adopted the same algorithm we use for per-column
>> MCV lists, but in certain cases that turned out to be too
>> restrictive.
>>
>> Consider for example a table with multiple perfectly correlated
>> columns, with very few combinations. That is, something like this:
>>
>>     CREATE TABLE t (a int, b int);
>>
>>     INSERT INTO t SELECT mod(i,50), mod(i,50)
>>       FROM generate_series(1,1e6) s(i);
>>
>>     CREATE STATISTICS s (mcv) ON a,b FROM t;
>>
>> Now, the data distribution is very simple - uniform, with 50
>> distinct combinations, each representing 2% of data (and the random
>> sample should be pretty close to that).
>>
>> In these cases, analyze_mcv_list decides it does not need any MCV
>> list, because the frequency of each value is pretty much
>> 1/ndistinct. For a single column that's reasonable, but for multiple
>> correlated columns it's rather problematic. We might use the same
>> ndistinct approach (assuming we have the ndistinct coefficients),
>> but that still does not allow us to decide which combinations are
>> "valid" with respect to the data. For example, we can't decide that
>> (1,10) does not appear in the data.
>>
>> So I'm not entirely sure adopting the same analyze_mcv_list
>> algorithm for both single-column and multi-column stats is the
>> right thing to do. It may make sense to keep more items in the
>> multi-column case, for reasons that are not really valid for a
>> single column.
>>
>> For now I've added a trivial condition to simply keep all the groups
>> when possible. This probably needs more thought.
>>
>
> Ah, this is a good point. I think I see the problem here.
>
> analyze_mcv_list() works by keeping those MCV entries that are
> statistically significantly more frequent than the selectivity that
> would have otherwise been assigned to the values, which is based on
> ndistinct and nullfrac. That's not really right for multivariate
> stats though, because the selectivity that would be assigned to a
> multi-column value if it weren't in the multivariate MCV list is
> actually calculated using the product of individual column
> selectivities. Fortunately we now calculate this (base_frequency),
> so actually I think what's needed is a custom version of
> analyze_mcv_list() that keeps MCV entries if the observed frequency
> is statistically significantly larger than the base frequency, not
> the ndistinct-based frequency.
>

That's probably a good idea, and should be an improvement in some
cases. But I think it kinda misses the second part, when a combination
is much less common than the simple product of selectivities (i.e.
values that are somehow "incompatible"). In the example above there
are only (a,b) combinations where (a == b), so there's nothing like
that, but in practice there will be some sort of noise, and you may
observe combinations where the actual frequency is much lower than the
base frequency. I suppose "significantly larger than the base
frequency" would not catch that, so we also need to consider cases
where it's significantly lower.
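Roughly something like the following sketch, perhaps. All the names
are made up, the 2-sigma threshold is arbitrary, and the real thing
would need the more careful error estimate analyze_mcv_list() uses -
it's only meant to illustrate the two-sided test:

    #include <math.h>

    /* hypothetical item layout, for illustration only */
    typedef struct McvItem
    {
        double      frequency;      /* frequency observed in sample */
        double      base_frequency; /* product of per-column freqs */
    } McvItem;

    /*
     * Keep an item if its observed frequency differs from the base
     * (independence-derived) frequency by more than ~2 standard
     * errors of a sample proportion, in either direction -- so both
     * "surprisingly common" and "surprisingly rare" combinations
     * survive the filter. Kept items are moved to the front of the
     * array; returns how many were kept.
     */
    static int
    filter_mcv_items(McvItem *items, int nitems, int samplerows)
    {
        int         nkeep = 0;
        int         i;

        for (i = 0; i < nitems; i++)
        {
            double      freq = items[i].frequency;
            double      base = items[i].base_frequency;

            /* standard error of a proportion from samplerows rows */
            double      se = sqrt(base * (1.0 - base) / samplerows);

            if (fabs(freq - base) > 2.0 * se)
                items[nkeep++] = items[i];
        }

        return nkeep;
    }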
I also wonder if we could/should somehow consider the multivariate
ndistinct estimate here. For the univariate case we use ndistinct to
estimate the average selectivity of non-MCV items, and I think it'd be
a mistake not to leverage this knowledge (when ndistinct coefficients
are available), even if it only helps with simple equality clauses.

> It might also be worthwhile doing a little more work to make the
> base_frequency values more consistent with the way individual column
> selectivities are actually calculated -- currently the patch always
> uses the observed single-column frequencies to calculate the base
> frequencies, but actually the univariate stats would only do that
> for a subset of the single-column values, and the rest would get
> assigned a fixed share of the remaining selectivity-space. Factoring
> that into the base frequency calculation ought to give a better base
> frequency estimate (for use in mcv_clauselist_selectivity() and
> statext_clauselist_selectivity()), as well as give a more principled
> cutoff threshold for deciding which multivariate MCV values to keep.
> It may be possible to reuse some of the existing code for that.
>

I agree, but I think we can leave this for later.

> The initial goal of the base frequency calculation was to replicate
> the univariate stats computations, so that it can be used to give
> the right correction to be applied to the simple_sel value. If it
> can also be used to determine how many MCV entries to keep, that's
> an added bonus.
>

Yep.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services