On Mon, Mar 15, 2021 at 8:42 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote:
> > > On 11.03.2021 03:47, Tomas Vondra wrote: > > Hi Konstantin, > > > > Thanks for working on this! Using extended statistics to improve join > > cardinality estimates was definitely on my radar, and this patch seems > > like a good start. > > > > I had two basic ideas about how we might improve join estimates: > > > > (a) use per-table extended statistics to estimate join conditions > > > > (b) invent multi-table extended statistics (requires inventing how to > > sample the tables in a coordinated way, etc.) > > > > This patch aims to do (a) which is perfectly reasonable - I think we can > > achieve significant improvements this way. I have some ideas about (b), > > but it seems harder and for a separate thread/patch. > > > > > > The patch includes some *very* interesting ideas, but I think it's does > > them too late and at the wrong level of abstraction. I mean that: > > > > 1) I don't think the code in clausesel.c should deal with extended > > statistics directly - it requires far too much knowledge about different > > types of extended stats, what clauses are supported by them, etc. > > Allowing stats on expressions will make this even worse. > > > > Better do that in extended_stats.c, like statext_clauselist_selectivity. > > > > 2) in clauselist_selectivity_ext, functional dependencies are applied in > > the part that processes remaining clauses, not estimated using extended > > statistics. That seems a bit confusing, and I suspect it may lead to > > issues - for example, it only processes the clauses incrementally, in a > > particular order. That probably affects the result, because it affects > > which functional dependencies we can apply. > > > > In the example query that's not an issue, because it only has two Vars, > > so it either can't apply anything (with one Var) or it can apply > > everything (with two Vars). But with 3 or more Vars the order would > > certainly matter, so it's problematic. > > > > > > Moreover, it seems a bit strange that this considers dependencies only > > on the inner relation. Can't that lead to issues with different join > > orders producing different cardinality estimates? > > > > > > I think a better approach would be to either modify the existing block > > dealing with extended stats for a single relation to also handle join > > conditions. Or perhaps we should invent a separate block, dealing with > > *pairs* of relations? And it should deal with *all* join clauses for > > that pair of relations at once, not one by one. > > > > As for the exact implementation, I'd imagine we call overall logic to be > > something like (for clauses on two joined relations): > > > > - pick a subset of clauses with the same type of extended statistics on > > both sides (MCV, ndistinct, ...), repeat until we can't apply more > > statistics > > > > - estimate remaining clauses either using functional dependencies or in > > the regular (old) way > > > > > > As for how to use other types of extended statistics, I think eqjoinsel > > could serve as an inspiration. We should look for an MCV list and > > ndistinct stats on both sides of the join (possibly on some subset of > > clauses), and then do the same thing eqjoinsel does, just with multiple > > columns. > > > > Note: I'm not sure what to do when we find the stats only on one side. > > Perhaps we should assume the other side does not have correlations and > > use per-column statistics (seems reasonable), or maybe just not apply > > anything (seems a bit too harsh). > > > > Anyway, if there are some non-estimated clauses, we could try applying > > functional dependencies similarly to what this patch does. It's also > > consistent with statext_clauselist_selectivity - that also tries to > > apply MCV lists first, and only then we try functional dependencies. > > > > > > BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the > > selectivity function for restriction (non-join) clauses, so maybe we > > should be looking at oprjoin when dealing with joins? Not sure. > > > > > > One bit that I find *very* interesting is the calc_joinrel_size_estimate > > part, with this comment: > > > > /* > > * Try to take in account functional dependencies between attributes > > * of clauses pushed-down to joined relations and retstrictlist > > * clause. Right now we consider only case of restrictlist consists of > > * one clause. > > */ > > > > If I understand the comment and the code after it, it essentially tries > > to apply extended statistics from both the join clauses and filters at > > the relation level. That is, with a query like > > > > SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10 > > > > we would be looking at statistics on t1(a,b), because we're interested > > in estimating conditional probability distribution > > > > P(t1.a = a? | t1.b = 10) > > > > I think that's extremely interesting and powerful, because it allows us > > to "restrict" the multi-column MCV lists, we could probably estimate > > number of distinct "a" values in rows with "b=10" like: > > > > ndistinct(a,b) / ndistinct(b) > > > > and do various interesting stuff like this. > > > > That will require some improvements to the extended statistics code (to > > allow passing a list of conditions), but that's quite doable. I think > > the code actually did something like that originally ;-) > > > > > > Obviously, none of this is achievable for PG14, as we're in the middle > > of the last CF. But if you're interested in working on this for PG15, > > I'd love to cooperate on that. > > > > > > regards > > > Hi Tomas, > Thank you for review of my patch. > My primary attention was to implement some kid of adaptive query > optimization based online_analyze extension and building extended > statistic on demand. > I have change clausesel.c because right now extended statistic is not > used for join selectivity estimation and manual or automatic adding such > statistic can help to > choose more efficient plan for queries with joins. > I agree wit you that it can be done in better way, handling more use cases. > I will be glad to cooperate with you in improving join selectivity > estimation using extended statistic. > > > > The patch does not compile, and needs your attention. https://cirrus-ci.com/task/6397726985289728 clausesel.c:74:28: error: too few arguments to function ‘choose_best_statistics’ StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES, ^~~~~~~~~~~~~~~~~~~~~~ In file included from clausesel.c:24: ../../../../src/include/statistics/statistics.h:123:26: note: declared here exter I am changing the status to "Waiting on Author". -- Ibrar Ahmed