Hi Konstantin,

Thanks for working on this! Using extended statistics to improve join cardinality estimates was definitely on my radar, and this patch seems like a good start.
I had two basic ideas about how we might improve join estimates:

(a) use per-table extended statistics to estimate join conditions

(b) invent multi-table extended statistics (requires inventing how to sample the tables in a coordinated way, etc.)

This patch aims to do (a), which is perfectly reasonable - I think we can achieve significant improvements this way. I have some ideas about (b) too, but it seems harder and deserves a separate thread/patch.

The patch includes some *very* interesting ideas, but I think it does them too late and at the wrong level of abstraction. By that I mean:

1) I don't think the code in clausesel.c should deal with extended statistics directly - it requires far too much knowledge about the different types of extended stats, which clauses they support, etc. Allowing stats on expressions will make this even worse. It'd be better to do that in extended_stats.c, like statext_clauselist_selectivity.

2) In clauselist_selectivity_ext, functional dependencies are applied in the part that processes the remaining clauses, not in the part that estimates using extended statistics. That seems a bit confusing, and I suspect it may lead to issues - for example, it processes the clauses only incrementally, in a particular order. That probably affects the result, because it affects which functional dependencies we can apply. In the example query that's not an issue, because it only has two Vars, so it either can't apply anything (with one Var) or it can apply everything (with two Vars). But with 3 or more Vars the order would certainly matter, so it's problematic.

Moreover, it seems a bit strange that this considers dependencies only on the inner relation. Can't that lead to different join orders producing different cardinality estimates?

I think a better approach would be to either modify the existing block dealing with extended stats for a single relation to also handle join conditions, or perhaps to invent a separate block dealing with *pairs* of relations?
And it should deal with *all* join clauses for that pair of relations at once, not one by one.

As for the exact implementation, I'd imagine the overall logic to be something like this (for clauses on two joined relations):

- pick a subset of clauses with the same type of extended statistics on both sides (MCV, ndistinct, ...), and repeat until we can't apply any more statistics

- estimate the remaining clauses either using functional dependencies or in the regular (old) way

As for how to use the other types of extended statistics, I think eqjoinsel could serve as an inspiration. We should look for an MCV list and ndistinct stats on both sides of the join (possibly on some subset of the clauses), and then do the same thing eqjoinsel does, just with multiple columns.

Note: I'm not sure what to do when we find the stats only on one side. Perhaps we should assume the other side has no correlations and use per-column statistics (seems reasonable), or maybe not apply anything at all (seems a bit too harsh).

Anyway, if some clauses remain unestimated after that, we could try applying functional dependencies similarly to what this patch does. That's also consistent with statext_clauselist_selectivity - it too tries to apply MCV lists first, and only then functional dependencies.

BTW, should this still rely on oprrest (e.g. F_EQSEL)? That's the selectivity function for restriction (non-join) clauses, so maybe we should be looking at oprjoin when dealing with joins? Not sure.

One bit that I find *very* interesting is the calc_joinrel_size_estimate part, with this comment:

    /*
     * Try to take in account functional dependencies between attributes
     * of clauses pushed-down to joined relations and retstrictlist
     * clause. Right now we consider only case of restrictlist consists of
     * one clause.
     */

If I understand the comment and the code after it, it essentially tries to apply extended statistics to both the join clauses and the filters at the relation level.
To give an example: with a query like

    SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10

we would be looking at statistics on t1(a,b), because we're interested in estimating the conditional probability distribution

    P(t1.a = a? | t1.b = 10)

I think that's extremely interesting and powerful, because it allows us to "restrict" the multi-column MCV lists. We could probably estimate the number of distinct "a" values in rows with "b=10" as

    ndistinct(a,b) / ndistinct(b)

and do various other interesting stuff like this. That will require some improvements to the extended statistics code (to allow passing in a list of conditions), but that's quite doable. I think the code actually did something like that originally ;-)

Obviously, none of this is achievable for PG14, as we're in the middle of the last CF. But if you're interested in working on this for PG15, I'd love to cooperate on that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company