Hi, + * has_matching_mcv + * Check whether the list contains statistic of a given kind
The method name is find_matching_mcv(). It seems the method initially returned bool but later the return type was changed. + StatisticExtInfo *found = NULL; found normally is associated with bool return value. Maybe call the variable matching_mcv or something similar. + /* skip items eliminated by restrictions on rel2 */ + if (conditions2 && !conditions2[j]) + continue; Maybe you can add a counter recording the number of non-skipped items for the inner loop. If this counter is 0 after the completion of one iteration, we come out of the outer loop directly. Cheers On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > Hi, > > So far the extended statistics are applied only at scan level, i.e. when > estimating selectivity for individual tables. Which is great, but joins > are a known challenge, so let's try doing something about it ... > > Konstantin Knizhnik posted a patch [1] using functional dependencies to > improve join estimates in January. It's an interesting approach, but as > I explained in that thread I think we should try a different approach, > similar to how we use MCV lists without extended stats. We'll probably > end up considering functional dependencies too, but probably only as a > fallback (similar to what we do for single-relation estimates). > > This is a PoC demonstrating the approach I envisioned. It's incomplete > and has various limitations: > > - no expression support, just plain attribute references > - only equality conditions > - requires MCV lists on both sides > - inner joins only > > All of this can / should be relaxed later, of course. But for a PoC this > seems sufficient. > > The basic principle is fairly simple, and mimics what eqjoinsel_inner > does. Assume we have a query: > > SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b) > > If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the > same logic as eqjoinsel_inner and "match" them together. If the MCV list > is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but > the general idea is the same. > > To demonstrate this, consider a very simple example with a table that > has a lot of dependency between the columns: > > ================================================================== > > CREATE TABLE t (a INT, b INT, c INT, d INT); > INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100) > FROM generate_series(1,100000) s(i); > ANALYZE t; > > SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b); > > CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t; > ANALYZE t; > > SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b); > > ALTER STATISTICS s SET STATISTICS 10000; > ANALYZE t; > > SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b); > > ================================================================== > > The results look like this: > > - actual rows: 100000000 > - estimated (no stats): 1003638 > - estimated (stats, 100): 100247844 > - estimated (stats, 10k): 100000000 > > So, in this case the extended stats help quite a bit, even with the > default statistics target. > > However, there are other things we can do. For example, we can use > restrictions (at relation level) as "conditions" to filter the MCV lits, > and calculate conditional probability. This is useful even if there's > just a single join condition (on one column), but there are dependencies > between that column and the other filters. Or maybe when there are > filters between conditions on the two sides. > > Consider for example these two queries: > > SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b) > WHERE t1.c < 25 AND t2.d < 25; > > SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b) > WHERE t1.c < 25 AND t2.d > 75; > > In this particular case we know that (a = b = c = d) so the two filters > are somewhat redundant. The regular estimates will ignore that, but with > MCV we can actually detect that - when we combine the two MCV lists, we > essentially calculate MCV (a,b,t1.c,t2.d), and use that. > > Q1 Q2 > - actual rows: 25000000 0 > - estimated (no stats): 62158 60241 > - estimated (stats, 100): 25047900 1 > - estimated (stats, 10k): 25000000 1 > > Obviously, the accuracy depends on how representative the MCV list is > (what fraction of the data it represents), and in this case it works > fairly nicely. A lot of the future work will be about handling cases > when it represents only part of the data. > > The attached PoC patch has a number of FIXME and XXX, describing stuff I > ignored to keep it simple, possible future improvement. And so on. > > > regards > > > [1] > > https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >