Re: using extended statistics to improve join estimates

Zhihong Yu Wed, 31 Mar 2021 16:44:37 -0700

Hi,

+ * has_matching_mcv
+ *     Check whether the list contains statistic of a given kind


The method name is find_matching_mcv(). It seems the method initially
returned bool but later the return type was changed.

+   StatisticExtInfo *found = NULL;

found normally is associated with bool return value. Maybe call the
variable matching_mcv or something similar.

+           /* skip items eliminated by restrictions on rel2 */
+           if (conditions2 && !conditions2[j])
+               continue;

Maybe you can add a counter recording the number of non-skipped items for
the inner loop. If this counter is 0 after the completion of one iteration,
we come out of the outer loop directly.

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas.von...@enterprisedb.com>
wrote:

> Hi,
>
> So far the extended statistics are applied only at scan level, i.e. when
> estimating selectivity for individual tables. Which is great, but joins
> are a known challenge, so let's try doing something about it ...
>
> Konstantin Knizhnik posted a patch [1] using functional dependencies to
> improve join estimates in January. It's an interesting approach, but as
> I explained in that thread I think we should try a different approach,
> similar to how we use MCV lists without extended stats. We'll probably
> end up considering functional dependencies too, but probably only as a
> fallback (similar to what we do for single-relation estimates).
>
> This is a PoC demonstrating the approach I envisioned. It's incomplete
> and has various limitations:
>
> - no expression support, just plain attribute references
> - only equality conditions
> - requires MCV lists on both sides
> - inner joins only
>
> All of this can / should be relaxed later, of course. But for a PoC this
> seems sufficient.
>
> The basic principle is fairly simple, and mimics what eqjoinsel_inner
> does. Assume we have a query:
>
>   SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)
>
> If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
> same logic as eqjoinsel_inner and "match" them together. If the MCV list
> is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
> the general idea is the same.
>
> To demonstrate this, consider a very simple example with a table that
> has a lot of dependency between the columns:
>
> ==================================================================
>
> CREATE TABLE t (a INT, b INT, c INT, d INT);
> INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
>   FROM generate_series(1,100000) s(i);
> ANALYZE t;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
>
> CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
> ANALYZE t;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
>
> ALTER STATISTICS s SET STATISTICS 10000;
> ANALYZE t;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
>
> ==================================================================
>
> The results look like this:
>
> - actual rows:             100000000
> - estimated (no stats):      1003638
> - estimated (stats, 100):  100247844
> - estimated (stats, 10k):  100000000
>
> So, in this case the extended stats help quite a bit, even with the
> default statistics target.
>
> However, there are other things we can do. For example, we can use
> restrictions (at relation level) as "conditions" to filter the MCV lits,
> and calculate conditional probability. This is useful even if there's
> just a single join condition (on one column), but there are dependencies
> between that column and the other filters. Or maybe when there are
> filters between conditions on the two sides.
>
> Consider for example these two queries:
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
>  WHERE t1.c < 25 AND t2.d < 25;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
>  WHERE t1.c < 25 AND t2.d > 75;
>
> In this particular case we know that (a = b = c = d) so the two filters
> are somewhat redundant. The regular estimates will ignore that, but with
> MCV we can actually detect that - when we combine the two MCV lists, we
> essentially calculate MCV (a,b,t1.c,t2.d), and use that.
>
>                                   Q1          Q2
> - actual rows:              25000000           0
> - estimated (no stats):        62158       60241
> - estimated (stats, 100):   25047900           1
> - estimated (stats, 10k):   25000000           1
>
> Obviously, the accuracy depends on how representative the MCV list is
> (what fraction of the data it represents), and in this case it works
> fairly nicely. A lot of the future work will be about handling cases
> when it represents only part of the data.
>
> The attached PoC patch has a number of FIXME and XXX, describing stuff I
> ignored to keep it simple, possible future improvement. And so on.
>
>
> regards
>
>
> [1]
>
> https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

Re: using extended statistics to improve join estimates

Reply via email to