On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> Hi, > > attached is an improved version of this patch, addressing some of the > points mentioned in my last message: > > 1) Adds a couple regression tests, testing various join cases with > expressions, additional conditions, etc. > > 2) Adds support for expressions, so the join clauses don't need to > reference just simple columns. So e.g. this can benefit from extended > statistics, when defined on the expressions: > > -- CREATE STATISTICS s1 ON (a+1), b FROM t1; > -- CREATE STATISTICS s2 ON (a+1), b FROM t2; > > SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b); > > 3) Can combine extended statistics and regular (per-column) statistics. > The previous version required extended statistics MCV on both sides of > the join, but adding extended statistics on both sides may impractical > (e.g. if one side does not have correlated columns it's silly to have to > add it just to make this patch happy). > > For example you may have extended stats on the dimension table but not > the fact table, and the patch still can combine those two. Of course, if > there's no MCV on either side, we can't do much. > > So this patch works when both sides have extended statistics MCV, or > when one side has extended MCV and the other side regular MCV. It might > seem silly, but the extended MCV allows considering additional baserel > conditions (if there are any). > > > examples > ======== > > The table / data is very simple, but hopefully good enough for some > simple examples. > > create table t1 (a int, b int, c int); > create table t2 (a int, b int, c int); > > insert into t1 select mod(i,50), mod(i,50), mod(i,50) > from generate_series(1,1000) s(i); > > insert into t2 select mod(i,50), mod(i,50), mod(i,50) > from generate_series(1,1000) s(i); > > analyze t1, t2; > > First, without extended stats (just the first line of explain analyze, > to keep the message short): > > explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b); > > QUERY PLAN > ------------------------------------------------------------------------ > Hash Join (cost=31.00..106.00 rows=400 width=24) > (actual time=5.426..22.678 rows=20000 loops=1) > > > explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25; > > QUERY PLAN > ------------------------------------------------------------------------ > Hash Join (cost=28.50..160.75 rows=10000 width=24) > (actual time=5.325..19.760 rows=10000 loops=1) > > > explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < > 25 and t2.c > 25; > > QUERY PLAN > ------------------------------------------------------------------------ > Hash Join (cost=24.50..104.75 rows=4800 width=24) > (actual time=5.618..5.639 rows=0 loops=1) > > > Now, let's create statistics: > > create statistics s1 on a, b, c from t1 ; > create statistics s2 on a, b, c from t2 ; > analyze t1, t2; > > and now the same queries again: > > explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b); > > QUERY PLAN > ------------------------------------------------------------------------ > Hash Join (cost=31.00..106.00 rows=20000 width=24) > (actual time=5.448..22.713 rows=20000 loops=1) > > explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25; > > QUERY PLAN > ------------------------------------------------------------------------ > Hash Join (cost=28.50..160.75 rows=10000 width=24) > (actual time=5.317..16.680 rows=10000 loops=1) > > > explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < > 25 and t2.c > 25; > > QUERY PLAN > ------------------------------------------------------------------------ > Hash Join (cost=24.50..104.75 rows=1 width=24) > (actual time=5.647..5.667 rows=0 loops=1) > > > Those examples are a bit simplistic, but the improvements are fairly > clear I think. > > > limitations & open issues > ========================= > > Let's talk about the main general restrictions and open issues in the > current patch that I can think of at the moment. > > 1) statistics covering all join clauses > > The patch requires the statistics to cover all the join clauses, mostly > because it simplifies the implementation. This means that to use the > per-column statistics, there has to be just a single join clause. > > AFAICS this could be relaxed and we could use multiple statistics to > estimate the clauses. But it'd make selection of statistics much more > complicated, because we have to pick "matching" statistics on both sides > of the join. So it seems like an overkill, and most joins have very few > conditions anyway. > > > 2) only equality join clauses > > The other restriction is that at the moment this only supports simple > equality clauses, combined with AND. So for example this is supported > > t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1)) > > while these are not: > > t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1)) > > t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1)) > > t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c))) > > I'm not entirely sure these restrictions can be relaxed. It's not that > difficult to evaluate these cases when matching items between the MCV > lists, similarly to how we evaluate bitmaps for baserel estimates. > > But I'm not sure what to do about the part not covered by the MCV lists. > The eqjoinsel() approach uses ndistinct estimates for that, but that > only works for AND clauses, I think. How would that work for OR? > > Similarly, I'm not sure we can do much for non-equality conditions, but > those are currently estimated as default selectivity in selfuncs.c. > > > 3) estimation by join pairs > > At the moment, the estimates are calculated for pairs of relations, so > for example given a query > > explain analyze > select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) > join t3 on (t1.b = t3.b and t2.c = t3.c); > > we'll estimate the first join (t1,t2) just fine, but then the second > join actually combines (t1,t2,t3). What the patch currently does is it > splits it into (t1,t2) and (t2,t3) and estimates those. I wonder if this > should actually combine all three MCVs at once - we're pretty much > combining the MCVs into one large MCV representing the join result. > > But I haven't done that yet, as it requires the MCVs to be combined > using the join clauses (overlap in a way), but I'm not sure how likely > that is in practice. In the example it could help, but that's a bit > artificial example. > > > 4) still just inner equi-joins > > I haven't done any work on extending this to outer joins etc. Adding > outer and semi joins should not be complicated, mostly copying and > tweaking what eqjoinsel() does. > > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > Hi, + conditions2 = statext_determine_join_restrictions(root, rel, mcv); + + /* if the new statistics covers more conditions, use it */ + if (list_length(conditions2) > list_length(conditions1)) + { + mcv = stat; It seems conditions2 is calculated using mcv, I wonder why mcv is replaced by stat (for conditions1 whose length is shorter) ? Cheers