Re: using extended statistics to improve join estimates

Tomas Vondra Mon, 13 Dec 2021 05:21:06 -0800

On 11/22/21 02:23, Justin Pryzby wrote:

Your regression tests include two errors, which appear to be accidental, and
fixing the error shows that this case is being estimated poorly.


+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists


D'oh, what a silly mistake ...

You're right fixing the DROP STATISTICS results in worse estimate, butthat's actually expected for a fairly simple reason. The join conditionhas expressions on both sides, and dropping the statistics means wedon't have any MCV for the join_test_2 side. So the optimizer ends upnot using the regular estimates, as if there were no extended stats.

A couple lines later the script creates an extended statistics on thatexpression alone, which fixes this. An expression index would do thetrick too.

Attached is a patch fixing the test and also the issue reported byZhihong Yu some time ago.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 616f7f3faa818ea89c4c1cecb9aa50dd9e4fe8e7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

    P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c        |  63 +-
 src/backend/statistics/extended_stats.c       | 805 ++++++++++++++++++
 src/backend/statistics/mcv.c                  | 754 ++++++++++++++++
 .../statistics/extended_stats_internal.h      |  20 +
 src/include/statistics/statistics.h           |  12 +
 src/test/regress/expected/stats_ext.out       | 167 ++++
 src/test/regress/sql/stats_ext.sql            |  66 ++
 7 files changed, 1886 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..709e92446b 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 											 JoinType jointype,
 											 SpecialJoinInfo *sjinfo,
 											 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+										Node *clause, RestrictInfo *rinfo,
+										int varRelid, SpecialJoinInfo *sjinfo);
 
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 									  varRelid, jointype, sjinfo,
 									  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+						 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 69ca52094f..ce6a62d944 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,8 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -2608,3 +2611,805 @@ make_build_data(Relation rel, StatExtEntry *stat, int numrows, HeapTuple *rows,
 
 	return result;
 }
+
+/*
+ * statext_find_matching_mcv
+ *		Search for a MCV covering all the attributes and expressions.
+ *
+ * We pick the statistics to use for join estimation. The statistics has to
+ * be an MCV, and we require it to match all the join conditions, because it
+ * makes the estimation simpler.
+ *
+ * If there are multiple candidate statistics (matching all join clauses),
+ * we pick the smallest one, and we also consider additional conditions on
+ * the base relations to restrict the MCV items used for estimation (using
+ * conditional probability).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. We could relax this and and require fewer matches (at least two,
+ * if counting the additional conditions), and we might even apply multiple
+ * statistics etc. But that would require matching statistics on both sides of
+ * the join, while now we simply know the statistics match. We don't really
+ * expect many candidate MCVs, so this simple approach seems sufficient. And
+ * the joins usually use only one or two columns, so there's not much room
+ * for applying multiple statistics anyway.
+ */
+StatisticExtInfo *
+statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+						  Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, NULL))
+			continue;
+
+		/* If there's no matching MCV yet, keep this one. */
+		if (!mcv)
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/*
+		 * OK, we have two candidate statistics and we need to decide which one
+		 * to keep. We'll use two simple heuristics:
+		 *
+		 * (a) We prefer smaller statistics (fewer columns), on the assumption
+		 * that it represents a larger fraction of the data (due to having fewer
+		 * combinations with higher counts).
+		 *
+		 * (b) If the statistics covers some additional conditions for the rels,
+		 * that may help with considering additional dependencies between the
+		 * tables.
+		 *
+		 * Of course, those two heuristict are somewhat contradictory - smaller
+		 * stats are less likely to cover as many conditions as a larger one. We
+		 * consider the additional conditions first - if someone created such
+		 * statistics, there probably is a dependency worth considering.
+		 *
+		 * When inspecting the restrictions, we need to be careful - we don't
+		 * know which of them are compatible with extended stats, so we have to
+		 * run them through statext_is_compatible_clause first and then match
+		 * them.to the statistics.
+		 *
+		 * XXX Maybe we shouldn't pick statistics that covers just a single join
+		 * clause, without any additional conditions. In such case we could just
+		 * as well pick regular statistics for the column/expression, but it's
+		 * not clear if that actually exists (so we might reject the stats here
+		 * and then fail to find something simpler/better).
+		 */
+		conditions1 = statext_determine_join_restrictions(root, rel, stat);
+		conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+		/* if the new statistics covers more conditions, use it */
+		if (list_length(conditions1) > list_length(conditions2))
+		{
+			mcv = stat;
+			continue;
+		}
+
+		/* The statistics seem about equal, so just use the smaller one. */
+		if (bms_num_members(mcv->keys) + list_length(mcv->exprs) >
+			bms_num_members(stat->keys) + list_length(stat->exprs))
+		{
+			mcv = stat;
+		}
+	}
+
+	return mcv;
+}
+
+/*
+ * statext_determine_join_restrictions
+ *		Get restrictions on base relation, covered by the statistics.
+ *
+ * Returns a list of baserel restrictinfos, compatible with extended statistics
+ * and covered by the extended statistics.
+ *
+ * When using extended statistics to estimate joins, we can use conditions
+ * from base relations to calculate conditional probability
+ *
+ *    P(join clauses | baserel restrictions)
+ *
+ * which should be a better estimate than just P(join clauses). We want to pick
+ * the statistics covering the most such conditions.
+ */
+List *
+statext_determine_join_restrictions(PlannerInfo *root, RelOptInfo *rel,
+									StatisticExtInfo *info)
+{
+	ListCell   *lc;
+	List	   *conditions = NIL;
+
+	/* extract conditions that may be applied to the MCV list */
+	foreach (lc, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Bitmapset *indexes = NULL;
+		Bitmapset *attnums = NULL;
+		List *exprs = NIL;
+
+		/* clause has to be supported by MCV in general */
+		if (!statext_is_compatible_clause(root, (Node *) rinfo, rel->relid,
+										  &attnums, &exprs))
+			continue;
+
+		/*
+		 * clause is compatible in general, but is it actually covered
+		 * by this partiular statistics object?
+		 */
+		if (!bms_is_subset(attnums, info->keys) ||
+			!stat_covers_expressions(info, exprs, &indexes))
+			continue;
+
+		conditions = lappend(conditions, rinfo->clause);
+	}
+
+	return conditions;
+}
+
+/*
+ * statext_is_supported_join_clause
+ *		Check if a join clause may be estimated using extended stats.
+ *
+ * Determines if this is a join clause of the form (Expr op Expr) which may be
+ * estimated using extended statistics. Each side must reference just a single
+ * relation for now.
+ *
+ * Similar to treat_as_join_clause, but we place additional restrictions
+ * on the conditions.
+ */
+static bool
+statext_is_supported_join_clause(PlannerInfo *root, Node *clause,
+								 int varRelid, SpecialJoinInfo *sjinfo)
+{
+	Oid	oprsel;
+	RestrictInfo   *rinfo;
+	OpExpr		   *opclause;
+	ListCell	   *lc;
+
+	/*
+	 * evaluation as a restriction clause, either at scan node or forced
+	 *
+	 * XXX See treat_as_join_clause.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/* XXX Can we rely on always getting RestrictInfo here? */
+	if (!IsA(clause, RestrictInfo))
+		return false;
+
+	/* strip the RestrictInfo */
+	rinfo = (RestrictInfo *) clause;
+	clause = (Node *) rinfo->clause;
+
+	/* is it referencing multiple relations? */
+	if (bms_membership(rinfo->clause_relids) != BMS_MULTIPLE)
+		return false;
+
+	/* we only support simple operator clauses for now */
+	if (!is_opclause(clause))
+		return false;
+
+	opclause = (OpExpr *) clause;
+
+	/* for now we only support estimating equijoins */
+	oprsel = get_oprjoin(opclause->opno);
+
+	/* has to be an equality condition */
+	if (oprsel != F_EQJOINSEL)
+		return false;
+
+	/*
+	 * Make sure we're not mixing vars from multiple relations on the same
+	 * side, like
+	 *
+	 *   (t1.a + t2.a) = (t1.b + t2.b)
+	 *
+	 * which is still technically an opclause, but we can't match it to
+	 * extended statistics in a simple way.
+	 *
+	 * XXX This also means we require rinfo->clause_relids to have 2 rels.
+	 *
+	 * XXX Also check it's not expression on system attributes, which we
+	 * don't allow in extended statistics.
+	 *
+	 * XXX Although maybe we could allow cases that combine expressions
+	 * from both relations on either side? Like (t1.a + t2.b = t1.c - t2.d)
+	 * or something like that. We could do "cartesian product" of the MCV
+	 * stats and restrict it using this condition.
+	 */
+	foreach (lc, opclause->args)
+	{
+		Bitmapset *varnos = NULL;
+		Node *expr = (Node *) lfirst(lc);
+
+		varnos = pull_varnos(root, expr);
+
+		/*
+		 * No argument should reference more than just one relation.
+		 *
+		 * This effectively means each side references just two relations.
+		 * If there's no relation on one side, it's a Const, and the other
+		 * side has to be either Const or Expr with a single rel. In which
+		 * case it can't be a join clause.
+		 */
+		if (bms_num_members(varnos) > 1)
+			return false;
+
+		/*
+		 * XXX Maybe check that both relations have extended statistics
+		 * (no point in considering the clause as useful without it). But
+		 * we'll do that check later anyway, so keep this cheap.
+		 */
+	}
+
+	return true;
+}
+
+/*
+ * statext_try_join_estimates
+ *		Checks if it's worth considering extended stats on join estimates.
+ *
+ * This is supposed to be a quick/cheap check to decide whether to expend
+ * more effort on applying extended statistics to join clauses.
+ */
+bool
+statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+						   JoinType jointype, SpecialJoinInfo *sjinfo,
+						   Bitmapset *estimatedclauses)
+{
+	int			listidx;
+	int			k;
+	ListCell   *lc;
+	Bitmapset  *relids = NULL;
+
+	/*
+	 * XXX Not having these values means treat_as_join_clause returns false,
+	 * so we're not supposed to handle join clauses here. So just bail out.
+	 */
+	if ((varRelid != 0) || (sjinfo == NULL))
+		return false;
+
+	/*
+	 * Check if there are any unestimated join clauses, collect relids.
+	 *
+	 * XXX Currently this only allows simple OpExpr equality clauses with each
+	 * argument refering to single relation, AND-ed together. Maybe we could
+	 * relax this in the future, e.g. to allow more complex (deeper) expressions
+	 * and to allow OR-ed join clauses too. And maybe supporting inequalities.
+	 *
+	 * Handling more complex expressions seems simple - we already do that for
+	 * baserel estimates by building the match bitmap recursively, and we could
+	 * do something similar for combinations of MCV items (a bit like building
+	 * a single bit in the match bitmap). The challenge is what to do about the
+	 * part not represented by MCV, which is now based on ndistinct estimates.
+	 */
+	listidx = -1;
+	foreach (lc, clauses)
+	{
+		Node *clause = (Node *) lfirst(lc);
+		RestrictInfo *rinfo;
+
+		/* needs to happen before skipping any clauses */
+		listidx++;
+
+		/* Skip clauses that were already estimated. */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Skip clauses that are not join clauses or that we don't know
+		 * how to handle estimate using extended statistics.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/*
+		 * XXX We're guaranteed to have RestrictInfo thanks to the checks
+		 * in statext_is_supported_join_clause.
+		 */
+		rinfo = (RestrictInfo *) clause;
+
+		/* Collect relids from all usable clauses. */
+		relids = bms_union(relids, rinfo->clause_relids);
+	}
+
+	/* no join clauses found, don't try applying extended stats */
+	if (bms_num_members(relids) == 0)
+		return false;
+
+	/*
+	 * We expect either 0 or >= 2 relids, a case with 1 relid in join clauses
+	 * should be impossible. And we just ruled out 0, so there are at least 2.
+	 */
+	Assert(bms_num_members(relids) >= 2);
+
+	/*
+	 * Check that at least some of the rels referenced by the clauses have
+	 * extended stats.
+	 *
+	 * XXX Maybe we should check how many rels have stats, and cross-check how
+	 * compatible they are (e.g. that both have MCVs, etc.). We might also
+	 * cross-check the exact joined pairs of rels, but it's is supposed to be
+	 * a cheap check, so maybe better leave that for later.
+	 *
+	 * XXX We could also check the number of parameters in each rel to consider
+	 * extended stats. If there's just a single attribute, it's pointless to use
+	 * extended statistics. OTOH we can also consider restriction clauses from
+	 * baserestrictinfo and use them to calculate conditional probabilities.
+	 */
+	k = -1;
+	while ((k = bms_next_member(relids, k)) >= 0)
+	{
+		RelOptInfo *rel = find_base_rel(root, k);
+		if (rel->statlist)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Information about two joined relations, along with the join clauses between.
+ */
+typedef struct JoinPairInfo
+{
+	Bitmapset  *rels;
+	List	   *clauses;
+} JoinPairInfo;
+
+/*
+ * statext_build_join_pairs
+ *		Extract pairs of joined rels with join clauses for each pair.
+ *
+ * Walks the remaining (not yet estimated) clauses, and splits them into
+ * lists for each pair of joined relations. Returns NULL if there are no
+ * suitable join pairs that might be estimated using extended stats.
+ *
+ * XXX It's possible there are join clauses, but the clauses are not
+ * supported by the extended stats machinery (we only support opclauses
+ * with F_EQJOINSEL selectivity function at the moment).
+ */
+static JoinPairInfo *
+statext_build_join_pairs(PlannerInfo *root, List *clauses, int varRelid,
+						 JoinType jointype, SpecialJoinInfo *sjinfo,
+						 Bitmapset *estimatedclauses, int *npairs)
+{
+	int				cnt;
+	int				listidx;
+	JoinPairInfo   *info;
+	ListCell	   *lc;
+
+	/*
+	 * Assume each clause is for a different pair of relations (some of them
+	 * might be already estimated, but meh - there shouldn't be too many of
+	 * them and it's cheaper than repalloc.
+	 */
+	info = (JoinPairInfo *) palloc0(sizeof(JoinPairInfo) * list_length(clauses));
+	cnt = 0;
+
+	listidx = -1;
+	foreach(lc, clauses)
+	{
+		int				i;
+		bool			found;
+		Node		   *clause = (Node *) lfirst(lc);
+		RestrictInfo   *rinfo;
+
+		listidx++;
+
+		/* skip already estimated clauses */
+		if (bms_is_member(listidx, estimatedclauses))
+			continue;
+
+		/*
+		 * Make sure the clause is a join clause of a supported shape (at
+		 * the moment we support just (Expr op Expr) clauses with each
+		 * side referencing just a single relation.
+		 */
+		if (!statext_is_supported_join_clause(root, clause, varRelid, sjinfo))
+			continue;
+
+		/* statext_is_supported_join_clause guarantees RestrictInfo */
+		rinfo = (RestrictInfo *) clause;
+		clause = (Node *) rinfo->clause;
+
+		/* search for a matching join pair */
+		found = false;
+		for (i = 0; i < cnt; i++)
+		{
+			if (bms_is_subset(rinfo->clause_relids, info[i].rels))
+			{
+				info[i].clauses = lappend(info[i].clauses, clause);
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			info[cnt].rels = rinfo->clause_relids;
+			info[cnt].clauses = lappend(info[cnt].clauses, clause);
+			cnt++;
+		}
+	}
+
+	if (cnt == 0)
+		return NULL;
+
+	*npairs = cnt;
+	return info;
+}
+
+/*
+ * extract_relation_info
+ *		Extract information about a relation in a join pair.
+ *
+ * The relation is identified by index (generally 0 or 1), and picks extended
+ * statistics covering the join clauses and baserel restrictions.
+ *
+ * XXX Can we have cases with indexes above 1? Probably for clauses mixing
+ * vars from 3 relations, but statext_is_supported_join_clause rejects those.
+ */
+static RelOptInfo *
+extract_relation_info(PlannerInfo *root, JoinPairInfo *info, int index,
+					  StatisticExtInfo **stat)
+{
+	int			k;
+	int			relid;
+	RelOptInfo *rel;
+	ListCell   *lc;
+	List	   *exprs = NIL;
+
+	Bitmapset  *attnums = NULL;
+
+	Assert((index >= 0) && (index <= 1));
+
+	k = -1;
+	while (index >= 0)
+	{
+		k = bms_next_member(info->rels, k);
+		if (k < 0)
+			elog(ERROR, "failed to extract relid");
+
+		relid = k;
+		index--;
+	}
+
+	rel = find_base_rel(root, relid);
+
+	/*
+	 * Walk the clauses for this join pair, and extract expressions about
+	 * the relation identified by index / relid. For simple Vars we extract
+	 * the attnum. Otherwise we keep the whole expression.
+	 */
+	foreach (lc, info->clauses)
+	{
+		ListCell *lc2;
+		Node *clause = (Node *) lfirst(lc);
+		OpExpr *opclause = (OpExpr *) clause;
+
+		/* only opclauses supported for now */
+		Assert(is_opclause(clause));
+
+		foreach (lc2, opclause->args)
+		{
+			Node *arg = (Node *) lfirst(lc2);
+			Bitmapset *varnos = NULL;
+
+			/* plain Var references (boolean Vars or recursive checks) */
+			if (IsA(arg, Var))
+			{
+				Var		   *var = (Var *) arg;
+
+				/* Ignore vars from other relations. */
+				if (var->varno != relid)
+					continue;
+
+				/* we also better ensure the Var is from the current level */
+				if (var->varlevelsup > 0)
+					continue;
+
+				/* Also skip system attributes (we don't allow stats on those). */
+				if (!AttrNumberIsForUserDefinedAttr(var->varattno))
+					elog(ERROR, "unexpected system attribute");
+
+				attnums = bms_add_member(attnums, var->varattno);
+
+				/* Done, process the next argument. */
+				continue;
+			}
+
+			/*
+			 * OK, it's a more complex expression, so check if it matches
+			 * the relid and maybe keep it as a whole. It should be
+			 * compatible because we already checked it when building the
+			 * join pairs.
+			 */
+			varnos = pull_varnos(root, arg);
+
+			if (relid == bms_singleton_member(varnos))
+				exprs = lappend(exprs, arg);
+		}
+	}
+
+	*stat = statext_find_matching_mcv(root, rel, attnums, exprs);
+
+	return rel;
+}
+
+/*
+ * get_expression_for_rel
+ *		Extract expression for a given relation from the join clause.
+ *
+ * Given a join clause supported by the extended statistics (currently that
+ * means just OpExpr clauses with each argument referencing single rel),
+ * return either the left or right argument expression for the rel.
+ *
+ * XXX This should probably return a flag identifying whether it's the
+ * left or right argument.
+ */
+static Node *
+get_expression_for_rel(PlannerInfo *root, RelOptInfo *rel, Node *clause)
+{
+	OpExpr *opexpr;
+	Node   *expr;
+
+	/*
+	 * Strip the RestrictInfo node, get the actual clause.
+	 *
+	 * XXX Not sure if we need to care about removing other node types
+	 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+	 * matches this, but maybe we need to relax it?
+	 */
+	if (IsA(clause, RestrictInfo))
+		clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+	opexpr = (OpExpr *) clause;
+
+	/* Make sure we have the expected node type. */
+	Assert(is_opclause(clause));
+	Assert(list_length(opexpr->args) == 2);
+
+	/* FIXME strip relabel etc. the way examine_opclause_args does */
+	expr = linitial(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	expr = lsecond(opexpr->args);
+	if (bms_singleton_member(pull_varnos(root, expr)) == rel->relid)
+		return expr;
+
+	return NULL;
+}
+
+/*
+ * statext_clauselist_join_selectivity
+ *		Use extended stats to estimate join clauses.
+ *
+ * XXX In principle, we should not restrict this to cases with multiple
+ * join clauses - we should consider dependencies with conditions at the
+ * base relations, i.e. calculate P(join clause | base restrictions).
+ * But currently that does not happen, because clauselist_selectivity_ext
+ * treats a single clause as a special case (and we don't apply extended
+ * statistics in that case yet).
+ */
+Selectivity
+statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+									JoinType jointype, SpecialJoinInfo *sjinfo,
+									Bitmapset **estimatedclauses)
+{
+	int			i;
+	int			listidx;
+	Selectivity	s = 1.0;
+
+	JoinPairInfo *info;
+	int				ninfo;
+
+	if (!clauses)
+		return 1.0;
+
+	/* extract pairs of joined relations from the list of clauses */
+	info = statext_build_join_pairs(root, clauses, varRelid, jointype, sjinfo,
+									*estimatedclauses, &ninfo);
+
+	/* no useful join pairs */
+	if (!info)
+		return 1.0;
+
+	/*
+	 * Process the join pairs, try to find a matching MCV on each side.
+	 *
+	 * XXX The basic principle is quite similar to eqjoinsel_inner, i.e.
+	 * we try to find a MCV on both sides of the join, and use it to get
+	 * better join estimate. It's a bit more complicated, because there
+	 * might be multiple MCV lists, we also need ndistinct estimate, and
+	 * there may be interesting baserestrictions too.
+	 *
+	 * XXX At the moment we only handle the case with matching MCVs on
+	 * both sides, but it'd be good to also handle case with just ndistinct
+	 * statistics improving ndistinct estimates.
+	 *
+	 * XXX We might also handle cases with a regular MCV on one side and
+	 * an extended MCV on the other side.
+	 *
+	 * XXX Perhaps it'd be good to also handle case with one side only
+	 * having "regular" statistics (e.g. MCV), especially in cases with
+	 * no conditions on that side of the join (where we can't use the
+	 * extended MCV to calculate conditional probability).
+	 */
+	for (i = 0; i < ninfo; i++)
+	{
+		ListCell *lc;
+
+		RelOptInfo *rel1;
+		RelOptInfo *rel2;
+
+		StatisticExtInfo *stat1;
+		StatisticExtInfo *stat2;
+
+		/* extract info about the first relation */
+		rel1 = extract_relation_info(root, &info[i], 0, &stat1);
+
+		/* extract info about the second relation */
+		rel2 = extract_relation_info(root, &info[i], 1, &stat2);
+
+		/*
+		 * We can handle three basic cases:
+		 *
+		 * a) Extended stats (with MCV) on both sides is an ideal case, and we
+		 * can simply combine the two MCVs, possibly with additional conditions
+		 * from the relations.
+		 *
+		 * b) Extended stats on one side, regular MCV on the other side (this
+		 * means there's just one join clause / expression). It also means the
+		 * extended stats likely covers at least one extra condition, otherwise
+		 * we could just use regular statistics. We can combine the stats just
+		 * similarly to (a).
+		 *
+		 * c) No extended stats with MCV. If there are multiple join clauses,
+		 * we can try using ndistinct coefficients and do what esjoinsel does.
+		 *
+		 * If none of these applies, we fallback to the regular selectivity
+		 * estimation in eqjoinsel.
+		 */
+		if (stat1 && stat2)
+		{
+			s *= mcv_combine_extended(root, rel1, rel2, stat1, stat2, info[i].clauses);
+		}
+		else if (stat1 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel2, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel1, stat1, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else if (stat2 && (list_length(info[i].clauses) == 1))
+		{
+			/* try finding MCV on the other relation */
+			VariableStatData	vardata;
+			AttStatsSlot		sslot;
+			Form_pg_statistic	stats = NULL;
+			bool				have_mcvs = false;
+			Node			   *clause = (Node *) linitial(info[i].clauses);
+			Node			   *expr = get_expression_for_rel(root, rel1, clause);
+			double				nd;
+			bool				isdefault;
+
+			examine_variable(root, expr, 0, &vardata);
+
+			nd = get_variable_numdistinct(&vardata, &isdefault);
+
+			memset(&sslot, 0, sizeof(sslot));
+
+			if (HeapTupleIsValid(vardata.statsTuple))
+			{
+				/* note we allow use of nullfrac regardless of security check */
+				stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+				/* FIXME should this call statistic_proc_security_check like eqjoinsel? */
+				have_mcvs = get_attstatsslot(&sslot, vardata.statsTuple,
+											 STATISTIC_KIND_MCV, InvalidOid,
+											 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+			}
+
+			if (have_mcvs)
+				s *= mcv_combine_simple(root, rel2, stat2, &sslot,
+										stats->stanullfrac, nd, isdefault, clause);
+
+			free_attstatsslot(&sslot);
+
+			ReleaseVariableStats(vardata);
+
+			/* no stats, don't mark the clauses as estimated */
+			if (!have_mcvs)
+				continue;
+		}
+		else
+			continue;
+
+		/*
+		 * Now mark all the clauses for this join pair as estimated.
+		 *
+		 * XXX Maybe track the indexes in JoinPairInfo, so that we can
+		 * simply union the two bitmaps, without the extra matching.
+		 */
+		foreach (lc, info->clauses)
+		{
+			Node *clause = (Node *) lfirst(lc);
+			ListCell *lc2;
+
+			listidx = -1;
+			foreach (lc2, clauses)
+			{
+				Node *clause2 = (Node *) lfirst(lc2);
+				listidx++;
+
+				Assert(IsA(clause2, RestrictInfo));
+
+				clause2 = (Node *) ((RestrictInfo *) clause2)->clause;
+
+				if (equal(clause, clause2))
+				{
+					*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+					break;
+				}
+			}
+		}
+	}
+
+	return s;
+}
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index b350fc5f7b..b0e877c92e 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -24,6 +24,7 @@
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
+#include "optimizer/optimizer.h"
 #include "statistics/extended_stats_internal.h"
 #include "statistics/statistics.h"
 #include "utils/array.h"
@@ -2156,3 +2157,756 @@ mcv_clause_selectivity_or(PlannerInfo *root, StatisticExtInfo *stat,
 
 	return s;
 }
+
+/*
+ * statext_compare_mcvs
+ *		Calculte join selectivity using extended statistics, similarly to
+ *		eqjoinsel_inner.
+ *
+ * Considers restrictions on base relations too, essentially computing a
+ * conditional probability
+ *
+ *	P(join clauses | baserestrictinfos on either side)
+ *
+ * Compared to eqjoinsel_inner there's a couple problems. With per-column MCV
+ * lists it's obvious that the number of distinct values not covered by the MCV
+ * is (ndistinct - size(MCV)). With multi-column MCVs it's not that simple,
+ * particularly when the conditions are on a subset of the MCV attributes and/or
+ * NULLs are involved. E.g. with MCV (a,b,c) and conditions on (a,b), it's not
+ * clear if the number of (a,b) combinations not covered by the MCV is
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b))
+ *
+ * where ndistinct_mcv(a,b) is the number of distinct (a,b) combinations
+ * included in the MCV list. These combinations may be present in the rest
+ * of the data (outside MCV), just with some extra values in "c". So in
+ * principle there may be between
+ *
+ * (ndistinct(a,b) - ndistinct_mcv(a,b)) and ndistinct(a,b)
+ *
+ * distinct values in the part of the data not covered by the MCV. So we need
+ * to pick something in between, there's no way to calculate this accurately.
+ */
+Selectivity
+mcv_combine_extended(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+					 StatisticExtInfo *stat1, StatisticExtInfo *stat2,
+					 List *clauses)
+{
+	ListCell   *lc;
+
+	MCVList    *mcv1,
+			   *mcv2;
+	int			idx,
+				i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+	List   *conditions1 = NIL,
+		   *conditions2 = NIL;
+	bool   *cmatches1 = NULL,
+		   *cmatches2 = NULL;
+
+	double	csel1 = 1.0,
+			csel2 = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two relations */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo   *opprocs;
+	int		   *indexes1,
+			   *indexes2;
+	bool	   *reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert((stat1->kind = STATS_EXT_MCV) && (stat2->kind = STATS_EXT_MCV));
+
+	mcv1 = statext_mcv_load(stat1->statOid);
+	mcv2 = statext_mcv_load(stat2->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv1 && mcv2);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions1 = statext_determine_join_restrictions(root, rel1, stat1);
+	conditions2 = statext_determine_join_restrictions(root, rel2, stat2);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions1)
+	{
+		cmatches1 = mcv_get_match_bitmap(root, conditions1,
+										 stat1->keys, stat1->exprs,
+										 mcv1, false);
+		csel1 = clauselist_selectivity(root, conditions1, rel1->relid, 0, NULL);
+	}
+
+	if (conditions2)
+	{
+		cmatches2 = mcv_get_match_bitmap(root, conditions2,
+										 stat2->keys, stat2->exprs,
+										 mcv2, false);
+		csel2 = clauselist_selectivity(root, conditions2, rel2->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv1->nitems);
+	matches2 = (bool *) palloc0(sizeof(bool) * mcv2->nitems);
+
+	/*
+	 * Initialize information about clauses and how they match to the MCV
+	 * stats we picked. We do this only once before processing the lists,
+	 * so that we don't have to do that for each MCV item or so.
+	 */
+	opprocs = (FmgrInfo *) palloc(sizeof(FmgrInfo) * list_length(clauses));
+	indexes1 = (int *) palloc(sizeof(int) * list_length(clauses));
+	indexes2 = (int *) palloc(sizeof(int) * list_length(clauses));
+	reverse = (bool *) palloc(sizeof(bool) * list_length(clauses));
+
+	idx = 0;
+	foreach (lc, clauses)
+	{
+		Node	   *clause = (Node *) lfirst(lc);
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opprocs[idx]);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if ((bms_singleton_member(relids1) == rel1->relid) &&
+			(bms_singleton_member(relids2) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			reverse[idx] = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if ((bms_singleton_member(relids2) == rel1->relid) &&
+				 (bms_singleton_member(relids1) == rel2->relid))
+		{
+			Oid		collid;
+
+			indexes1[idx] = mcv_match_expression(expr2,
+												 stat2->keys, stat2->exprs,
+												 &collid);
+			indexes2[idx] = mcv_match_expression(expr1,
+												 stat1->keys, stat1->exprs,
+												 &collid);
+			reverse[idx] = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((indexes1[idx] >= 0) &&
+			   (indexes1[idx] < bms_num_members(stat1->keys) + list_length(stat1->exprs)));
+
+		Assert((indexes2[idx] >= 0) &&
+			   (indexes2[idx] < bms_num_members(stat2->keys) + list_length(stat2->exprs)));
+
+		idx++;
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset  of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform somthing like
+	 * merge join. Or we might calculate hash from the join columns, and then
+	 * compare this (to eliminate most expensive equality functions).
+	 */
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		bool	has_nulls;
+
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		/*
+		 * Check if any value in the first MCV item is NULL, because it'll be
+		 * mismatch anyway.
+		 *
+		 * XXX This might not work for some join clauses, e.g. IS NOT DISTINCT
+		 * FROM, but those are currently not considered compatible (we only
+		 * allow OpExpr at the moment).
+		 */
+		has_nulls = false;
+		for (j = 0; j < list_length(clauses); j++)
+			has_nulls |= mcv1->items[i].isnull[indexes1[j]];
+
+		if (has_nulls)
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < mcv2->nitems; j++)
+		{
+			int			idx;
+			bool		items_match = true;
+
+			/* skip items eliminated by restrictions on rel2 */
+			if (cmatches2 && !cmatches2[j])
+				continue;
+
+			/*
+			 * XXX We can't skip based on existing matches2 value, because there
+			 * may be duplicates in the first MCV.
+			 */
+
+			/*
+			 * Evaluate if all the join clauses match between the two MCV items.
+			 *
+			 * XXX We might optimize the order of evaluation, using the costs of
+			 * operator functions for individual columns. It does depend on the
+			 * number of distinct values, etc.
+			 */
+			idx = 0;
+			foreach (lc, clauses)
+			{
+				bool	match;
+				int		index1 = indexes1[idx],
+						index2 = indexes2[idx];
+				Datum	value1,
+						value2;
+				bool	reverse_args = reverse[idx];
+
+				/* If either value is null, it's a mismatch */
+				if (mcv2->items[j].isnull[index2])
+					match = false;
+				else
+				{
+					value1 = mcv1->items[i].values[index1];
+					value2 = mcv2->items[j].values[index2];
+
+					/*
+					 * Careful about order of parameters. For same-type equality
+					 * that should not matter, but easy enough.
+					 *
+					 * FIXME Use appropriate collation.
+					 */
+					if (reverse_args)
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value2, value1));
+					else
+						match = DatumGetBool(FunctionCall2Coll(&opprocs[idx],
+															   InvalidOid,
+															   value1, value2));
+				}
+
+				items_match &= match;
+
+				if (!items_match)
+					break;
+
+				idx++;
+			}
+
+			if (items_match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv1->items[i].frequency * mcv2->items[j].frequency;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv1->nitems; i++)
+	{
+		mcvfreq1 += mcv1->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches1 && !cmatches1[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv1->items[i].frequency;
+		else
+			unmatchfreq1 += mcv1->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < mcv2->nitems; i++)
+	{
+		mcvfreq2 += mcv2->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches2 && !cmatches2[i])
+			continue;
+
+		if (matches2[i])
+			matchfreq2 += mcv2->items[i].frequency;
+		else
+			unmatchfreq2 += mcv2->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel1->rows, NULL, NULL);
+	nd2 = estimate_num_groups(root, exprs2, rel2->rows, NULL, NULL);
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel1;
+	nd2 *= csel2;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
+
+
+/*
+ * statext_compare_simple
+ *		Calculte join selectivity using a combination of extended statistics
+ * MCV on one side, and simple per-column MCV on the other.
+ *
+ * Most of the mcv_combine_extended comment applies here too, but we can make
+ * some simplifications because we know the second (per-column) MCV is simpler,
+ * contains no NULL or duplicate values, etc.
+ */
+Selectivity
+mcv_combine_simple(PlannerInfo *root, RelOptInfo *rel, StatisticExtInfo *stat,
+				   AttStatsSlot *sslot, double stanullfrac, double nd,
+				   bool isdefault, Node *clause)
+{
+	MCVList    *mcv;
+	int			i,
+				j;
+	Selectivity s = 0;
+
+	/* match bitmaps and selectivity for baserel conditions (if any) */
+	List   *conditions = NIL;
+	bool   *cmatches = NULL;
+
+	double	csel = 1.0;
+
+	bool   *matches1 = NULL,
+		   *matches2 = NULL;
+
+	/* estimates for the two sides */
+	double	matchfreq1,
+			unmatchfreq1,
+			otherfreq1,
+			mcvfreq1,
+			nd1,
+			totalsel1;
+
+	double 	matchfreq2,
+			unmatchfreq2,
+			otherfreq2,
+			mcvfreq2,
+			nd2,
+			totalsel2;
+
+	List   *exprs1 = NIL,
+		   *exprs2 = NIL;
+
+	/* info about clauses and how they match to MCV stats */
+	FmgrInfo	opproc;
+	int			index;
+	bool		reverse;
+
+	/* we picked the stats so that they have MCV enabled */
+	Assert(stat->kind = STATS_EXT_MCV);
+
+	mcv = statext_mcv_load(stat->statOid);
+
+	/* should only get here with MCV on both sides */
+	Assert(mcv);
+
+	/* Determine which baserel clauses to use for conditional probability. */
+	conditions = statext_determine_join_restrictions(root, rel, stat);
+
+	/*
+	 * Calculate match bitmaps for restrictions on either side of the join
+	 * (there may be none, in which case this will be NULL).
+	 */
+	if (conditions)
+	{
+		cmatches = mcv_get_match_bitmap(root, conditions,
+										 stat->keys, stat->exprs,
+										 mcv, false);
+		csel = clauselist_selectivity(root, conditions, rel->relid, 0, NULL);
+	}
+
+	/*
+	 * Match bitmaps for matches between MCV elements. By default there
+	 * are no matches, so we set all items to 0.
+	 */
+	matches1 = (bool *) palloc0(sizeof(bool) * mcv->nitems);
+
+	/* Matches for the side with just regular single-column MCV. */
+	matches2 = (bool *) palloc0(sizeof(bool) * sslot->nvalues);
+
+	/*
+	 * Initialize information about the clause and how it matches to the
+	 * extended stats we picked. We do this only once before processing
+	 * the lists, so that we don't have to do that for each item or so.
+	 */
+	{
+		OpExpr	   *opexpr;
+		Node	   *expr1,
+				   *expr2;
+		Bitmapset  *relids1,
+				   *relids2;
+
+		/*
+		 * Strip the RestrictInfo node, get the actual clause.
+		 *
+		 * XXX Not sure if we need to care about removing other node types
+		 * too (e.g. RelabelType etc.). statext_is_supported_join_clause
+		 * matches this, but maybe we need to relax it?
+		 */
+		if (IsA(clause, RestrictInfo))
+			clause = (Node *) ((RestrictInfo *) clause)->clause;
+
+		opexpr = (OpExpr *) clause;
+
+		/* Make sure we have the expected node type. */
+		Assert(is_opclause(clause));
+		Assert(list_length(opexpr->args) == 2);
+
+		fmgr_info(get_opcode(opexpr->opno), &opproc);
+
+		/* FIXME strip relabel etc. the way examine_opclause_args does */
+		expr1 = linitial(opexpr->args);
+		expr2 = lsecond(opexpr->args);
+
+		/* determine order of clauses (rel1 op rel2) or (rel2 op rel1) */
+		relids1 = pull_varnos(root, expr1);
+		relids2 = pull_varnos(root, expr2);
+
+		if (bms_singleton_member(relids1) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr1, stat->keys, stat->exprs,
+										 &collid);
+			reverse = false;
+
+			exprs1 = lappend(exprs1, expr1);
+			exprs2 = lappend(exprs2, expr2);
+		}
+		else if (bms_singleton_member(relids2) == rel->relid)
+		{
+			Oid		collid;
+
+			index = mcv_match_expression(expr2, stat->keys, stat->exprs,
+										 &collid);
+			reverse = true;
+
+			exprs1 = lappend(exprs1, expr2);
+			exprs2 = lappend(exprs2, expr1);
+		}
+		else
+			/* should never happen */
+			Assert(false);
+
+		Assert((index >= 0) &&
+			   (index < bms_num_members(stat->keys) + list_length(stat->exprs)));
+	}
+
+	/*
+	 * Match items between the two MCV lists.
+	 *
+	 * We don't know if the join conditions match all attributes in the MCV, the
+	 * overlap may be just on a subset  of attributes, e.g. (a,b,c) vs. (b,c,d).
+	 * So there may be multiple matches on either side. So we can't optimize by
+	 * aborting the inner loop after the first match, etc.
+	 *
+	 * XXX We can skip the items eliminated by the base restrictions, of course.
+	 *
+	 * XXX We might optimize this in two ways. We might sort the MCV items on
+	 * both sides using the "join" attributes, and then perform somthing like
+	 * merge join. Or we might calculate hash from the join columns, and then
+	 * compare this (to eliminate most expensive equality functions).
+	 */
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		/* skip items eliminated by restrictions on rel1 */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		/*
+		 * We can check mcv1->items[i].isnull[index1] here, because it'll be a
+		 * mismatch anyway (the simple MCV does not contain NULLs).
+		 */
+		if (mcv->items[i].isnull[index])
+			continue;
+
+		/* find matches in the second MCV list */
+		for (j = 0; j < sslot->nvalues; j++)
+		{
+			bool	match;
+			Datum	value1 = mcv->items[i].values[index];
+			Datum	value2 = sslot->values[j];
+
+			/*
+			 * Evaluate the join clause between the two MCV lists. We don't
+			 * need to deal with NULL values here - we've already checked for
+			 * NULL in the extended statistics earlier, and the simple MCV
+			 * does not contain NULL values.
+			*
+			 * Careful about order of parameters. For same-type equality
+			 * that should not matter, but easy enough.
+			 *
+			 * FIXME Use appropriate collation.
+			 */
+			if (reverse)
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value2, value1));
+			else
+				match = DatumGetBool(FunctionCall2Coll(&opproc,
+													   InvalidOid,
+													   value1, value2));
+
+			if (match)
+			{
+				/* XXX Do we need to do something about base frequency? */
+				matches1[i] = matches2[j] = true;
+				s += mcv->items[i].frequency * sslot->numbers[j];
+
+				/*
+				 * We know there can be just a single match in the regular
+				 * MCV list, so we can abort the inner loop.
+				 */
+				break;
+			}
+		}
+	}
+
+	matchfreq1 = unmatchfreq1 = mcvfreq1 = 0.0;
+	for (i = 0; i < mcv->nitems; i++)
+	{
+		mcvfreq1 += mcv->items[i].frequency;
+
+		/* ignore MCV items eliminated by baserel conditions */
+		if (cmatches && !cmatches[i])
+			continue;
+
+		if (matches1[i])
+			matchfreq1 += mcv->items[i].frequency;
+		else
+			unmatchfreq1 += mcv->items[i].frequency;
+	}
+
+	/* not represented by the MCV */
+	otherfreq1 = 1.0 - mcvfreq1;
+
+	matchfreq2 = unmatchfreq2 = mcvfreq2 = 0.0;
+	for (i = 0; i < sslot->nvalues; i++)
+	{
+		mcvfreq2 += sslot->numbers[i];
+
+		if (matches2[i])
+			matchfreq2 += sslot->numbers[i];
+		else
+			unmatchfreq2 += sslot->numbers[i];
+	}
+
+	/* not represented by the MCV */
+	otherfreq2 = 1.0 - mcvfreq2;
+
+	/*
+	 * Correction for MCV parts eliminated by the conditions.
+	 *
+	 * We need to be careful about cases where conditions eliminated all
+	 * the MCV items. We must not divide by 0.0, because that would either
+	 * produce bogus value or trigger division by zero. Instead we simply
+	 * set the selectivity to 0.0, because there can't be any matches.
+	 */
+	if ((matchfreq1 + unmatchfreq1) > 0)
+		s = s * mcvfreq1 / (matchfreq1 + unmatchfreq1);
+	else
+		s = 0.0;
+
+	if ((matchfreq2 + unmatchfreq2) > 0)
+		s = s * mcvfreq2 / (matchfreq2 + unmatchfreq2);
+	else
+		s = 0.0;
+
+	/* calculate ndistinct for the expression in join clauses for each rel */
+	nd1 = estimate_num_groups(root, exprs1, rel->rows, NULL, NULL);
+	nd2 = nd;
+
+	/*
+	 * Consider the part of the data not represented by the MCV lists.
+	 *
+	 * XXX this is a bit bogus, because we don't know what fraction of
+	 * distinct combinations is covered by the MCV list (we're only
+	 * dealing with some of the columns), so we can't use the same
+	 * formular as eqjoinsel_inner exactly. We just use the estimates
+	 * for the whole table - this is likely an overestimate, because
+	 * (a) items may repeat in the MCV list, if it has more columns,
+	 * and (b) some of the combinations may be present in non-MCV data.
+	 *
+	 * Moreover, we need to look at the conditions. For now we simply
+	 * assume the conditions affect the distinct groups, and use that.
+	 *
+	 * XXX We might calculate the number of distinct groups in the MCV,
+	 * and then use something between (nd1 - distinct(MCV)) and (nd1),
+	 * which are the possible extreme values, assuming the estimates
+	 * are accurate. Maybe mean or geometric mean would work?
+	 *
+	 * XXX Not sure multiplying ndistinct with probabilities is good.
+	 * Maybe we should do something more like estimate_num_groups?
+	 */
+	nd1 *= csel;
+
+	totalsel1 = s;
+	totalsel1 += unmatchfreq1 * otherfreq2 / nd2;
+	totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) / nd2;
+
+//	if (nd2 > mcvb->nitems)
+//		totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - mcvb->nitems);
+//	if (nd2 > nmatches)
+//		totalsel1 += otherfreq1 * (otherfreq2 + unmatchfreq2) /
+//			(nd2 - nmatches);
+
+	totalsel2 = s;
+	totalsel2 += unmatchfreq2 * otherfreq1 / nd1;
+	totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) / nd1;
+
+//	if (nd1 > mcva->nitems)
+//		totalsel2 += unmatchfreq2 * otherfreq1 / (nd1 - mcva->nitems);
+//	if (nd1 > nmatches)
+//		totalsel2 += otherfreq2 * (otherfreq1 + unmatchfreq1) /
+//			(nd1 - nmatches);
+
+	s = Min(totalsel1, totalsel2);
+
+	return s;
+}
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index 55cd9252a5..1e51c54fef 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -15,6 +15,7 @@
 #define EXTENDED_STATS_INTERNAL_H
 
 #include "statistics/statistics.h"
+#include "utils/lsyscache.h"
 #include "utils/sortsupport.h"
 
 typedef struct
@@ -127,4 +128,23 @@ extern Selectivity mcv_clause_selectivity_or(PlannerInfo *root,
 											 Selectivity *overlap_basesel,
 											 Selectivity *totalsel);
 
+extern Selectivity mcv_combine_simple(PlannerInfo *root,
+									  RelOptInfo *rel,
+									  StatisticExtInfo *stat,
+									  AttStatsSlot *sslot,
+									  double stanullfrac,
+									  double nd, bool isdefault,
+									  Node *clause);
+
+extern Selectivity mcv_combine_extended(PlannerInfo *root,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										StatisticExtInfo *stat1,
+										StatisticExtInfo *stat2,
+										List *clauses);
+
+extern List *statext_determine_join_restrictions(PlannerInfo *root,
+												 RelOptInfo *rel,
+												 StatisticExtInfo *info);
+
 #endif							/* EXTENDED_STATS_INTERNAL_H */
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 326cf26fea..4bf27240f6 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -126,4 +126,16 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 												int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, int idx);
 
+extern StatisticExtInfo *statext_find_matching_mcv(PlannerInfo *root, RelOptInfo *rel,
+										   Bitmapset *attnums, List *exprs);
+
+extern bool statext_try_join_estimates(PlannerInfo *root, List *clauses, int varRelid,
+									   JoinType jointype, SpecialJoinInfo *sjinfo,
+									   Bitmapset *estimatedclauses);
+
+extern Selectivity statext_clauselist_join_selectivity(PlannerInfo *root, List *clauses,
+													   int varRelid,
+													   JoinType jointype, SpecialJoinInfo *sjinfo,
+													   Bitmapset **estimatedclauses);
+
 #endif							/* STATISTICS_H */
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index c60ba45aba..8846d55c23 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -2974,6 +2974,173 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+       500 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+       250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+        75 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+       100 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+      1250 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+      1000 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+ estimated | actual 
+-----------+--------
+    100000 | 100000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+ estimated | actual 
+-----------+--------
+     30000 |  30000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+      2500 |  50000
+(1 row)
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+ estimated | actual 
+-----------+--------
+     50000 |  50000
+(1 row)
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 6fb37962a7..71e59b5279 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1500,6 +1500,72 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+
+-- Test join estimates.
+CREATE TABLE join_test_1 (a int, b int, c int);
+CREATE TABLE join_test_2 (a int, b int, c int);
+
+INSERT INTO join_test_1 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+INSERT INTO join_test_2 SELECT mod(i,10), mod(i,10), mod(i,10) FROM generate_series(1,1000) s(i);
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- create extended statistics on the join/where columns
+CREATE STATISTICS join_stats_1 ON a, b, c, (a+1), (b+1) FROM join_test_1;
+CREATE STATISTICS join_stats_2 ON a, b, c, (a+1), (b+1) FROM join_test_2;
+
+ANALYZE join_test_1;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b))');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 0');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c < 3');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1) and (j1.b = j2.b)) where j1.c < 5 and j2.c > 5');
+
+-- can't be improved due to the optimization in clauselist_selectivity_ext,
+-- which skips cases with a single (join) clause
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on ((j1.a + 1 = j2.a + 1)) where j1.c < 5 and j2.c > 5');
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_stats_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+
+-- no MCV on join_test_2 (on the (a+1) expression)
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+CREATE STATISTICS join_stats_2 ON (a+1) FROM join_test_2;
+ANALYZE join_test_2;
+
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a) where j1.c < 5');
+SELECT * FROM check_estimated_rows('select * from join_test_1 j1 join join_test_2 j2 on (j1.a + 1 = j2.a + 1) where j1.c < 5');
+
+
+DROP TABLE join_test_1;
+DROP TABLE join_test_2;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.31.1

Re: using extended statistics to improve join estimates

Reply via email to