On 07/12/2018 05:00 PM, Teodor Sigaev wrote:
> 
>> One more thought about estimating the worst case - I wonder if simply
>> multiplying the per-tuple cost by 1.5 is the right approach. It does not
>> seem particularly principled, and it's trivially simple to construct
>> counter-examples defeating it (imagine columns with 99% of the rows
>> having the same value, and then many values in exactly one row - that
>> will defeat the ndistinct-based group size estimates).
> 
> Agree. But right now that is the best we have, and the constant 1.5 could
> easily be changed to 1 or 10 - it is just an arbitrary value, chosen
> intuitively.  As I mentioned in the v7 patch description, the new
> estimation function gives a ~10% bigger estimate, and this constant should
> not be very large, because we could easily get an overestimation.
> 
>>
>> The reason why constructing such counter-examples is so simple is that
>> this relies entirely on the ndistinct stats, ignoring the other stats we
>> already have. I think this might leverage the per-column MCV lists (and
>> eventually multi-column MCV lists, if that ever gets committed).
>>
>> We're estimating the number of tuples in a group (after fixing values in
>> the preceding columns), because that's what determines the number of
>> comparisons we need to make at a given stage. The patch does this by
>> estimating the average group size, by estimating ndistinct of the
>> preceding columns G(i-1), and computing ntuples/G(i-1).
>>
>> But consider that we also have MCV lists for each column - if there is a
>> very common value, it should be there. So let's say M(i) is the frequency
>> of the most common value in the i-th column, determined either from the
>> MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum of
>> M(i) over the first i columns, the largest possible group of values when
>> comparing values in the i-th column is
>>
>>      N * m(i-1)
>>
>> This seems like a better way to estimate the worst case. I don't know if
>> this should be used instead of NG(i), or if those two estimates should
>> be combined in some way.
> Agree, we definitely need an estimate of the largest group size to use in
> cost_sort. But I don't feel up to doing that - could you please attack the
> group size estimation issue? Thank you.
> 

Attached is a simple patch illustrating this MCV-based approach. I don't
claim it's finished / RFC, but hopefully it's sufficient to show what I
meant. I'm not sure how to plug it into v8 of your patch at this
point, so I've simply added an elog to estimate_num_groups to also print
the estimate of the largest group for GROUP BY queries.
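
To make it easier to see what the elog prints, on a made-up test table
like this (the actual estimate in the output will of course depend on
the statistics gathered by ANALYZE):

    CREATE TABLE t (a int, b int);
    INSERT INTO t SELECT i % 10, i % 1000
      FROM generate_series(1, 100000) i;
    ANALYZE t;
    SELECT a, b FROM t GROUP BY a, b;

the GROUP BY query should produce a line like

    WARNING:  estimate_largest_group (2 exprs) ...

during planning.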

It's largely a simplified copy of estimate_num_groups(), and there are a
couple of comments on how it might be improved further.

>>
>> I think estimating the largest group we need to sort should be helpful
>> for the incremental sort patch, so I'm adding Alexander to this
>> thread.
> Agree, I think so. The suggested estimation algorithm should be modified
> to support leading preordered keys, and this form could be directly used
> in the GROUP BY and incremental sort patches.

Right. For incremental sort, I think the function estimating the group
size could produce two values - the maximum size of the leading group,
and the maximum group size overall. The first one would be useful for
the startup cost, the second one for the total cost.
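
For example (with made-up numbers): given 1M input rows and a largest
leading group estimated at 10,000 rows, the sort may have to read and
sort up to 10,000 rows before it can emit anything, which is what the
startup cost should reflect; the total cost should instead reflect the
largest group we may have to sort anywhere in the input.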

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index f1c78ffb65..2d91c17866 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3424,6 +3424,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 	ListCell   *l;
 	int			i;
 
+	/* DEBUG */
+	elog(WARNING, "estimate_largest_group (%d exprs) %f", list_length(groupExprs), estimate_largest_group(root, groupExprs, input_rows));
+
 	/*
 	 * We don't ever want to return an estimate of zero groups, as that tends
 	 * to lead to division-by-zero and other unpleasantness.  The input_rows
@@ -3735,6 +3738,265 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 }
 
 /*
+ * estimate_largest_group	- Estimate size of largest group of rows
+ *
+ * Given a query having a GROUP BY or ORDER BY clause, estimate the size
+ * of the largest group.
+ *
+ * Inputs:
+ *	root - the query
+ *	groupExprs - list of expressions being grouped by
+ *	input_rows - number of rows estimated to arrive at the group/unique
+ *		filter step
+ *
+ * Estimating the average size of a group can be done simply as
+ * input_rows/N, where N is the number of groups obtained from
+ * estimate_num_groups. But in various places we are more interested in
+ * the worst-case scenario, i.e. the size of the largest group of rows
+ * (which may be significantly larger than the average group).
+ *
+ * To estimate the size of the largest group, we use the per-column MCV
+ * lists (and additional statistics like null_frac).
+ *
+ * Given the lack of any cross-correlation statistics in the system, it's
+ * impossible to do anything really trustworthy with GROUP BY and ORDER BY
+ * conditions involving multiple Vars.
+ *
+ * Our current approach is as follows:
+ *	1.  Reduce the given expressions to a list of unique Vars used.  For
+ *		example, ORDER BY a, a + b is treated the same as ORDER BY a, b.
+ *		It is clearly correct not to count the same Var more than once.
+ *		It is also reasonable to treat f(x) the same as x: f() cannot
+ *		increase the number of distinct values (unless it is volatile,
+ *		which we consider unlikely for grouping or sorting), but it
+ *		probably won't reduce the number of distinct values much either.
+ *		As a special case, if the expression can be matched to an
+ *		expressional index for which we have statistics, then we treat
+ *		the whole expression as though it were just a Var.
+ *	2.  For Vars within a single source rel, we determine the largest group
+ *		for each (from MCV list, null_frac and ndistinct), and then compute
+ *		the minimum of these values.
+ *	3.  If there are Vars from multiple rels, we repeat step 2 for each such
+ *		rel and take the minimum across all of them as well.
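+ *
+ * As a purely hypothetical worked example of step 2: for GROUP BY a, b
+ * with both Vars coming from a single rel, if a's most common value has
+ * frequency 0.20 and b's has frequency 0.05, the size of the largest
+ * group is estimated as min(0.20, 0.05) * input_rows = 0.05 * input_rows.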
+ */
+double
+estimate_largest_group(PlannerInfo *root, List *groupExprs, double input_rows)
+{
+	List	   *varinfos = NIL;
+	double		srf_multiplier = 1.0;
+	ListCell   *l;
+
+	/* frequency of the largest group */
+	double		min_group_frequency = 1.0;
+	double		max_group_rows;
+
+	/*
+	 * We don't ever want to return an estimate of zero rows, as that tends
+	 * to lead to division-by-zero and other unpleasantness.  The input_rows
+	 * estimate is usually already at least 1, but clamp it just in case it
+	 * isn't.
+	 */
+	input_rows = clamp_row_est(input_rows);
+
+	/*
+	 * If no grouping columns, there's exactly one group containing all the
+	 * input rows.  (This can't happen for normal cases with GROUP BY or
+	 * DISTINCT, but it is possible for corner cases with set operations.)
+	 */
+	if (groupExprs == NIL)
+		return input_rows;
+
+	foreach(l, groupExprs)
+	{
+		Node	   *groupexpr = (Node *) lfirst(l);
+		double		this_srf_multiplier;
+		VariableStatData vardata;
+		List	   *varshere;
+		ListCell   *l2;
+
+		/*
+		 * Set-returning functions in grouping columns are a bit problematic.
+		 * The code below will effectively ignore their SRF nature and come up
+		 * with a group size estimate as though they were scalar functions.
+		 * We compensate by scaling up the end result by the largest SRF
+		 * rowcount estimate.  (This will be an overestimate if the SRF
+		 * produces multiple copies of any output value, but it seems best to
+		 * assume the SRF's outputs are distinct.  In any case, it's probably
+		 * pointless to worry too much about this without much better
+		 * estimates for SRF output rowcounts than we have today.)
+		 */
+		this_srf_multiplier = expression_returns_set_rows(groupexpr);
+		if (srf_multiplier < this_srf_multiplier)
+			srf_multiplier = this_srf_multiplier;
+
+		/*
+		 * If examine_variable is able to deduce anything about the GROUP BY
+		 * expression, treat it as a single variable even if it's really more
+		 * complicated.
+		 */
+		examine_variable(root, groupexpr, 0, &vardata);
+		if (HeapTupleIsValid(vardata.statsTuple) || vardata.isunique)
+		{
+			varinfos = add_unique_group_var(root, varinfos,
+											groupexpr, &vardata);
+			ReleaseVariableStats(vardata);
+			continue;
+		}
+		ReleaseVariableStats(vardata);
+
+		/*
+		 * Else pull out the component Vars.  Handle PlaceHolderVars by
+		 * recursing into their arguments (effectively assuming that the
+		 * PlaceHolderVar doesn't change the number of groups, which boils
+		 * down to ignoring the possible addition of nulls to the result set).
+		 */
+		varshere = pull_var_clause(groupexpr,
+								   PVC_RECURSE_AGGREGATES |
+								   PVC_RECURSE_WINDOWFUNCS |
+								   PVC_RECURSE_PLACEHOLDERS);
+
+		/*
+		 * If we find any variable-free GROUP BY item, then either it is a
+		 * constant (and we can ignore it) or it contains a volatile function;
+		 * in the latter case we punt and assume that each input row will
+		 * yield a distinct group.
+		 */
+		if (varshere == NIL)
+		{
+			if (contain_volatile_functions(groupexpr))
+				return 1.0;	/* XXX Maybe return srf_multiplier instead? */
+			continue;
+		}
+
+		/*
+		 * Else add variables to varinfos list
+		 */
+		foreach(l2, varshere)
+		{
+			Node	   *var = (Node *) lfirst(l2);
+
+			examine_variable(root, var, 0, &vardata);
+			varinfos = add_unique_group_var(root, varinfos, var, &vardata);
+			ReleaseVariableStats(vardata);
+		}
+	}
+
+	/*
+	 * If now no Vars, we must have an all-constant GROUP BY list, so the
+	 * whole result is one large group.
+	 *
+	 * XXX Apply the srf_multiplier here?
+	 */
+	if (varinfos == NIL)
+		return input_rows;
+
+	/*
+	 * Group Vars by relation and estimate per-relation largest group.
+	 *
+	 * For each iteration of the outer loop, we process the frontmost Var in
+	 * varinfos, plus all other Vars in the same relation.  We remove these
+	 * Vars from the newvarinfos list for the next iteration. This is the
+	 * easiest way to group Vars of same rel together.
+	 */
+	do
+	{
+		GroupVarInfo *varinfo1 = (GroupVarInfo *) linitial(varinfos);
+		List	   *newvarinfos = NIL;
+		List	   *relvarinfos = NIL;
+
+		/*
+		 * Split the list of varinfos in two - one for the current rel, one
+		 * for remaining Vars on other rels.
+		 */
+		relvarinfos = lcons(varinfo1, relvarinfos);
+		for_each_cell(l, lnext(list_head(varinfos)))
+		{
+			GroupVarInfo *varinfo2 = (GroupVarInfo *) lfirst(l);
+
+			if (varinfo2->rel == varinfo1->rel)
+			{
+				/* varinfos on current rel */
+				relvarinfos = lcons(varinfo2, relvarinfos);
+			}
+			else
+			{
+				/* not time to process varinfo2 yet */
+				newvarinfos = lcons(varinfo2, newvarinfos);
+			}
+		}
+
+		/*
+		 * Get the largest group estimate for the Vars of this rel.  For
+		 * each Var we look for MCV first - if there's one, we take the
+		 * frequency of the first (most common) item. If there's no MCV
+		 * for a Var, we assume all groups are about equal, and instead
+		 * use 1/ndistinct.
+		 *
+		 * XXX This should probably consider null_frac.
+		 */
+		foreach(l, relvarinfos)
+		{
+			GroupVarInfo   *varinfo2 = (GroupVarInfo *) lfirst(l);
+			Node		   *var = varinfo2->var;
+			AttStatsSlot	sslot;
+			VariableStatData vardata;
+			double			group_frequency;
+
+			examine_variable(root, var, 0, &vardata);
+
+			/* assume we have uniformly distributed groups */
+			group_frequency = 1.0 / varinfo2->ndistinct;
+
+			/*
+			 * But if there is an MCV list, take the frequency of its first
+			 * (most common) item.
+			 *
+			 * XXX We could also sort the MCV items, and take the first
+			 * one (this would help for incremental sort, particularly
+			 * when caring about startup cost, as e.g. for LIMIT). We
+			 * could also produce two values - largest first group and
+			 * largest group in general, and use both for costing.
+			 */
+			if (HeapTupleIsValid(vardata.statsTuple) &&
+				get_attstatsslot(&sslot, vardata.statsTuple,
+								 STATISTIC_KIND_MCV, InvalidOid,
+								 ATTSTATSSLOT_NUMBERS))
+			{
+				if (sslot.nnumbers > 0)
+					group_frequency = sslot.numbers[0];
+
+				free_attstatsslot(&sslot);
+			}
+
+			/*
+			 * TODO perhaps consider null_frac here, depending on NULLS
+			 * FIRST/LAST (again, more relevant to incremental sort).
+			 */
+
+			if (group_frequency < min_group_frequency)
+				min_group_frequency = group_frequency;
+
+			ReleaseVariableStats(vardata);
+		}
+
+		varinfos = newvarinfos;
+	} while (varinfos != NIL);
+
+	/* compute the group size (frequency -> rows) */
+	max_group_rows = min_group_frequency * input_rows;
+
+	/* Now we can account for the effects of any SRFs */
+	max_group_rows *= srf_multiplier;
+
+	/* Round off */
+	max_group_rows = ceil(max_group_rows);
+
+	/* Guard against out-of-range answers */
+	if (max_group_rows > input_rows)
+		max_group_rows = input_rows;
+	if (max_group_rows < 1.0)
+		max_group_rows = 1.0;
+
+	return max_group_rows;
+}
+
+/*
  * Estimate hash bucket statistics when the specified expression is used
  * as a hash key for the given number of buckets.
  *
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 95e44280c4..976df7720f 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -209,6 +209,9 @@ extern void mergejoinscansel(PlannerInfo *root, Node *clause,
 extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
 					double input_rows, List **pgset);
 
+extern double estimate_largest_group(PlannerInfo *root, List *groupExprs,
+					double input_rows);
+
 extern void estimate_hash_bucket_stats(PlannerInfo *root,
 						   Node *hashkey, double nbuckets,
 						   Selectivity *mcv_freq,
