On 9/23/24 20:02, Andrei Lepikhov wrote:
On 12/9/2024 12:12, David Rowley wrote:
On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepi...@gmail.com> wrote:
Initial problem causes wrong cost_sort estimation. Right now I think
about providing cost_sort() the sort clauses instead of (or in addition
to) the pathkeys.

I'm not quite sure why the sort clauses matter any more than the
EquivalenceClass.  If the EquivalanceClass defines that all members
will have the same value for any given row, then, if we had to choose
any single member to drive the n_distinct estimate from, isn't the
most accurate distinct estimate from the member with the smallest
n_distinct estimate?  (That assumes the less distinct member has every
value the more distinct member has, which might not be true)
Finally, I implemented this approach in code (see attachment).
The effectiveness may be debatable, but general approach looks even better than previous one.
Change status to 'Need review'.
Minor change to make compiler and cfbot happy.

--
regards, Andrei Lepikhov
From bdaf7cd46eb10353b6cbaf5f22d1ea835881db24 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepi...@gmail.com>
Date: Mon, 23 Sep 2024 14:30:28 +0200
Subject: [PATCH v3] Stabilize incremental sort estimation.

Carefully identify an expression that can represent the path key on sort
estimation. Columns may have different distinct estimations that can trigger
wrong cost estimations and choice of sort strategy.
Sorting has only pathkeys as input for the cost estimation. This patch, instead
of a blind choice of the first equivalence class member, attempts to find
columns that are used under the estimating sort operator and chooses the most
negligible ndistinct value.
---
 src/backend/optimizer/path/costsize.c | 52 ++++++++++++++++++++++++---
 src/backend/optimizer/util/pathnode.c |  1 +
 src/include/optimizer/cost.h          |  1 +
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index e1523d15df..01f166f504 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1984,6 +1984,50 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	*run_cost = cpu_operator_cost * tuples;
 }
 
+static Expr *
+identify_sort_expression(PlannerInfo *root, PathKey *key, Relids relids)
+{
+	EquivalenceMember  *candidate = linitial(key->pk_eclass->ec_members);
+	double 				ndistinct = -1.0; /* Not defined */
+
+	if (root == NULL || list_length(key->pk_eclass->ec_members) == 1)
+		/* Fast path */
+		return candidate->em_expr;
+
+	foreach_node(EquivalenceMember, em, key->pk_eclass->ec_members)
+	{
+		VariableStatData	vardata;
+		bool				isdefault = true;
+		double				ndist = -1.0;
+
+		/* Sorting over single partition? */
+		if (em->em_is_child && bms_num_members(relids) > 1)
+			continue;
+
+		if (!bms_is_subset(em->em_relids, relids))
+			/*
+			 * Avoid expressions not based on a table column. At least, the
+			 * candidate value already initialised as the first EC member.
+			 */
+			continue;
+
+		/* Let's check candidate's ndistinct value */
+		examine_variable(root, (Node *) em->em_expr, 0, &vardata);
+		if (HeapTupleIsValid(vardata.statsTuple))
+			ndist = get_variable_numdistinct(&vardata, &isdefault);
+		ReleaseVariableStats(vardata);
+
+		if (ndistinct < 0.0 || (!isdefault && ndist < ndistinct))
+		{
+			candidate = em;
+			ndistinct = ndist;
+		}
+	}
+
+	Assert(candidate != NULL);
+	return candidate->em_expr;
+}
+
 /*
  * cost_incremental_sort
  * 	Determines and returns the cost of sorting a relation incrementally, when
@@ -1999,6 +2043,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 void
 cost_incremental_sort(Path *path,
 					  PlannerInfo *root, List *pathkeys, int presorted_keys,
+					  Relids relids,
 					  int input_disabled_nodes,
 					  Cost input_startup_cost, Cost input_total_cost,
 					  double input_tuples, int width, Cost comparison_cost, int sort_mem,
@@ -2053,21 +2098,20 @@ cost_incremental_sort(Path *path,
 	foreach(l, pathkeys)
 	{
 		PathKey    *key = (PathKey *) lfirst(l);
-		EquivalenceMember *member = (EquivalenceMember *)
-			linitial(key->pk_eclass->ec_members);
+		Expr	   *expr = identify_sort_expression(root, key, relids);
 
 		/*
 		 * Check if the expression contains Var with "varno 0" so that we
 		 * don't call estimate_num_groups in that case.
 		 */
-		if (bms_is_member(0, pull_varnos(root, (Node *) member->em_expr)))
+		if (bms_is_member(0, pull_varnos(root, (Node *) expr)))
 		{
 			unknown_varno = true;
 			break;
 		}
 
 		/* expression not containing any Vars with "varno 0" */
-		presortedExprs = lappend(presortedExprs, member->em_expr);
+		presortedExprs = lappend(presortedExprs, expr);
 
 		if (foreach_current_index(l) + 1 >= presorted_keys)
 			break;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fc97bf6ee2..8ed566b1bc 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3055,6 +3055,7 @@ create_incremental_sort_path(PlannerInfo *root,
 
 	cost_incremental_sort(&pathnode->path,
 						  root, pathkeys, presorted_keys,
+						  subpath->parent->relids,
 						  subpath->disabled_nodes,
 						  subpath->startup_cost,
 						  subpath->total_cost,
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 854a782944..ebd5f2c5d9 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -114,6 +114,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  double limit_tuples);
 extern void cost_incremental_sort(Path *path,
 								  PlannerInfo *root, List *pathkeys, int presorted_keys,
+								  Relids relids,
 								  int input_disabled_nodes,
 								  Cost input_startup_cost, Cost input_total_cost,
 								  double input_tuples, int width, Cost comparison_cost, int sort_mem,
-- 
2.39.5

Reply via email to