Hi!

The current estimation of sort cost has the following issues:
 - it doesn't distinguish between single-column and multi-column sorts
 - it ignores the comparison function cost and the column width
 - it doesn't try to estimate the number of comparison function calls on a
   per-column basis

At least [1] and [2] run into these issues, and both raised objections/questions about the correctness of the sort cost estimation. The suggested patch tries to improve the current estimation and to solve these issues.

Basic ideas:
 - let me introduce some notation:
    i   - column index, starting from 1
    N   - number of tuples to sort
    Gi  - number of groups defined by the first i columns. Obviously,
          G0 == 1. The number of groups is calculated by estimate_num_groups().
    NGi - number of tuples in one group: NG0 == N and NGi = N/G(i-1)
    Fi  - cost of one comparison function call for the i-th column
    Wi  - average width in bytes of the i-th column
    Ci  - total cost of one comparison call for the i-th column, calculated
          as Fi * fine_function(Wi), where fine_function(Wi) looks like:
          if (Wi <= sizeof(Datum))
              return 1.0; // any type passed as a Datum
          else
              return 1.0 + 0.125 * log2(Wi / sizeof(Datum));
          the 0.125 coefficient is just an empirical choice (it matches
          get_width_cost_multiplier() in the patch below)
 - So, the cost for the first column is 2.0 * C1 * log2(N).
   The first column is always compared; the 2.0 multiplier was chosen based
   on regression tests. It seems to account for the cost of transferring the
   tuple into the CPU cache, the startup cost of unpacking the tuple, the
   cost of calling the qsort comparison wrapper function, etc. Removing this
   multiplier makes the cost prediction too optimistic.
 - The cost for the i-th column is:
   Ci * log2(NGi) => Ci * log2(N/G(i-1))
   Here I assume that the i-th comparison function is called only within a
   group defined by the (i-1) leading columns. Note, following the discussion
   in [1], I added a 1.5 multiplier here to get closer to the worst case,
   where the groups differ significantly in size. Right now there is no way
   to determine how much they could differ. Note that this formula describes
   the cost of the first column too, except for the different multiplier.
 - The final cost is cpu_operator_cost * N * sum(per-column costs described
   above); a worked sketch follows this list. Note that for a single column
   with width <= sizeof(Datum) and F1 = 1, this formula gives exactly the
   same result as the current one.
 - For Top-N sort, the heuristic is close to the old one: use the 2.0
   multiplier as a constant under log2, and use log2(Min(NGi, output_tuples))
   for the second and following columns.
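
To make the arithmetic concrete, here is a minimal standalone sketch of the
formula above (plain C, compiles on its own). It is purely illustrative: the
names and the example numbers (N, the per-column widths, G1 = 1000 groups)
are made up, and the 0.125 width coefficient simply mirrors
get_width_cost_multiplier() in the patch below.

    #include <math.h>
    #include <stdio.h>

    #define LOG2(x)     (log(x) / 0.693147180559945)
    #define DATUM_SIZE  8.0             /* stand-in for sizeof(Datum) */

    /* fine_function(Wi) from the description above */
    static double
    width_cost_multiplier(double width)
    {
        if (width <= DATUM_SIZE)
            return 1.0;
        return 1.0 + 0.125 * LOG2(width / DATUM_SIZE);
    }

    int
    main(void)
    {
        double  N = 1e6;                /* tuples to sort */
        double  F[] = {1.0, 1.0};       /* Fi: comparison function costs */
        double  W[] = {8.0, 32.0};      /* Wi: average widths in bytes */
        double  G[] = {1.0, 1000.0};    /* G0 == 1, G1: groups by column 1 */
        double  per_tuple = 0.0;
        int     i;

        for (i = 0; i < 2; i++)
        {
            double  Ci = F[i] * width_cost_multiplier(W[i]);

            if (i == 0)     /* first column: always compared */
                per_tuple += 2.0 * Ci * LOG2(N / G[0]);
            else            /* i-th column: compared only within groups */
                per_tuple += 1.5 * Ci * LOG2(N / G[i]);
        }

        /* final cost = cpu_operator_cost * N * per_tuple */
        printf("per-tuple comparison cost: %g\n", per_tuple);
        return 0;
    }

For a single key column of width <= sizeof(Datum) and F1 = 1, the loop
degenerates to 2.0 * log2(N) per tuple, which is exactly the current
estimate of 2.0 * cpu_operator_cost * N * log2(N).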

I believe the suggested method could easily be enhanced to support [1] and
to be used in [2].

Open items:
 - Fake Vars. I ran into generate_append_tlist(), which generates Vars with
   varno = 0; as far as I can see, this was introduced in 2002 with the
   comment 'should be changed in future'. The future hasn't come yet... I've
   added a workaround to prevent calling estimate_num_groups() with such
   Vars, but I'm not sure that is enough.
 - The costs of all comparison functions are currently identical and equal
   to 1. Maybe it's time to change that.
 - Empirical constants. I know it's impossible to remove them entirely, but
   maybe we can find better estimates for them.

[1] https://commitfest.postgresql.org/18/1124/
[2] https://commitfest.postgresql.org/18/1651/

--
Teodor Sigaev                                   E-mail: teo...@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a7e0c520..d6b19bb9ad 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1612,6 +1612,231 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 									rterm->pathtarget->width);
 }
 
+/*
+ * is_fake_var
+ *		Workaround for generate_append_tlist(), which generates fake Vars with
+ *		varno == 0 that would make estimate_num_groups() fail
+ */
+static bool
+is_fake_var(Expr *expr)
+{
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	return (IsA(expr, Var) && ((Var *) expr)->varno == 0);
+}
+
+/*
+ * get_width_cost_multiplier
+ *		Returns the relative complexity of comparing two values based on their
+ * width.  The idea: longer values are more expensive to compare.  The return
+ * value is in cpu_operator_cost units.
+ */
+static double
+get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
+{
+	double	width = -1.0; /* fake value */
+
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	/* Try to find actual stats in the corresponding relation */
+	if (IsA(expr, Var))
+	{
+		Var		*var = (Var *) expr;
+
+		if (var->varno > 0 && var->varno < root->simple_rel_array_size)
+		{
+			RelOptInfo	*rel = root->simple_rel_array[var->varno];
+
+			if (rel != NULL &&
+				var->varattno >= rel->min_attr &&
+				var->varattno <= rel->max_attr)
+			{
+				int	ndx = var->varattno - rel->min_attr;
+
+				if (rel->attr_widths[ndx] > 0)
+					width = rel->attr_widths[ndx];
+			}
+		}
+	}
+
+	/* Didn't find any actual stats, so estimate by type */
+	if (width < 0.0)
+	{
+		Node	*node = (Node*) expr;
+
+		width = get_typavgwidth(exprType(node), exprTypmod(node));
+	}
+
+	/*
+	 * Any value in PostgreSQL is passed as a Datum, so no operation on a
+	 * value can be cheaper than an operation on a Datum.
+	 */
+	if (width <= sizeof(Datum))
+		return 1.0;
+
+	/*
+	 * It seems the cost of a comparison is not directly proportional to the
+	 * arguments' width: the arguments may differ in width (we know only the
+	 * column average), and the result is often determined by the first few
+	 * bytes.  So, use a logarithmic fine (0.125 * log2) as the estimate.
+	 */
+	return 1.0 + 0.125 * LOG2(width / sizeof(Datum));
+}
+
+
+/*
+ * compute_cpu_sort_cost
+ *		compute CPU cost of sort (i.e. in-memory)
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  In this case, we fall back to a
+ * simple comparison cost estimate.
+ */
+
+static Cost
+compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys,
+						   Cost comparison_cost,
+						   double tuples, double output_tuples, bool heapSort)
+{
+	Cost		per_tuple_cost = comparison_cost;
+	ListCell	*lc;
+	List		*pathkeyExprs = NIL;
+	double		tuplesPerGroup;
+	bool		has_fake_var = false;
+
+	/* fallback if pathkeys is unknown */
+	if (list_length(pathkeys) == 0)
+	{
+		/*
+		 * If we use a bounded heap-sort keeping just K tuples in memory, the
+		 * total number of tuple comparisons is N log2 K; but the constant
+		 * factor is a bit higher than for quicksort.  Tweak it so that the
+		 * cost curve is continuous at the crossover point.
+		 */
+		output_tuples = (heapSort) ? 2.0 * output_tuples : tuples;
+		per_tuple_cost += 2.0 * cpu_operator_cost * LOG2(output_tuples);
+
+		return per_tuple_cost * tuples;
+	}
+
+	/*
+	 * Computing the total cost of sorting takes into account:
+	 * - the per-column comparison function cost
+	 * - the estimated number of comparisons per column
+	 */
+
+	foreach(lc, pathkeys)
+	{
+		PathKey	*pathkey = (PathKey*)lfirst(lc);
+		Cost	funcCost = 1.0; /* fallback value */
+		EquivalenceMember	*em;
+
+		/*
+		 * We assume the equivalence members aren't very different, so to
+		 * estimate the cost we take just the first member.
+		 */
+		em = (EquivalenceMember *) linitial(pathkey->pk_eclass->ec_members);
+
+		if (em->em_datatype != InvalidOid)
+		{
+			Oid		sortop;
+
+			sortop = get_opfamily_member(pathkey->pk_opfamily,
+										 em->em_datatype, em->em_datatype,
+										 pathkey->pk_strategy);
+
+			if (sortop != InvalidOid)
+			{
+				Oid	funcOid = get_opcode(sortop);
+
+				funcCost = get_func_cost(funcOid);
+			}
+		}
+
+		/* Take the actual width cost multiplier into account */
+		funcCost *= get_width_cost_multiplier(root, em->em_expr);
+
+		if (pathkeyExprs == NIL)
+		{
+			/*
+			 * The first column is always compared.  The 2.0 multiplier is
+			 * empirical; it seems to account for transferring the tuple into
+			 * the CPU cache, the initial steps of tuple unpacking, and so on.
+			 */
+			per_tuple_cost += 2.0 * funcCost * LOG2(tuples);
+		}
+		else
+		{
+			double		n_groups;
+
+			/*
+			 * Prevent calling estimate_num_groups() with a fake Var.  Note
+			 * that pathkeyExprs contains only the previous columns.
+			 */
+			if (has_fake_var == false)
+				n_groups = estimate_num_groups(root, pathkeyExprs, tuples, NULL);
+			else if (tuples > 4.0)
+				/*
+				 * Use the geometric mean as the estimate if there are no
+				 * stats.  Don't use DEFAULT_NUM_DISTINCT because it applies
+				 * to only one column, while here we try to estimate the
+				 * number of groups over a set of columns.
+				 */
+				n_groups = ceil(2.0 + sqrt(tuples) *
+					(1 + list_length(pathkeyExprs)) / list_length(pathkeys));
+			else
+				n_groups = tuples;
+
+			/*
+			 * The second and following columns are compared only when the
+			 * leading columns are equal, so the comparison function is
+			 * called log2(number of tuples in group) times.
+			 * If the number of tuples in a group is 1, the comparison
+			 * functions of the current and following columns are not called
+			 * at all.
+			 */
+			tuplesPerGroup = tuples/n_groups;
+
+			if (tuplesPerGroup > 1.0)
+			{
+				/* Top-N sort case: comparisons are limited to output_tuples */
+				if (tuplesPerGroup > output_tuples)
+					tuplesPerGroup = output_tuples;
+
+				/* See comment above about heapsort */
+				if (heapSort)
+					tuplesPerGroup *= 2.0;
+
+				/*
+				 * Real-world distributions aren't uniform, but for now we
+				 * have no way to determine the skew, so add a multiplier to
+				 * get closer to the worst case.
+				 */
+				per_tuple_cost += 1.5 * funcCost * LOG2(tuplesPerGroup);
+			}
+			else
+			{
+				/* we can skip all following columns in the cost estimation */
+				break;
+			}
+		}
+
+		/* remember whether we have a fake Var in the pathkeys */
+		has_fake_var |= is_fake_var(em->em_expr);
+
+		pathkeyExprs = lappend(pathkeyExprs, em->em_expr);
+	}
+
+	list_free(pathkeyExprs);
+
+	/* per_tuple_cost is in cpu_operator_cost units */
+	per_tuple_cost *= cpu_operator_cost;
+
+	return tuples * per_tuple_cost;
+}
+
 /*
  * cost_sort
  *	  Determines and returns the cost of sorting a relation, including
@@ -1628,7 +1853,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * number of initial runs formed and M is the merge order used by tuplesort.c.
  * Since the average initial run should be about sort_mem, we have
  *		disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
- *		cpu = comparison_cost * t * log2(t)
+ * and the CPU cost is computed by compute_cpu_sort_cost().
  *
  * If the sort is bounded (i.e., only the first k result tuples are needed)
  * and k tuples can fit into sort_mem, we use a heap method that keeps only
@@ -1649,13 +1874,6 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
@@ -1682,9 +1900,6 @@ cost_sort(Path *path, PlannerInfo *root,
 	if (tuples < 2.0)
 		tuples = 2.0;
 
-	/* Include the default cost-per-comparison */
-	comparison_cost += 2.0 * cpu_operator_cost;
-
 	/* Do we have a useful LIMIT? */
 	if (limit_tuples > 0 && limit_tuples < tuples)
 	{
@@ -1713,7 +1928,9 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		startup_cost += compute_cpu_sort_cost(root, pathkeys,
+											  comparison_cost, tuples,
+											  tuples, false);
 
 		/* Disk costs */
 
@@ -1729,18 +1946,17 @@ cost_sort(Path *path, PlannerInfo *root,
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
 	{
-		/*
-		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
-		 * a total number of tuple comparisons of N log2 K; but the constant
-		 * factor is a bit higher than for quicksort.  Tweak it so that the
-		 * cost curve is continuous at the crossover point.
-		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		/* We'll use a bounded heap-sort keeping just K tuples in memory. */
+		startup_cost += compute_cpu_sort_cost(root, pathkeys,
+											  comparison_cost, tuples,
+											  output_tuples, true);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		startup_cost += compute_cpu_sort_cost(root, pathkeys,
+											  comparison_cost, tuples,
+											  tuples, false);
 	}
 
 	/*
