Re: Expand applicability of aggregate's sortop optimization

Andrei Lepikhov Tue, 16 Jul 2024 20:29:23 -0700

On 5/8/24 17:13, Matthias van de Meent wrote:

As you may know, aggregates like SELECT MIN(unique1) FROM tenk1; are
rewritten as SELECT unique1 FROM tenk1 ORDER BY unique1 USING < LIMIT
1; by using the optional sortop field in the aggregator.
However, this optimization is disabled for clauses that in itself have
an ORDER BY clause such as `MIN(unique1 ORDER BY <anything>), because
<anything> can cause reordering of distinguisable values like 1.0 and
1.00, which then causes measurable differences in the output. In the
general case, that's a good reason to not apply this optimization, but
in some cases, we could still apply the index optimization.

Thanks for the job! I guess you could be more brave and push down alsoFILTER statement.


One of those cases is fixed in the attached patch: if we order by the
same column that we're aggregating, using the same order class as the
aggregate's sort operator (i.e. the aggregate's sortop is in the same
btree opclass as the ORDER BY's sort operator), then we can still use
the index operation: The sort behaviour isn't changed, thus we can
apply the optimization.

PFA the small patch that implements this.

Note that we can't blindly accept just any ordering by the same
column: If we had an opclass that sorted numeric values by the length
of the significant/mantissa, then that'd provide a different (and
distinct) ordering from that which is expected by the normal
min()/max() aggregates for numeric, which could cause us to return
arguably incorrect results for the aggregate expression.

As I see, the code:
aggsortop = fetch_agg_sort_op(aggref->aggfnoid);
if (!OidIsValid(aggsortop))
  return false;         /* not a MIN/MAX aggregate */

used twice and can be evaluated earlier to avoid duplicated code.

Also, I'm unsure about the necessity of looking through the btreeclasses. Maybe just to check the commutator to the sortop, like in thediff attached? Or could you provide an example to support your approach?


--
regards, Andrei Lepikhov

diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index afb5445b77..e9e95e2b2a 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -253,6 +253,20 @@ can_minmax_aggs(PlannerInfo *root, List **context)
 		if (list_length(aggref->args) != 1)
 			return false;		/* it couldn't be MIN/MAX */
 
+		/*
+		 * We might implement the optimization when a FILTER clause is present
+		 * by adding the filter to the quals of the generated subquery.  For
+		 * now, just punt.
+		 */
+		if (aggref->aggfilter != NULL)
+			return false;
+
+		curTarget = (TargetEntry *) linitial(aggref->args);
+
+		aggsortop = fetch_agg_sort_op(aggref->aggfnoid);
+		if (!OidIsValid(aggsortop))
+			return false;		/* not a MIN/MAX aggregate */
+
 		/*
 		 * ORDER BY is usually irrelevant for MIN/MAX, but it can change the
 		 * outcome if the aggsortop's operator class recognizes non-identical
@@ -267,22 +281,37 @@ can_minmax_aggs(PlannerInfo *root, List **context)
 		 * quickly.
 		 */
 		if (aggref->aggorder != NIL)
-			return false;
-		/* note: we do not care if DISTINCT is mentioned ... */
-
-		/*
-		 * We might implement the optimization when a FILTER clause is present
-		 * by adding the filter to the quals of the generated subquery.  For
-		 * now, just punt.
-		 */
-		if (aggref->aggfilter != NULL)
-			return false;
+		{
+			SortGroupClause *orderClause;
+
+			/*
+			 * If the order clause is the same column as the one we're
+			 * aggregating, we can still use the index: It is undefined which
+			 * value is MIN() or MAX(), as well as which value is first or
+			 * last when sorted. So, we can still use the index IFF the
+			 * aggregated expression equals the expression used in the
+			 * ordering operation.
+			 */
+
+			/*
+			 * We only accept a single argument to min/max aggregates,
+			 * orderings that have more clauses won't provide correct results.
+			 */
+			if (list_length(aggref->aggorder) > 1)
+				return false;
+
+			orderClause = castNode(SortGroupClause, linitial(aggref->aggorder));
+
+			if (orderClause->tleSortGroupRef != curTarget->ressortgroupref)
+				return false;
+
+			if (orderClause->sortop != aggsortop &&
+				orderClause->sortop != get_commutator(aggsortop))
+				return false;
+		}
 
-		aggsortop = fetch_agg_sort_op(aggref->aggfnoid);
-		if (!OidIsValid(aggsortop))
-			return false;		/* not a MIN/MAX aggregate */
+		/* note: we do not care if DISTINCT is mentioned ... */
 
-		curTarget = (TargetEntry *) linitial(aggref->args);
 
 		if (contain_mutable_functions((Node *) curTarget->expr))
 			return false;		/* not potentially indexable */
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index a5596ab210..5d37510d64 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1003,6 +1003,71 @@ select max(unique1) from tenk1 where unique1 > 42;
  9999
 (1 row)
 
+-- When sorting on the column that's being aggregated, indexes can also be
+-- used, but only when the aggregate's operator has the same ordering behavior
+-- as the ORDER BY-clause, i.e. if it is in the same btree opclass as the one
+-- chosen for the ORDER BY clause.
+explain (costs off)
+  select min(unique1 ORDER BY unique1 ASC NULLS LAST) from tenk1;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Result
+   InitPlan 1
+     ->  Limit
+           ->  Index Only Scan using tenk1_unique1 on tenk1
+                 Index Cond: (unique1 IS NOT NULL)
+(5 rows)
+
+select min(unique1 ORDER BY unique1 ASC NULLS LAST) from tenk1;
+ min 
+-----
+   0
+(1 row)
+
+explain (costs off)
+  select max(unique1 ORDER BY unique1 USING <) from tenk1;
+                             QUERY PLAN                              
+---------------------------------------------------------------------
+ Result
+   InitPlan 1
+     ->  Limit
+           ->  Index Only Scan Backward using tenk1_unique1 on tenk1
+                 Index Cond: (unique1 IS NOT NULL)
+(5 rows)
+
+select max(unique1 ORDER BY unique1 USING <) from tenk1;
+ max  
+------
+ 9999
+(1 row)
+
+explain (costs off)
+  select max(unique1 ORDER BY tenthous) from tenk1;
+          QUERY PLAN           
+-------------------------------
+ Aggregate
+   ->  Sort
+         Sort Key: tenthous
+         ->  Seq Scan on tenk1
+(4 rows)
+
+select max(unique1 ORDER BY tenthous) from tenk1;
+ max  
+------
+ 9999
+(1 row)
+
+--  But even then, the index can't be used if we order by multiple columns.
+explain (costs off)
+  select max(unique1 ORDER BY unique1, tenthous) from tenk1;
+             QUERY PLAN              
+-------------------------------------
+ Aggregate
+   ->  Sort
+         Sort Key: unique1, tenthous
+         ->  Seq Scan on tenk1
+(4 rows)
+
 -- the planner may choose a generic aggregate here if parallel query is
 -- enabled, since that plan will be parallel safe and the "optimized"
 -- plan, which has almost identical cost, will not be.  we want to test
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index ca6d1bcfb7..5cdf336fc0 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -368,6 +368,24 @@ explain (costs off)
   select max(unique1) from tenk1 where unique1 > 42;
 select max(unique1) from tenk1 where unique1 > 42;
 
+-- When sorting on the column that's being aggregated, indexes can also be
+-- used, but only when the aggregate's operator has the same ordering behavior
+-- as the ORDER BY-clause, i.e. if it is in the same btree opclass as the one
+-- chosen for the ORDER BY clause.
+explain (costs off)
+  select min(unique1 ORDER BY unique1 ASC NULLS LAST) from tenk1;
+select min(unique1 ORDER BY unique1 ASC NULLS LAST) from tenk1;
+explain (costs off)
+  select max(unique1 ORDER BY unique1 USING <) from tenk1;
+select max(unique1 ORDER BY unique1 USING <) from tenk1;
+explain (costs off)
+  select max(unique1 ORDER BY tenthous) from tenk1;
+select max(unique1 ORDER BY tenthous) from tenk1;
+
+--  But even then, the index can't be used if we order by multiple columns.
+explain (costs off)
+  select max(unique1 ORDER BY unique1, tenthous) from tenk1;
+
 -- the planner may choose a generic aggregate here if parallel query is
 -- enabled, since that plan will be parallel safe and the "optimized"
 -- plan, which has almost identical cost, will not be.  we want to test

Re: Expand applicability of aggregate's sortop optimization

Reply via email to