Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Andrei Lepikhov Tue, 18 Feb 2025 04:52:35 -0800

On 17/2/2025 02:06, Alexander Korotkov wrote:

On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepi...@gmail.com> wrote:

Here we also could count number of scanned NULLs separately in
vardata_extra and use it in upper GROUP-BY estimation.


What could be the type of vardata_extra?  And what information could
it store?  Yet seems too sketchy for me to understand.

It is actually sketchy. Our estimation routines have no informationabout intermediate modifications of the data. Left-join generated NULLsis a good example here. So, my vague idea is to maintain that info andchange statistical estimations somehow.

Of course, it is out of the scope here.


But, I think for now we should go with the original patch.  It seems
to be quite straightforward extension to what 4767bc8ff2 does.  I've
revised commit message and applied pg_indent to sources.  I'm going to
push this if no objections.

Ok, I added one regression test to check that feature works properly.

--
regards, Andrei Lepikhov

From d2e70f544b077de4f5b85838d6071ab0243460ce Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepi...@gmail.com>
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH v3] Improve statistics estimation for single-column GROUP BY
 in sub-queries

This commit follows the idea of the 4767bc8ff2.  If sub-query has only one
GROUP BY column, we can consider its output variable as being unique. We can
employ this fact in the statistics to make more precise estimations in the
upper query block.

Author: Andrei Lepikhov <lepi...@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Reviewed-by: Alexander Korotkov <aekorot...@gmail.com>
---
 src/backend/utils/adt/selfuncs.c        | 49 +++++++++++++++----------
 src/include/utils/selfuncs.h            |  3 +-
 src/test/regress/expected/stats_ext.out | 17 +++++++++
 src/test/regress/sql/stats_ext.sql      | 14 +++++++
 4 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d3d1e485bb..40712ab718 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -322,10 +322,10 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid 
collation,
        }
 
        /*
-        * If we matched the var to a unique index or DISTINCT clause, assume
-        * there is exactly one match regardless of anything else.  (This is
-        * slightly bogus, since the index or clause's equality operator might 
be
-        * different from ours, but it's much more likely to be right than
+        * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+        * assume there is exactly one match regardless of anything else.  (This
+        * is slightly bogus, since the index or clause's equality operator 
might
+        * be different from ours, but it's much more likely to be right than
         * ignoring the information.)
         */
        if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
@@ -484,10 +484,10 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, 
Oid collation,
        }
 
        /*
-        * If we matched the var to a unique index or DISTINCT clause, assume
-        * there is exactly one match regardless of anything else.  (This is
-        * slightly bogus, since the index or clause's equality operator might 
be
-        * different from ours, but it's much more likely to be right than
+        * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+        * assume there is exactly one match regardless of anything else.  (This
+        * is slightly bogus, since the index or clause's equality operator 
might
+        * be different from ours, but it's much more likely to be right than
         * ignoring the information.)
         */
        if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
@@ -5018,9 +5018,9 @@ ReleaseDummy(HeapTuple tuple)
  *     atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *             commonly the same as the exposed type of the variable argument,
  *             but can be different in binary-compatible-type cases.
- *     isunique: true if we were able to match the var to a unique index or a
- *             single-column DISTINCT clause, implying its values are unique 
for
- *             this query.  (Caution: this should be trusted for statistical
+ *     isunique: true if we were able to match the var to a unique index, a
+ *             single-column DISTINCT or GROUP-BY clause, implying its values 
are
+ *             unique for this query.  (Caution: this should be trusted for 
statistical
  *             purposes only, since we do not check indimmediate nor verify 
that
  *             the exact same definition of equality applies.)
  *     acl_ok: true if current user has permission to read the column(s)
@@ -5680,15 +5680,14 @@ examine_simple_variable(PlannerInfo *root, Var *var,
                Assert(IsA(subquery, Query));
 
                /*
-                * Punt if subquery uses set operations or GROUP BY, as these 
will
-                * mash underlying columns' stats beyond recognition.  (Set ops 
are
-                * particularly nasty; if we forged ahead, we would return stats
+                * Punt if subquery uses set operations or grouping sets, as 
these
+                * will mash underlying columns' stats beyond recognition.  
(Set ops
+                * are particularly nasty; if we forged ahead, we would return 
stats
                 * relevant to only the leftmost subselect...)  DISTINCT is also
                 * problematic, but we check that later because there is a 
possibility
                 * of learning something even with it.
                 */
                if (subquery->setOperations ||
-                       subquery->groupClause ||
                        subquery->groupingSets)
                        return;
 
@@ -5718,6 +5717,16 @@ examine_simple_variable(PlannerInfo *root, Var *var,
                        return;
                }
 
+               /* The same idea as with DISTINCT clause works for a GROUP-BY 
too */
+               if (subquery->groupClause)
+               {
+                       if (list_length(subquery->groupClause) == 1 &&
+                               targetIsInSortList(ste, InvalidOid, 
subquery->groupClause))
+                               vardata->isunique = true;
+                       /* cannot go further */
+                       return;
+               }
+
                /*
                 * If the sub-query originated from a view with the 
security_barrier
                 * attribute, we must not look at the variable's statistics, 
though it
@@ -5869,11 +5878,11 @@ get_variable_numdistinct(VariableStatData *vardata, 
bool *isdefault)
        }
 
        /*
-        * If there is a unique index or DISTINCT clause for the variable, 
assume
-        * it is unique no matter what pg_statistic says; the statistics could 
be
-        * out of date, or we might have found a partial unique index that 
proves
-        * the var is unique for this query.  However, we'd better still believe
-        * the null-fraction statistic.
+        * If there is a unique index, DISTINCT or GROUP-BY clause for the
+        * variable, assume it is unique no matter what pg_statistic says; the
+        * statistics could be out of date, or we might have found a partial
+        * unique index that proves the var is unique for this query.  However,
+        * we'd better still believe the null-fraction statistic.
         */
        if (vardata->isunique)
                stadistinct = -1.0 * (1.0 - stanullfrac);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index b5e95c0ff6..d35939651f 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -94,7 +94,8 @@ typedef struct VariableStatData
        Oid                     vartype;                /* exposed type of 
expression */
        Oid                     atttype;                /* actual type (after 
stripping relabel) */
        int32           atttypmod;              /* actual typmod (after 
stripping relabel) */
-       bool            isunique;               /* matches unique index or 
DISTINCT clause */
+       bool            isunique;               /* matches unique index, 
DISTINCT or GROUP-BY
+                                                                * clause */
        bool            acl_ok;                 /* result of ACL check on table 
or column */
 } VariableStatData;
 
diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index 9a820404d3..7125665a94 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3368,3 +3368,20 @@ NOTICE:  drop cascades to 2 other objects
 DETAIL:  drop cascades to table tststats.priv_test_tbl
 drop cascades to view tststats.priv_test_view
 DROP USER regress_stats_user1;
+CREATE TABLE grouping_unique_1 (x integer);
+CREATE TABLE grouping_unique_2 (x integer);
+INSERT INTO grouping_unique_1 (x) VALUES (1);
+INSERT INTO grouping_unique_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
+ANALYZE grouping_unique_1,grouping_unique_2;
+-- Optimiser treat GROUP-BY operator as an 'uniqueser' of the input
+SELECT * FROM check_estimated_rows('
+  SELECT * FROM grouping_unique_1 t1 LEFT JOIN (
+    SELECT x FROM grouping_unique_2 t2 GROUP BY x) AS q1
+  ON t1.x=q1.x;
+');
+ estimated | actual 
+-----------+--------
+         1 |      1
+(1 row)
+
+DROP TABLE grouping_unique_1, grouping_unique_2;
diff --git a/src/test/regress/sql/stats_ext.sql 
b/src/test/regress/sql/stats_ext.sql
index 75b04e5a13..ca327f69b3 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1707,3 +1707,17 @@ RESET SESSION AUTHORIZATION;
 DROP TABLE stats_ext_tbl;
 DROP SCHEMA tststats CASCADE;
 DROP USER regress_stats_user1;
+
+CREATE TABLE grouping_unique_1 (x integer);
+CREATE TABLE grouping_unique_2 (x integer);
+INSERT INTO grouping_unique_1 (x) VALUES (1);
+INSERT INTO grouping_unique_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
+ANALYZE grouping_unique_1,grouping_unique_2;
+
+-- Optimiser treat GROUP-BY operator as an 'uniqueser' of the input
+SELECT * FROM check_estimated_rows('
+  SELECT * FROM grouping_unique_1 t1 LEFT JOIN (
+    SELECT x FROM grouping_unique_2 t2 GROUP BY x) AS q1
+  ON t1.x=q1.x;
+');
+DROP TABLE grouping_unique_1, grouping_unique_2;
-- 
2.48.1

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Reply via email to