Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Andrei Lepikhov Thu, 19 Sep 2024 00:55:43 -0700

Hi,

One database app developer complied that He constantly has a problemwith row estimations in joins over groupings and provided the demo example:


CREATE TABLE line (id int PRIMARY KEY, docId int, amount numeric);
CREATE INDEX line_doc ON line (docid);
INSERT INTO line (id, docId, amount)
  (SELECT docId*100 + id AS id, docId, random() AS amount
  FROM generate_series(1, 10) AS id,
       generate_series(1, 25000) AS docid);
INSERT INTO line (id, docId, amount)
  (SELECT docId*100 + id AS id, docId, random() AS amount
  FROM generate_series(1, 20) AS id,
       generate_series(25001, 50000) AS docid);
INSERT INTO line (id, docId, amount)
  (SELECT docId*100 + id AS id, docId, random() AS amount
  FROM generate_series(1, 50) AS id,
       generate_series(50001, 75000) AS docid);
INSERT INTO line (id, docId, amount)
  (SELECT docId*100 + id AS id, docId, random() AS amount
  FROM generate_series(1, 100) AS id,
       generate_series(75001, 100000) AS docid);
CREATE TABLE tmp (id int PRIMARY KEY);
INSERT INTO tmp (id) SELECT * FROM generate_series(1, 50);
ANALYZE line, tmp;

EXPLAIN
SELECT tmp.id, sq.amount FROM tmp
  LEFT JOIN
    (SELECT docid, SUM(amount) AS amount FROM line
      JOIN tmp ON tmp.id = docid GROUP BY 1) sq
  ON sq.docid = tmp.id;

with this query we have bad estimation of the top JOIN:

 Hash Right Join  (rows=855)
   Hash Cond: (line.docid = tmp.id)
   ->  GroupAggregate  (cost=3.49..117.25 rows=3420 width=36)
         Group Key: line.docid
         ->  Merge Join  (cost=3.49..57.40 rows=3420 width=15)
               Merge Cond: (line.docid = tmp_1.id)
                ...

This wrong prediction makes things much worse if the query has moreupper query blocks.His question was: Why not consider the grouping column unique in theupper query block? It could improve estimations.After a thorough investigation, I discovered that in commit 4767bc8ff2most of the work was already done for DISTINCT clauses. So, why not dothe same for grouping? A sketch of the patch is attached.As I see it, grouping in this sense works quite similarly to DISTINCT,and we have no reason to ignore it. After applying the patch, you cansee that prediction has been improved:


Hash Right Join  (cost=5.62..162.56 rows=50 width=36)

--
regards, Andrei Lepikhov

From 0c440c2dd4832c155849c347915a64b9878c6f28 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <[email protected]>
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH] Improve statistics estimation considering GROUP-BY as a
 'uniqueiser'.

This commit follows the idea of the 4767bc8ff2. GROUP-BY makes the grouped
column 'highly likely unique'. We can employ this fact in the statistics to
make more precise estimations in the upper query block.
---
 src/backend/utils/adt/selfuncs.c | 37 ++++++++++++++++++++------------
 src/include/utils/selfuncs.h     |  2 +-
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 03d7fb5f48..b6ce8978c1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -321,8 +321,8 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid 
collation,
        }
 
        /*
-        * If we matched the var to a unique index or DISTINCT clause, assume
-        * there is exactly one match regardless of anything else.  (This is
+        * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+        * assume there is exactly one match regardless of anything else.  
(This is
         * slightly bogus, since the index or clause's equality operator might 
be
         * different from ours, but it's much more likely to be right than
         * ignoring the information.)
@@ -483,8 +483,8 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid 
collation,
        }
 
        /*
-        * If we matched the var to a unique index or DISTINCT clause, assume
-        * there is exactly one match regardless of anything else.  (This is
+        * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+        * assume there is exactly one match regardless of anything else.  
(This is
         * slightly bogus, since the index or clause's equality operator might 
be
         * different from ours, but it's much more likely to be right than
         * ignoring the information.)
@@ -5005,9 +5005,9 @@ ReleaseDummy(HeapTuple tuple)
  *     atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *             commonly the same as the exposed type of the variable argument,
  *             but can be different in binary-compatible-type cases.
- *     isunique: true if we were able to match the var to a unique index or a
- *             single-column DISTINCT clause, implying its values are unique 
for
- *             this query.  (Caution: this should be trusted for statistical
+ *     isunique: true if we were able to match the var to a unique index, a
+ *             single-column DISTINCT or GROUP-BY clause, implying its values 
are
+ *             unique for this query.  (Caution: this should be trusted for 
statistical
  *             purposes only, since we do not check indimmediate nor verify 
that
  *             the exact same definition of equality applies.)
  *     acl_ok: true if current user has permission to read the column(s)
@@ -5654,7 +5654,7 @@ examine_simple_variable(PlannerInfo *root, Var *var,
                Assert(IsA(subquery, Query));
 
                /*
-                * Punt if subquery uses set operations or GROUP BY, as these 
will
+                * Punt if subquery uses set operations or grouping sets, as 
these will
                 * mash underlying columns' stats beyond recognition.  (Set ops 
are
                 * particularly nasty; if we forged ahead, we would return stats
                 * relevant to only the leftmost subselect...)  DISTINCT is also
@@ -5662,7 +5662,6 @@ examine_simple_variable(PlannerInfo *root, Var *var,
                 * of learning something even with it.
                 */
                if (subquery->setOperations ||
-                       subquery->groupClause ||
                        subquery->groupingSets)
                        return;
 
@@ -5692,6 +5691,16 @@ examine_simple_variable(PlannerInfo *root, Var *var,
                        return;
                }
 
+               /* The same idea as with DISTINCT clause works for a GROUP-BY 
too */
+               if (subquery->groupClause)
+               {
+                       if (list_length(subquery->groupClause) == 1 &&
+                               targetIsInSortList(ste, InvalidOid, 
subquery->groupClause))
+                               vardata->isunique = true;
+                       /* cannot go further */
+                       return;
+               }
+
                /*
                 * If the sub-query originated from a view with the 
security_barrier
                 * attribute, we must not look at the variable's statistics, 
though it
@@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, 
bool *isdefault)
        }
 
        /*
-        * If there is a unique index or DISTINCT clause for the variable, 
assume
-        * it is unique no matter what pg_statistic says; the statistics could 
be
-        * out of date, or we might have found a partial unique index that 
proves
-        * the var is unique for this query.  However, we'd better still believe
-        * the null-fraction statistic.
+        * If there is a unique index, DISTINCT or GROUP-BY clause for the 
variable,
+        * assume it is unique no matter what pg_statistic says; the statistics
+        * could be out of date, or we might have found a partial unique index 
that
+        * proves the var is unique for this query.  However, we'd better still
+        * believe the null-fraction statistic.
         */
        if (vardata->isunique)
                stadistinct = -1.0 * (1.0 - stanullfrac);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..b1ac81977b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -94,7 +94,7 @@ typedef struct VariableStatData
        Oid                     vartype;                /* exposed type of 
expression */
        Oid                     atttype;                /* actual type (after 
stripping relabel) */
        int32           atttypmod;              /* actual typmod (after 
stripping relabel) */
-       bool            isunique;               /* matches unique index or 
DISTINCT clause */
+       bool            isunique;               /* matches unique index, 
DISTINCT or GROUP-BY clause */
        bool            acl_ok;                 /* result of ACL check on table 
or column */
 } VariableStatData;
 
-- 
2.46.1

Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Reply via email to