Hi all,
Following the discussion on improving statistics estimation by
considering GROUP BY as a unique constraint, I’ve prepared a patch that
integrates GROUP BY into cardinality estimation in a similar way to
DISTINCT.
This patch ensures that when a subquery contains a GROUP BY clause, the
optimizer recognizes the grouped columns as unique. The approach is
straightforward: the GROUP BY expressions are compared with the
subquery's target list, and if they match, the subquery's output is
known to be unique.
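For illustration, a hypothetical example (the table and data below are made up for this message, not taken from the regression tests) of the kind of query the patch affects:

```sql
-- Hypothetical setup: t is invented for this illustration.
CREATE TABLE t (x int, y int);
INSERT INTO t SELECT gs % 100, gs FROM generate_series(1, 10000) AS gs;
VACUUM ANALYZE t;

-- The subquery's target list is exactly its GROUP BY column, so with
-- this patch the planner can treat q.x as unique when estimating the
-- join size, just as it already does for SELECT DISTINCT x FROM t.
EXPLAIN SELECT * FROM t JOIN (SELECT x FROM t GROUP BY x) q USING (x);
```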
I’d appreciate any feedback or suggestions for further improvements.
---
regards,
Vlada Pogozhelskaya
On 17.02.2025 08:06, Alexander Korotkov wrote:
> On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepi...@gmail.com> wrote:
>> Thanks for taking a look!
>> On 11/25/24 23:45, Heikki Linnakangas wrote:
>>> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>>>> +	 * proves the var is unique for this query. However, we'd better still
>>>> +	 * believe the null-fraction statistic.
>>>> 	 */
>>>> 	if (vardata->isunique)
>>>> 		stadistinct = -1.0 * (1.0 - stanullfrac);
>>> I wonder about the "we'd better still believe the null-fraction
>>> statistic" part. It makes sense for a unique index, but a DISTINCT or
>>> GROUP BY collapses all the NULLs to a single row. So I think there's
>>> some more work to be done here.
>> IMO, in that particular case it is not an issue: when there is a GROUP
>> BY, we set the vardata->isunique field and do not recurse into the Var
>> statistics inside the subquery - just as DISTINCT already does. So we
>> have stanullfrac == 0, which means the optimiser doesn't count the
>> number of NULLs. In the case of a UNIQUE index, the optimiser will have
>> the stanullfrac statistic and will count NULLs.
> This sounds convincing.
>> But your question raises another one. Could we add some vardata_extra
>> to a node, which would track node-specific conditions and let upper
>> nodes take it into account together with the Var statistics?
>> For example, we could keep the 'unique set of columns' knowledge in
>> such a structure for the Aggregate node. It could also be a solution to
>> the problem of counting NULLs generated by the RHS of OUTER JOINs in
>> the query tree.
>> What's more, look at this query:
>> CREATE TABLE gu_2 (x real);
>> INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
>> INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
>> VACUUM ANALYZE gu_2;
>> HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
>>   Group Key: gu_2.x
>>   Batches: 1  Memory Usage: 40kB
>>   ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
>>         Group Key: gu_2.x
>>         Batches: 1  Memory Usage: 40kB
>>         ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
>>               Filter: ((x < '50'::double precision) OR (x IS NULL))
>>               Rows Removed by Filter: 951
>> Here we could also count the number of scanned NULLs separately in
>> vardata_extra and use it in the upper GROUP BY estimation.
> What could be the type of vardata_extra? And what information would it
> store? It still seems too sketchy for me to understand.
> But I think for now we should go with the original patch. It seems to
> be quite a straightforward extension of what 4767bc8ff2 does. I've
> revised the commit message and applied pgindent to the sources. I'm
> going to push this if there are no objections.
> ------
> Regards,
> Alexander Korotkov
> Supabase
From 7e8a7c74c4e7312c38ccb7b9c4ceb8c129d411c1 Mon Sep 17 00:00:00 2001
From: Vlada Pogozhelskaya <pogozhelsk...@gmail.com>
Date: Fri, 31 Jan 2025 11:16:28 +0700
Subject: [PATCH] Use uniqueness from GROUP BY for statistics estimation
Tags: optimizer
---
 src/backend/utils/adt/selfuncs.c | 35 ++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index f48d61010c5..e85409c1ff8 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5803,7 +5803,6 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 	 * of learning something even with it.
 	 */
 	if (subquery->setOperations ||
-		subquery->groupClause ||
 		subquery->groupingSets)
 		return;
@@ -5839,6 +5839,40 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 						 rte->eref->aliasname, var->varattno);
 		var = (Var *) ste->expr;
 
+		if (subquery->groupClause)
+		{
+			List	   *groupVars = NIL;
+			List	   *targetVars = NIL;
+			ListCell   *lc;
+
+			/* Collect unique expressions from the GROUP BY clause */
+			foreach(lc, subquery->groupClause)
+			{
+				SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+				TargetEntry *tle = get_sortgroupref_tle(sgc->tleSortGroupRef,
+														subquery->targetList);
+
+				if (tle && tle->expr)
+					groupVars = list_append_unique(groupVars, tle->expr);
+			}
+
+			/* Collect unique expressions from the target list */
+			foreach(lc, subquery->targetList)
+			{
+				TargetEntry *targetTle = (TargetEntry *) lfirst(lc);
+
+				if (targetTle && targetTle->expr)
+					targetVars = list_append_unique(targetVars,
+													targetTle->expr);
+			}
+
+			/* Output is unique if the target list matches the GROUP BY */
+			if (equal(groupVars, targetVars))
+				vardata->isunique = true;
+
+			list_free(groupVars);
+			list_free(targetVars);
+		}
+
 		/*
 		 * If subquery uses DISTINCT, we can't make use of any stats for the
 		 * variable ... but, if it's the only DISTINCT column, we are entitled
--
2.46.0