Hi hackers,

Hi hackers,

Now we can create a table with a virtual generated column, but
when a condition in WHERE clause contains virtual generated column,
estimated rows are not correct since no statistics on this is
collectef.

[Ex.1]

 test=# CREATE TABLE t (i int, v int GENERATED ALWAYS AS (i+1) VIRTUAL);
 CREATE TABLE

 test=# INSERT INTO t SELECT generate_series(1,1000);
 INSERT 0 1000

 test=# INSERT INTO t SELECT 1 FROM generate_series(1,1000);
 INSERT 0 1000

 test=# EXPLAIN ANALYZE SELECT * FROM t WHERE v = 2;
                                             QUERY PLAN                         
                   
 
--------------------------------------------------------------------------------------------------
  Seq Scan on t  (cost=0.00..36.02 rows=9 width=8) (actual time=0.093..3.059 
rows=1001.00 loops=1)
    Filter: ((i + 1) = 2)
    Rows Removed by Filter: 999
    Buffers: shared hit=9
  Planning:
    Buffers: shared hit=26
  Planning Time: 1.142 ms
  Execution Time: 3.434 ms
 (8 rows)

Therefore, I would like to allow to collect statistics on virtual enerated 
columns.

I think there are at least three approaches for this.

(1) Allow the normal ANALYZE to collect statistics on virtual generated columns

ANALYZE expands virtual generated columns' expression, and collects statistics
on evaluated values. In this approach, the statistics on virtual generated 
columns
are collected in default, but ANALYZE on the table would become a bit expensive.

(2) Allow to create an index on virtual generated column

This is proposed in [1]. This proposal itself would be useful, I believe it is 
better
to provide a way to collect statistics without cost of creating an index.

[1] 
https://www.postgresql.org/message-id/flat/CACJufxGao-cypdNhifHAdt8jHfK6-HX=trbovbkgruxw063...@mail.gmail.com

(3) Allow to create extended statistics on virtual generated columns

In this approach, ANALYZE processes virtual generated columns only if 
corresponding
extended statistics are defined. Although we can create extended statistics on
expressions of virtual generated columns even in the current implementation, 
this enables
that users to create a useful statistics this just by specifying a column name 
without
specifying complex expression.

I can also think of two variations for this approach.

(3a)
At the timing when an extended statistics is created, virtual generated columns 
are
expanded, and the statistics is defined on this expression.

(3b)
At the timing when an extended statistics is created, virtual generated columns 
are
NOT expanded. The statistics is defined on the virtual generated column itself 
and,
the expression is expanded when ANALYZE processes the extended statistics.

I've attached a draft patch based on (3a).  However, if it is possible we could 
change
the definition of generated columns in future (as proposed in [2]), (3b) might 
be preferred.  

[2] 
https://www.postgresql.org/message-id/flat/cacjufxh3vetr7orf5rw29gndk3n1wwboe3wdkhyd3ipgrq9...@mail.gmail.com

Here is an example of how the patch works.

[Ex.2]

 test=# CREATE STATISTICS exstat ON v FROM t;
 CREATE STATISTICS
 test=# ANALYZE t;
 ANALYZE
 test=# EXPLAIN ANALYZE SELECT * FROM t WHERE v = 2;
                                              QUERY PLAN                        
                      
 
-----------------------------------------------------------------------------------------------------
  Seq Scan on t  (cost=0.00..41.50 rows=1001 width=8) (actual time=0.067..2.422 
rows=1001.00 loops=1)
    Filter: ((i + 1) = 2)
    Rows Removed by Filter: 999
    Buffers: shared hit=9
  Planning:
    Buffers: shared hit=14
  Planning Time: 0.785 ms
  Execution Time: 2.744 ms
 (8 rows)


What do you think of this?  Which approach of (1), (3a), or (3b) is good?
Or, completely different approach is better? 
With your feedback, I would like to progress or rework the patch.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nag...@sraoss.co.jp>
>From a6b0be714f6d4e4e0e7423f07432d3135c807a63 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Tue, 22 Apr 2025 17:03:50 +0900
Subject: [PATCH v1] Allow to create extended statistics on virtual generated
 columns

---
 src/backend/commands/statscmds.c | 86 +++++++++++++++-----------------
 1 file changed, 40 insertions(+), 46 deletions(-)

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index e24d540cd45..9b7f27fec28 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -29,6 +29,7 @@
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "rewrite/rewriteHandler.h"
 #include "statistics/statistics.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -240,28 +241,27 @@ CreateStatistics(CreateStatsStmt *stmt)
 								attname)));
 			attForm = (Form_pg_attribute) GETSTRUCT(atttuple);
 
-			/* Disallow use of system attributes in extended stats */
-			if (attForm->attnum <= 0)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("statistics creation on system columns is not supported")));
-
-			/* Disallow use of virtual generated columns in extended stats */
 			if (attForm->attgenerated == ATTRIBUTE_GENERATED_VIRTUAL)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("statistics creation on virtual generated columns is not supported")));
+				selem->expr = build_generation_expression(rel, attForm->attnum);
+			else
+			{
+				/* Disallow use of system attributes in extended stats */
+				if (attForm->attnum <= 0)
+					ereport(ERROR,
+							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("statistics creation on system columns is not supported")));
 
-			/* Disallow data types without a less-than operator */
-			type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
-			if (type->lt_opr == InvalidOid)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
-								attname, format_type_be(attForm->atttypid))));
+				/* Disallow data types without a less-than operator */
+				type = lookup_type_cache(attForm->atttypid, TYPECACHE_LT_OPR);
+				if (type->lt_opr == InvalidOid)
+					ereport(ERROR,
+							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
+									attname, format_type_be(attForm->atttypid))));
 
-			attnums[nattnums] = attForm->attnum;
-			nattnums++;
+				attnums[nattnums] = attForm->attnum;
+				nattnums++;
+			}
 			ReleaseSysCache(atttuple);
 		}
 		else if (IsA(selem->expr, Var)) /* column reference in parens */
@@ -269,30 +269,28 @@ CreateStatistics(CreateStatsStmt *stmt)
 			Var		   *var = (Var *) selem->expr;
 			TypeCacheEntry *type;
 
-			/* Disallow use of system attributes in extended stats */
-			if (var->varattno <= 0)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("statistics creation on system columns is not supported")));
-
-			/* Disallow use of virtual generated columns in extended stats */
-			if (get_attgenerated(relid, var->varattno) == ATTRIBUTE_GENERATED_VIRTUAL)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("statistics creation on virtual generated columns is not supported")));
+			if (get_attgenerated(relid, var->varattno) != ATTRIBUTE_GENERATED_VIRTUAL)
+			{
+				/* Disallow use of system attributes in extended stats */
+				if (var->varattno <= 0)
+					ereport(ERROR,
+							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("statistics creation on system columns is not supported")));
 
-			/* Disallow data types without a less-than operator */
-			type = lookup_type_cache(var->vartype, TYPECACHE_LT_OPR);
-			if (type->lt_opr == InvalidOid)
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
-								get_attname(relid, var->varattno, false), format_type_be(var->vartype))));
+				/* Disallow data types without a less-than operator */
+				type = lookup_type_cache(var->vartype, TYPECACHE_LT_OPR);
+				if (type->lt_opr == InvalidOid)
+					ereport(ERROR,
+							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("column \"%s\" cannot be used in statistics because its type %s has no default btree operator class",
+									get_attname(relid, var->varattno, false), format_type_be(var->vartype))));
 
-			attnums[nattnums] = var->varattno;
-			nattnums++;
+				attnums[nattnums] = var->varattno;
+				nattnums++;
+			}
 		}
-		else					/* expression */
+
+		if (selem->expr)
 		{
 			Node	   *expr = selem->expr;
 			Oid			atttype;
@@ -302,6 +300,8 @@ CreateStatistics(CreateStatsStmt *stmt)
 
 			Assert(expr != NULL);
 
+			expr = expand_generated_columns_in_expr(expr, rel, 1);
+
 			pull_varattnos(expr, 1, &attnums);
 
 			k = -1;
@@ -314,12 +314,6 @@ CreateStatistics(CreateStatsStmt *stmt)
 					ereport(ERROR,
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("statistics creation on system columns is not supported")));
-
-				/* Disallow use of virtual generated columns in extended stats */
-				if (get_attgenerated(relid, attnum) == ATTRIBUTE_GENERATED_VIRTUAL)
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("statistics creation on virtual generated columns is not supported")));
 			}
 
 			/*
-- 
2.34.1

Reply via email to