Hello Robert, Sami, Ilia, and everyone,

Thank you all for the detailed review and thoughtful feedback. I have
carefully gone through the comments and revised the patch accordingly. I
truly appreciate the guidance — it helped clarify both the design and the
implementation details.

Regarding Robert’s point about the two distinct use cases, I agree that
MISSING_STATS_ONLY and MODIFIED_STATS represent separate concerns.
MISSING_STATS_ONLY is catalog-driven and persistent in nature, ensuring
that newly created tables, newly added columns, and newly defined extended
statistics are not left without statistics. MODIFIED_STATS, on the other
hand, is more closely aligned with modification thresholds and
autoanalyze-like behavior, which is transient and threshold-based.

Keeping these concerns separate makes the semantics clearer and easier to
reason about. In particular, as Robert mentioned, MISSING_STATS_ONLY must
remain a standalone flag for scenarios such as integration into vacuumdb,
where ensuring that missing statistics are generated is the primary goal.

To align with the CommitFest process, I have created a separate entry for
this feature here:[1]

I will treat MODIFIED_STATS as a separate patch and discussion thread.

Addressing Sami’s technical comments:

1. Duplicate examine_attribute() calls

You were absolutely correct that the earlier version of the patch caused
examine_attribute() to be invoked twice for each attribute: once in
relation_has_missing_column_stats() and again during normal attribute
processing. That approach was inefficient and awkward.
In the revised version, I have removed the early relation-level column
scanning logic. Instead, the missing-statistics check is now integrated
directly inside examine_attribute(). The function now takes a boolean
missing_stats_only parameter. When this option is specified,
examine_attribute() performs the pg_statistic lookup for that specific
attribute. If statistics already exist, the function simply returns NULL,
causing the column to be skipped naturally.

This ensures:
-Each attribute is examined only once.
-The missing-statistics logic fits naturally into the existing flow.
-We avoid redundant catalog lookups.

Inside do_analyze_rel(), after collecting vacattrstats, we now check
whether:
-attr_cnt == 0
-no expression index attributes remain
-and there are no missing extended statistics
If all of these are true under MISSING_STATS_ONLY, the relation is skipped
cleanly.This follows the structure you suggested and simplifies the overall
design.

2. Test coverage

I have expanded the regression tests to cover the scenarios you
mentioned.This ensures that MISSING_STATS_ONLY behaves correctly across
more complex schema configurations.

3. Logging behavior

Thank you for the guidance on logging.I have removed the DEBUG-level elog()
calls introduced in earlier versions. Logging is now consistent with
existing ANALYZE behavior and uses ereport().
When VERBOSE mode is enabled and a relation is skipped due to
MISSING_STATS_ONLY, the output now includes a single INFO-level message of
the form:

INFO: Skipping analyzing "database.namespace.relation"

This matches the style used elsewhere in do_analyze_rel() and avoids
unnecessary additional details when no statistics are collected.There is no
additional logging when a relation is processed normally, keeping behavior
aligned with existing ANALYZE semantics.


Regarding autoanalyze and MODIFIED_STATS
As Ilia clarified, autoanalyze decisions are purely threshold-driven and do
not include missing-statistics logic. That reinforces the decision to keep
MISSING_STATS_ONLY separate and explicit.[2]

If there are further suggestions regarding semantics, naming, or additional
edge cases that should be tested, I would be very happy to incorporate them.
Thank you again for your time and detailed review.

Regards,
Vasuki M
C-DAC,Chennai

[1] https://commitfest.postgresql.org/patch/6516/
[2]
https://www.postgresql.org/message-id/flat/aZSm77WEh8pxQYtf%40nathan#1ab5c06a7d2247d90e71fb995fa21a39

>
From f05d3af46c096aaa02271ff9915a9e4a35a72e60 Mon Sep 17 00:00:00 2001
From: Vasuki M <[email protected]>
Date: Mon, 2 Mar 2026 17:59:16 +0530
Subject: [PATCH v6] ANALYZE: add MISSING_STATS_ONLY option

Introduce a new ANALYZE (MISSING_STATS_ONLY) option that causes ANALYZE
to process only relations that are missing column or extended
statistics.

When this option is specified, ANALYZE skips columns that already have
entries in pg_statistic and processes only attributes for which
statistics are absent. If all columns and extended statistics objects
already have statistics, the relation is skipped entirely.

This allows repeated manual ANALYZE runs to avoid unnecessary work
while still analyzing:

- newly created tables,
- tables that have gained data but lack statistics,
- relations where new columns were added,
- relations with missing extended statistics.

The default ANALYZE behavior remains unchanged.

Regression tests are included to cover:

- empty tables,
- tables gaining data,
- repeated runs,
- explicit column lists,
- expression indexes,
- declarative partitioning,
- table inheritance.

This patch refines earlier versions by moving the missing-statistics
logic into examine_attribute() to ensure that column-level decisions
are handled consistently within the existing analysis flow and ensuring
correct handling of inheritance and partitioned tables.
---
 src/backend/commands/analyze.c                | 134 +++++++++++++++++-
 src/backend/commands/vacuum.c                 |   6 +-
 src/include/commands/vacuum.h                 |   1 +
 .../expected/analyze_missing_stats_only.out   |  97 +++++++++++++
 src/test/regress/parallel_schedule            |   1 +
 .../sql/analyze_missing_stats_only.sql        | 109 ++++++++++++++
 6 files changed, 342 insertions(+), 6 deletions(-)
 create mode 100644 src/test/regress/expected/analyze_missing_stats_only.out
 create mode 100644 src/test/regress/sql/analyze_missing_stats_only.sql

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 53adac91..928ede4f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -29,6 +29,8 @@
 #include "catalog/index.h"
 #include "catalog/indexing.h"
 #include "catalog/pg_inherits.h"
+#include "catalog/pg_statistic_ext.h"
+#include "catalog/pg_statistic_ext_data.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
@@ -46,6 +48,7 @@
 #include "storage/procarray.h"
 #include "utils/attoptcache.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -83,7 +86,7 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								HeapTuple *rows, int numrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
-									   Node *index_expr);
+									   Node *index_expr,bool missing_stats_only);
 static int	acquire_sample_rows(Relation onerel, int elevel,
 								HeapTuple *rows, int targrows,
 								double *totalrows, double *totaldeadrows);
@@ -96,6 +99,54 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static bool
+relation_has_missing_extended_stats(Relation rel)
+{
+	Relation extrel;
+	SysScanDesc scan;
+	ScanKeyData key;
+	HeapTuple tup;
+
+	extrel = table_open(StatisticExtRelationId, AccessShareLock);
+
+	ScanKeyInit(&key,
+			Anum_pg_statistic_ext_stxrelid,
+			BTEqualStrategyNumber,
+			F_OIDEQ,
+			ObjectIdGetDatum(RelationGetRelid(rel)));
+
+	scan = systable_beginscan(extrel,
+					StatisticExtRelidIndexId,
+					true,
+					NULL,
+					1,
+					&key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_statistic_ext e =
+			(Form_pg_statistic_ext) GETSTRUCT(tup);
+
+		HeapTuple dtup =
+			SearchSysCache2(STATEXTDATASTXOID,
+				ObjectIdGetDatum(e->oid),
+				BoolGetDatum(false));
+
+		if (!HeapTupleIsValid(dtup))
+		{
+			systable_endscan(scan);
+			table_close(extrel, AccessShareLock);
+			return true;
+		}
+
+		ReleaseSysCache(dtup);
+	}
+
+	systable_endscan(scan);
+	table_close(extrel, AccessShareLock);
+
+	return false;
+}
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -399,7 +450,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 								col, RelationGetRelationName(onerel))));
 			unique_cols = bms_add_member(unique_cols, i);
 
-			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL,(params.options & VACOPT_MISSING_STATS_ONLY));
 			if (vacattrstats[tcnt] != NULL)
 				tcnt++;
 		}
@@ -413,7 +464,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 		tcnt = 0;
 		for (i = 1; i <= attr_cnt; i++)
 		{
-			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL,(params.options & VACOPT_MISSING_STATS_ONLY));
 			if (vacattrstats[tcnt] != NULL)
 				tcnt++;
 		}
@@ -482,7 +533,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 						indexpr_item = lnext(indexInfo->ii_Expressions,
 											 indexpr_item);
 						thisdata->vacattrstats[tcnt] =
-							examine_attribute(Irel[ind], i + 1, indexkey);
+							examine_attribute(Irel[ind], i + 1, indexkey,(params.options & VACOPT_MISSING_STATS_ONLY));
 						if (thisdata->vacattrstats[tcnt] != NULL)
 							tcnt++;
 					}
@@ -492,6 +543,40 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 		}
 	}
 
+	/*
+	 * If ANALYZE (MISSING_STATS_ONLY) and nothing to analyze,
+	 * skip relation.
+	 */
+	if ((params.options & VACOPT_MISSING_STATS_ONLY) &&
+		attr_cnt == 0)
+	{
+		bool has_missing_ext = relation_has_missing_extended_stats(onerel);
+
+		bool has_expr = false;
+
+		for (ind = 0; ind < nindexes; ind++)
+		{
+			if (indexdata[ind].attr_cnt > 0)
+			{
+				has_expr = true;
+				break;
+			}
+		}
+
+		if (!has_expr && !has_missing_ext)
+		{
+			if (verbose)
+				ereport(INFO,
+					(errmsg("skipping analyzing \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						get_namespace_name(RelationGetNamespace(onerel)),
+						RelationGetRelationName(onerel))));
+
+			vac_close_indexes(nindexes, Irel, NoLock);
+			return;
+		}
+	}
+
 	/*
 	 * Determine how many rows we need to sample, using the worst case from
 	 * all analyzable columns.  We use a lower bound of 100 rows to avoid
@@ -1039,7 +1124,7 @@ compute_index_stats(Relation onerel, double totalrows,
  * and index_expr is the expression tree representing the column's data.
  */
 static VacAttrStats *
-examine_attribute(Relation onerel, int attnum, Node *index_expr)
+examine_attribute(Relation onerel, int attnum, Node *index_expr, bool missing_stats_only)
 {
 	Form_pg_attribute attr = TupleDescAttr(onerel->rd_att, attnum - 1);
 	int			attstattarget;
@@ -1076,6 +1161,45 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	if (attstattarget == 0)
 		return NULL;
 
+	/*
+	 * If ANALYZE (MISSING_STATS_ONLY),
+	 * skip attributes that already have stats.
+	 */
+	if (missing_stats_only)
+	{
+		HeapTuple statstup;
+		bool has_stats = false;
+
+		/* Try stainherit = false */
+		statstup = SearchSysCache3(STATRELATTINH,
+						ObjectIdGetDatum(RelationGetRelid(onerel)),
+						Int16GetDatum(attnum),
+						BoolGetDatum(false));
+
+		if (HeapTupleIsValid(statstup))
+		{
+			has_stats = true;
+			ReleaseSysCache(statstup);
+		}
+		else
+		{
+			/* Try stainherit = true (partition parents) */
+			statstup = SearchSysCache3(STATRELATTINH,
+							ObjectIdGetDatum(RelationGetRelid(onerel)),
+							Int16GetDatum(attnum),
+							BoolGetDatum(true));
+
+			if (HeapTupleIsValid(statstup))
+			{
+				has_stats = true;
+				ReleaseSysCache(statstup);
+			}
+		}
+		if (has_stats)
+			return NULL;   /* skip attribute */
+
+	}
+
 	/*
 	 * Create the VacAttrStats struct.
 	 */
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 03932f45..d031f89b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -165,6 +165,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	BufferAccessStrategy bstrategy = NULL;
 	bool		verbose = false;
 	bool		skip_locked = false;
+	bool		missing_stats_only = false;
 	bool		analyze = false;
 	bool		freeze = false;
 	bool		full = false;
@@ -229,6 +230,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 
 			ring_size = result;
 		}
+		else if (strcmp(opt->defname, "missing_stats_only") == 0)
+			missing_stats_only = defGetBoolean(opt);
 		else if (!vacstmt->is_vacuumcmd)
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -305,6 +308,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(vacstmt->is_vacuumcmd ? VACOPT_VACUUM : VACOPT_ANALYZE) |
 		(verbose ? VACOPT_VERBOSE : 0) |
 		(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
+		(missing_stats_only ? VACOPT_MISSING_STATS_ONLY :0) |
 		(analyze ? VACOPT_ANALYZE : 0) |
 		(freeze ? VACOPT_FREEZE : 0) |
 		(full ? VACOPT_FULL : 0) |
@@ -315,7 +319,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(only_database_stats ? VACOPT_ONLY_DATABASE_STATS : 0);
 
 	/* sanity checks on options */
-	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE));
+	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE | VACOPT_MISSING_STATS_ONLY));
 	Assert((params.options & VACOPT_VACUUM) ||
 		   !(params.options & (VACOPT_FULL | VACOPT_FREEZE)));
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e885a4b9..4ca45fe6 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -188,6 +188,7 @@ typedef struct VacAttrStats
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x100	/* don't skip any pages */
 #define VACOPT_SKIP_DATABASE_STATS 0x200	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x400	/* only vac_update_datfrozenxid() */
+#define VACOPT_MISSING_STATS_ONLY	0x800	/* ANALYZE if stats are missing */
 
 /*
  * Values used by index_cleanup and truncate params.
diff --git a/src/test/regress/expected/analyze_missing_stats_only.out b/src/test/regress/expected/analyze_missing_stats_only.out
new file mode 100644
index 00000000..cde00dc1
--- /dev/null
+++ b/src/test/regress/expected/analyze_missing_stats_only.out
@@ -0,0 +1,97 @@
+--
+-- ANALYZE (MISSING_STATS_ONLY) regression test
+--
+-- 1. Brand new empty table
+CREATE TABLE ms1 (a int);
+-- No statistics yet → should analyze
+ANALYZE (MISSING_STATS_ONLY) ms1;
+-- pg_statistic should still be empty (table has no rows)
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         0
+(1 row)
+
+-- 2. Insert data
+INSERT INTO ms1 SELECT generate_series(1,10);
+-- Stats missing → should analyze
+ANALYZE (MISSING_STATS_ONLY) ms1;
+-- Now stats must exist
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         1
+(1 row)
+
+-- 3. Re-run → stats exist → should skip
+ANALYZE (MISSING_STATS_ONLY) ms1;
+-- Stats count should remain the same
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         1
+(1 row)
+
+-- 4. Add new column → missing stats for column b
+ALTER TABLE ms1 ADD COLUMN b int;
+-- Should analyze again
+ANALYZE (MISSING_STATS_ONLY) ms1;
+-- Now both columns should have stats
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         2
+(1 row)
+
+DROP TABLE ms1;
+-- 5. ANALYZE with explicit column list
+CREATE TABLE ms_cols (a int, b int);
+INSERT INTO ms_cols SELECT i, i FROM generate_series(1,10) i;
+ANALYZE (MISSING_STATS_ONLY) ms_cols;
+-- Re-run for column a only → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_cols (a);
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms_cols'::regclass;
+ stat_rows 
+-----------
+         2
+(1 row)
+
+DROP TABLE ms_cols;
+-- 6. Expression index
+CREATE TABLE ms_expr (a int);
+INSERT INTO ms_expr SELECT generate_series(1,10);
+CREATE INDEX ms_expr_idx ON ms_expr ((a + 1));
+ANALYZE (MISSING_STATS_ONLY) ms_expr;
+-- Re-run → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_expr;
+DROP TABLE ms_expr;
+-- 7. Declarative partitioning
+CREATE TABLE ms_part (a int) PARTITION BY RANGE (a);
+CREATE TABLE ms_part_1 PARTITION OF ms_part
+    FOR VALUES FROM (0) TO (100);
+INSERT INTO ms_part SELECT generate_series(1,10);
+-- Should analyze parent and partition correctly
+ANALYZE (MISSING_STATS_ONLY) ms_part;
+-- Re-run → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_part;
+DROP TABLE ms_part;
+-- 8. Table inheritance
+CREATE TABLE ms_parent (a int);
+CREATE TABLE ms_child () INHERITS (ms_parent);
+INSERT INTO ms_child SELECT generate_series(1,10);
+-- Analyze parent with inheritance
+ANALYZE (MISSING_STATS_ONLY) ms_parent;
+-- Re-run → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_parent;
+DROP TABLE ms_child;
+DROP TABLE ms_parent;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 549e9b2d..a64be382 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -90,6 +90,7 @@ test: rules psql psql_crosstab psql_pipeline amutils stats_ext collate.linux.utf
 test: select_parallel
 test: write_parallel
 test: vacuum_parallel
+test: analyze_missing_stats_only
 
 # Run this alone, because concurrent DROP TABLE would make non-superuser
 # "ANALYZE;" fail with "relation with OID $n does not exist".
diff --git a/src/test/regress/sql/analyze_missing_stats_only.sql b/src/test/regress/sql/analyze_missing_stats_only.sql
new file mode 100644
index 00000000..d34f1687
--- /dev/null
+++ b/src/test/regress/sql/analyze_missing_stats_only.sql
@@ -0,0 +1,109 @@
+--
+-- ANALYZE (MISSING_STATS_ONLY) regression test
+--
+
+-- 1. Brand new empty table
+CREATE TABLE ms1 (a int);
+
+-- No statistics yet → should analyze
+ANALYZE (MISSING_STATS_ONLY) ms1;
+
+-- pg_statistic should still be empty (table has no rows)
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- 2. Insert data
+INSERT INTO ms1 SELECT generate_series(1,10);
+
+-- Stats missing → should analyze
+ANALYZE (MISSING_STATS_ONLY) ms1;
+
+-- Now stats must exist
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- 3. Re-run → stats exist → should skip
+ANALYZE (MISSING_STATS_ONLY) ms1;
+
+-- Stats count should remain the same
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- 4. Add new column → missing stats for column b
+ALTER TABLE ms1 ADD COLUMN b int;
+
+-- Should analyze again
+ANALYZE (MISSING_STATS_ONLY) ms1;
+
+-- Now both columns should have stats
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+DROP TABLE ms1;
+
+-- 5. ANALYZE with explicit column list
+
+CREATE TABLE ms_cols (a int, b int);
+INSERT INTO ms_cols SELECT i, i FROM generate_series(1,10) i;
+
+ANALYZE (MISSING_STATS_ONLY) ms_cols;
+
+-- Re-run for column a only → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_cols (a);
+
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms_cols'::regclass;
+
+DROP TABLE ms_cols;
+
+-- 6. Expression index
+
+CREATE TABLE ms_expr (a int);
+INSERT INTO ms_expr SELECT generate_series(1,10);
+
+CREATE INDEX ms_expr_idx ON ms_expr ((a + 1));
+
+ANALYZE (MISSING_STATS_ONLY) ms_expr;
+
+-- Re-run → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_expr;
+
+DROP TABLE ms_expr;
+
+-- 7. Declarative partitioning
+
+CREATE TABLE ms_part (a int) PARTITION BY RANGE (a);
+
+CREATE TABLE ms_part_1 PARTITION OF ms_part
+    FOR VALUES FROM (0) TO (100);
+
+INSERT INTO ms_part SELECT generate_series(1,10);
+
+-- Should analyze parent and partition correctly
+ANALYZE (MISSING_STATS_ONLY) ms_part;
+
+-- Re-run → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_part;
+
+DROP TABLE ms_part;
+
+-- 8. Table inheritance
+
+CREATE TABLE ms_parent (a int);
+CREATE TABLE ms_child () INHERITS (ms_parent);
+
+INSERT INTO ms_child SELECT generate_series(1,10);
+
+-- Analyze parent with inheritance
+ANALYZE (MISSING_STATS_ONLY) ms_parent;
+
+-- Re-run → should skip
+ANALYZE (MISSING_STATS_ONLY) ms_parent;
+
+DROP TABLE ms_child;
+DROP TABLE ms_parent;
-- 
2.43.0

Reply via email to