Hi,

It seems to me that since the pg_stats view is supposed to be
human-readable, it would make sense to show a human-readable version
of n_distinct.
Currently, when the stats collector estimates that the number of
distinct values is more than 10% of the total row count, what is
stored in pg_statistic.stadistinct is -1 * n_distinct / totalrows, the
rationale being that if new rows are inserted in the table, they are
likely to introduce new values, and storing that value allows the
stadistinct not to get stale too fast.

You can find attached a simple WIP patch to show the proper n_distinct
value in pg_stats.

* Is this desired?
* Would it make sense to add a column in the pg_stats view to display
the information "lost", that is the fact that postgres will assume
that inserting new rows means a higher n_distinct?
* Am I right to assume that totalrows in the code
(src/backend/commands/analyze.c:2170) actually corresponds to
n_live_tup? That's what I gathered from glancing at the code, but I
might be wrong.
* Should the catalog version be changed for this kind of change?
* Should I add this patch to the commitfest?

If this patch is actually desired, I'll update the documentation as well.
I'm guessing this patch would break scripts relying on the pg_stats
view, but I do not know how much we want to avoid that, since they
should rely on the base tables rather than on the views.

Thanks in advance for your input!

Regards,
Maxence Ahlouche
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7723f01327..d0fd74075a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -194,7 +194,10 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
         stainherit AS inherited,
         stanullfrac AS null_frac,
         stawidth AS avg_width,
-        stadistinct AS n_distinct,
+        CASE
+            WHEN stadistinct >= 0 THEN stadistinct
+            ELSE -1 * stadistinct * pg_stat_get_live_tuples(c.oid)
+        END AS n_distinct,
         CASE
             WHEN stakind1 = 1 THEN stavalues1
             WHEN stakind2 = 1 THEN stavalues2
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e0f2c543ef..e3dba42aef 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2142,7 +2142,10 @@ pg_stats| SELECT n.nspname AS schemaname,
     s.stainherit AS inherited,
     s.stanullfrac AS null_frac,
     s.stawidth AS avg_width,
-    s.stadistinct AS n_distinct,
+        CASE
+            WHEN (s.stadistinct >= (0)::double precision) THEN (s.stadistinct)::double precision
+            ELSE ((('-1'::integer)::double precision * s.stadistinct) * (pg_stat_get_live_tuples(c.oid))::double precision)
+        END AS n_distinct,
         CASE
             WHEN (s.stakind1 = 1) THEN s.stavalues1
             WHEN (s.stakind2 = 1) THEN s.stavalues2

Reply via email to