Hi,

On 01/20/2016 10:49 PM, Alvaro Herrera wrote:
> Tom Lane wrote:
>> "Shulgin, Oleksandr" <oleksandr.shul...@zalando.de> writes:
>>> This post summarizes a few weeks of research of ANALYZE statistics
>>> distribution on one of our bigger production databases with some real-world
>>> data and proposes a patch to rectify some of the oddities observed.
>>
>> Please add this to the 2016-01 commitfest ...
>
> Tom, are you reviewing this for the current commitfest?

While I'm not the right Tom, I've been looking at the patch recently, so let me post the review here ...

Firstly, I'd like to say I appreciate the level of detail of the analysis. I may disagree with some of the conclusions, but I wish all my patch submissions were of such high quality.

Regarding the patch itself, I think there are a few different points raised, so let me discuss them one by one:


1) NULLs vs. MCV threshold
--------------------------

I agree that this seems like a bug, and that we should really compute the threshold using only non-NULL values. I think the analysis proves this rather conclusively, and I also think there are places where we make the same mistake (more on that at the end of the review).
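
To quantify that a bit (made-up numbers, not taken from the analysis itself): with the default statistics target we sample 30000 rows, and if 90% of those happen to be NULL, the average used to derive the threshold ends up ten times higher than it should be (ignoring the other clamps applied to mincount):

    sampled rows             = 30000   (default statistics target)
    NULLs in the sample      = 27000   (90%)
    distinct non-NULL values =   100

    avgcount over all rows      = 30000 / 100 = 300   ->  mincount = 375
    avgcount over non-NULL rows =  3000 / 100 =  30   ->  mincount = 37.5

so values occurring dozens of times in the sample get kicked out of the MCV list merely because the column is mostly NULL.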


2) mincount = 1.25 * avgcount
-----------------------------

While I share the dislike of arbitrary constants (like the 1.25 here), I do think we'd better keep it - otherwise we could just get rid of mincount entirely, as the most frequent value will always be above the (remaining) average count, making the threshold pointless.

It might have an impact in the original code, but in the new one it's quite useless (see the next point), unless I'm missing some detail.
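
Here's a toy sketch of what I mean (made-up counts, merely mimicking the structure of the loop, not the actual analyze.c code):

    int         counts[] = {50, 30, 10, 7, 3};  /* hypothetical track[i].count */
    int         sample_cnt = 100;
    int         ndistinct = 5;
    int         i;

    for (i = 0; i < 5; i++)
    {
        double      avgcount = (double) sample_cnt / (double) ndistinct;

        /*
         * track[] is sorted by frequency, so counts[i] is the largest of
         * the values still left, and a maximum can never fall below the
         * average it is part of: 50 >= 20, 30 >= 12.5, 10 >= 6.7, 7 >= 5,
         * 3 >= 3.  A bare average therefore never cuts anything; the 1.25
         * coefficient is what gives the threshold any teeth.
         */
        sample_cnt -= counts[i];
        ndistinct--;
    }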


3) modifying avgcount threshold inside the loop
-----------------------------------------------

The comment was extended with this statement:

 * We also decrease ndistinct in the process such that going forward
 * it refers to the number of distinct values left for the histogram.

and indeed, that's what's happening - at the end of each loop, we do this:

    /* Narrow our view of samples left for the histogram */
    sample_cnt -= track[i].count;
    ndistinct--;

but that immediately lowers avgcount, as it gets re-evaluated within the same loop:

    avgcount = (double) sample_cnt / (double) ndistinct;

which means it's continuously decreasing and lowering the threshold, although that's partially mitigated by keeping the 1.25 coefficient.

However, it makes reasoning about the algorithm much more complicated.
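
For illustration, a hypothetical trace (made-up counts {40, 25, 20, 10, 5}, starting with sample_cnt = 100 and ndistinct = 5):

     i | sample_cnt | ndistinct | avgcount | 1.25 * avgcount
    ---+------------+-----------+----------+-----------------
     0 |        100 |         5 |    20.00 |           25.00
     1 |         60 |         4 |    15.00 |           18.75
     2 |         35 |         3 |    11.67 |           14.58
     3 |         15 |         2 |     7.50 |            9.38
     4 |          5 |         1 |     5.00 |            6.25

Each removed entry is the most frequent one still remaining, so the average - and with it the threshold - can only drift downwards as the loop progresses.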


4) for (i = 0; /* i < num_mcv */; i++)
---------------------------------------

The way the loop is coded seems rather awkward, I guess. Not only is there an unexpected comment where the "for" condition should be, but the loop body then opens with this:

    /* Another way to say "while (i < num_mcv)" */
    if (i >= num_mcv)
        break;

Why not write it as a while loop, then? Possibly including the (num_hist >= 2) condition, like this:

    while ((i < num_mcv) && (num_hist >= 2))
    {
        ...
    }

In any case, the loop would deserve a comment explaining why we think computing the thresholds like this makes sense.


Summary
-------

Overall, I think this is really about deciding when to cut off the MCV list, so that it does not grow needlessly large - as Robert pointed out, the larger the list, the more expensive the estimation (and thus planning).

So even if we could fit the whole sample into the MCV list (i.e. we believe we've seen all the values and they would all fit), it may not make sense to do so. The ultimate goal is to estimate conditions, and if we can do that reasonably well even after cutting off the least frequent values from the MCV list, then why not?

From this point of view, the analysis deals only with the ANALYZE part and does not discuss the estimation counterpart at all.


5) ndistinct estimation vs. NULLs
---------------------------------

While looking at the patch, I started wondering whether we're actually handling NULLs correctly when estimating ndistinct. That part also uses samplerows directly and entirely ignores NULLs, as it does this:

    numer = (double) samplerows *(double) d;

    denom = (double) (samplerows - f1) +
        (double) f1 *(double) samplerows / totalrows;

    ...
    if (stadistinct > totalrows)
        stadistinct = totalrows;
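
Spelling out the formula behind that code (with n = samplerows, d = number of distinct values observed in the sample, f1 = number of values observed exactly once, N = totalrows):

    stadistinct = (n * d) / ((n - f1) + f1 * n / N)

With a mostly-NULL column, d and f1 only count the non-NULL values while n and N are dominated by the NULL rows, so f1 becomes negligible compared to n, the denominator degenerates to roughly n, and the estimate collapses to roughly d - i.e. the estimator pretty much stops extrapolating beyond the distinct values it has actually seen in the sample.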

For tables with a large fraction of NULLs, this seems to significantly underestimate the ndistinct value - for example, consider a trivial table with 95% NULL values and ~10k distinct values with a skewed distribution:

    create table t (id int);

    insert into t
    select (case when random() < 0.05 then (10000 * random() * random())
                 else null end) from generate_series(1,1000000) s(i);

In practice, there are 8325 distinct values in the data I generated:

    test=# select count(distinct id) from t;
     count
    -------
      8325
    (1 row)

But after ANALYZE with the default statistics target (100), ndistinct is estimated to be ~1300, and with target=1000 the estimate increases to ~6000.

After fixing the estimator to consider the fraction of NULLs, the estimates look like this:

    statistics target |   master  |  patched
   ------------------------------------------
                  100 |     1302  |     5356
                 1000 |     6022  |     6791

So this seems to significantly improve the ndistinct estimate (patch attached).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..7eb51e2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2428,17 +2428,21 @@ compute_scalar_stats(VacAttrStatsP stats,
 						denom,
 						stadistinct;
 
-			numer = (double) samplerows *(double) d;
+			double		samplerows_nonnull = samplerows - null_cnt;
+			double		totalrows_nonnull
+							= totalrows * (1.0 - stats->stanullfrac);
 
-			denom = (double) (samplerows - f1) +
-				(double) f1 *(double) samplerows / totalrows;
+			numer = samplerows_nonnull * (double) d;
+
+			denom = (samplerows_nonnull - f1) +
+				(double) f1 * samplerows_nonnull / totalrows_nonnull;
 
 			stadistinct = numer / denom;
 			/* Clamp to sane range in case of roundoff error */
 			if (stadistinct < (double) d)
 				stadistinct = (double) d;
-			if (stadistinct > totalrows)
-				stadistinct = totalrows;
+			if (stadistinct > totalrows_nonnull)
+				stadistinct = totalrows_nonnull;
 			stats->stadistinct = floor(stadistinct + 0.5);
 		}
 