Hi,

On 01/20/2016 10:49 PM, Alvaro Herrera wrote:
> Tom Lane wrote:
>> "Shulgin, Oleksandr" <oleksandr.shul...@zalando.de> writes:
>>> This post summarizes a few weeks of research of ANALYZE statistics
>>> distribution on one of our bigger production databases with some real-world
>>> data and proposes a patch to rectify some of the oddities observed.
>>
>> Please add this to the 2016-01 commitfest ...
>
> Tom, are you reviewing this for the current commitfest?

While I'm not the right Tom, I've been looking at the patch recently, so let me post the review here ...

Firstly, I'd like to say I appreciate the level of detail of the analysis. I may disagree with some of the conclusions, but I wish all my patch submissions were of such high quality.

Regarding the patch itself, I think there are a few different points raised, so let me discuss them one by one:


1) NULLs vs. MCV threshold
--------------------------

I agree that this seems like a bug, and that we should really compute the threshold using only non-NULL values. I think the analysis proves this rather conclusively, and I also think there are places where we make the same mistake (more on that at the end of the review).
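
To quantify that a bit (made-up numbers, not taken from the analysis itself): with the default statistics target we sample 30000 rows, and if 90% of those happen to be NULL, the average used to derive the threshold ends up ten times higher than it should be (ignoring the other clamps applied to mincount):

    sampled rows             = 30000   (default statistics target)
    NULLs in the sample      = 27000   (90%)
    distinct non-NULL values =   100

    avgcount over all rows      = 30000 / 100 = 300   ->  mincount = 375
    avgcount over non-NULL rows =  3000 / 100 =  30   ->  mincount = 37.5

so values occurring dozens of times in the sample get kicked out of the MCV list merely because the column is mostly NULL.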


2) mincount = 1.25 * avgcount
-----------------------------

While I share the dislike of arbitrary constants (like the 1.25 here), I do think we'd better keep it - otherwise we could just get rid of mincount entirely, as the most frequent value will always be above the (remaining) average count, making the threshold pointless.

It might have an impact in the original code, but in the new one it's quite useless (see the next point), unless I'm missing some detail.
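
Here's a toy sketch of what I mean (made-up counts, merely mimicking the structure of the loop, not the actual analyze.c code):

    int         counts[] = {50, 30, 10, 7, 3};  /* hypothetical track[i].count */
    int         sample_cnt = 100;
    int         ndistinct = 5;
    int         i;

    for (i = 0; i < 5; i++)
    {
        double      avgcount = (double) sample_cnt / (double) ndistinct;

        /*
         * track[] is sorted by frequency, so counts[i] is the largest of
         * the values still left, and a maximum can never fall below the
         * average it is part of: 50 >= 20, 30 >= 12.5, 10 >= 6.7, 7 >= 5,
         * 3 >= 3.  A bare average therefore never cuts anything; the 1.25
         * coefficient is what gives the threshold any teeth.
         */
        sample_cnt -= counts[i];
        ndistinct--;
    }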


3) modifying avgcount threshold inside the loop
-----------------------------------------------

The comment was extended with this statement:

 * We also decrease ndistinct in the process such that going forward
 * it refers to the number of distinct values left for the histogram.

and indeed, that's what's happening - at the end of each loop, we do this:

    /* Narrow our view of samples left for the histogram */
    sample_cnt -= track[i].count;
    ndistinct--;

but that immediately lowers avgcount, as it gets re-evaluated within the same loop:

    avgcount = (double) sample_cnt / (double) ndistinct;

which means it's continuously decreasing and lowering the threshold, although that's partially mitigated by keeping the 1.25 coefficient.

However, it makes reasoning about the algorithm much more complicated.
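
For illustration, a hypothetical trace (made-up counts {40, 25, 20, 10, 5}, starting with sample_cnt = 100 and ndistinct = 5):

     i | sample_cnt | ndistinct | avgcount | 1.25 * avgcount
    ---+------------+-----------+----------+-----------------
     0 |        100 |         5 |    20.00 |           25.00
     1 |         60 |         4 |    15.00 |           18.75
     2 |         35 |         3 |    11.67 |           14.58
     3 |         15 |         2 |     7.50 |            9.38
     4 |          5 |         1 |     5.00 |            6.25

Each removed entry is the most frequent one still remaining, so the average - and with it the threshold - can only drift downwards as the loop progresses.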


4) for (i = 0; /* i < num_mcv */; i++)
---------------------------------------

The way the loop is coded seems rather awkward, I guess. Not only is there an unexpected comment where the "for" condition should be, but the loop body then opens with this:

    /* Another way to say "while (i < num_mcv)" */
    if (i >= num_mcv)
        break;

Why not write it as a while loop, then? Possibly including the (num_hist >= 2) condition, like this:

    while ((i < num_mcv) && (num_hist >= 2))
    {
        ...
    }

In any case, the loop would deserve a comment explaining why we think computing the thresholds like this makes sense.


Summary
-------

Overall, I think this is really about deciding when to cut off the MCV list, so that it does not grow needlessly large - as Robert pointed out, the larger the list, the more expensive the estimation (and thus planning).

So even if we could fit the whole sample into the MCV list (i.e. we believe we've seen all the values and they would all fit), it may not make sense to do so. The ultimate goal is to estimate conditions, and if we can do that reasonably well even after cutting off the least frequent values from the MCV list, then why not?

From this point of view, the analysis deals only with the ANALYZE part and does not discuss the estimation counterpart at all.


5) ndistinct estimation vs. NULLs
---------------------------------

While looking at the patch, I started wondering whether we're actually handling NULLs correctly when estimating ndistinct. That part also uses samplerows directly and entirely ignores NULLs, as it does this:

    numer = (double) samplerows *(double) d;

    denom = (double) (samplerows - f1) +
        (double) f1 *(double) samplerows / totalrows;

    ...
    if (stadistinct > totalrows)
        stadistinct = totalrows;
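
Spelling out the formula behind that code (with n = samplerows, d = number of distinct values observed in the sample, f1 = number of values observed exactly once, N = totalrows):

    stadistinct = (n * d) / ((n - f1) + f1 * n / N)

With a mostly-NULL column, d and f1 only count the non-NULL values while n and N are dominated by the NULL rows, so f1 becomes negligible compared to n, the denominator degenerates to roughly n, and the estimate collapses to roughly d - i.e. the estimator pretty much stops extrapolating beyond the distinct values it has actually seen in the sample.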

For tables with a large fraction of NULLs, this seems to significantly underestimate the ndistinct value - for example, consider a trivial table with 95% NULL values and ~10k distinct values with a skewed distribution:

    create table t (id int);

    insert into t
    select (case when random() < 0.05 then (10000 * random() * random())
                 else null end) from generate_series(1,1000000) s(i);

In practice, there are 8325 distinct values in the data I generated:

    test=# select count(distinct id) from t;
     count
    -------
      8325
    (1 row)

But after ANALYZE with the default statistics target (100), ndistinct is estimated to be ~1300, and with target=1000 the estimate increases to ~6000.

After fixing the estimator to consider the fraction of NULLs, the estimates look like this:

    statistics target |   master  |  patched
   ------------------------------------------
                  100 |     1302  |     5356
                 1000 |     6022  |     6791

So this seems to significantly improve the ndistinct estimate (patch attached).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..7eb51e2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2428,17 +2428,21 @@ compute_scalar_stats(VacAttrStatsP stats,
 						denom,
 						stadistinct;
 
-			numer = (double) samplerows *(double) d;
+			double		samplerows_nonnull = samplerows - null_cnt;
+			double		totalrows_nonnull
+							= totalrows * (1.0 - stats->stanullfrac);
 
-			denom = (double) (samplerows - f1) +
-				(double) f1 *(double) samplerows / totalrows;
+			numer = samplerows_nonnull * (double) d;
+
+			denom = (samplerows_nonnull - f1) +
+				(double) f1 * samplerows_nonnull / totalrows_nonnull;
 
 			stadistinct = numer / denom;
 			/* Clamp to sane range in case of roundoff error */
 			if (stadistinct < (double) d)
 				stadistinct = (double) d;
-			if (stadistinct > totalrows)
-				stadistinct = totalrows;
+			if (stadistinct > totalrows_nonnull)
+				stadistinct = totalrows_nonnull;
 			stats->stadistinct = floor(stadistinct + 0.5);
 		}
 