On 7 February 2018 at 15:58, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
> On 7 February 2018 at 15:25, Robert Haas <robertmh...@gmail.com> wrote:
>> Do you plan to press forward with this, then, or what's
>> the next step?
>
> I plan to do more testing

TL;DR: I tested this new algorithm using 9 different relative standard
error cutoffs (10%, 15%, ... 50%) against a variety of different data
distributions, and it looks like a definite improvement over the
current algorithm, for a wide range of cutoff values. Overall, it
isn't that sensitive to the exact cutoff chosen, but it looks like a
value of 20% is a good general-purpose choice.

One of the properties of the new algorithm is that the size of the MCV
list responds better to changing the stats target, so I don't think
it's worth having a separate knob to control the cutoff percentage.
Such a knob would have an effect similar to, but weaker than, that of
the stats target knob, and thus doesn't seem worthwhile.

Attached is an updated patch. I decided to move the code that
determines the minimum number of occurrences for an MCV list item out
to a separate function. It's not much code, but there's a large
comment to explain its statistical justification, and I didn't want to
duplicate that in 2 different places (possibly to become 3, with the
multivariate MCV stats patch).

I think this is ready for commit now, so barring objections, I will do
so in the next day or so.

---

I tested using the attached python script, which tests a large number
of different data distributions, comparing the results with the patch
vs those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates
against actual row counts, both for values in the MCV list, and
non-MCV values, making it possible to tell whether it would have been
better if the MCV list had been bigger or smaller.
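The attached script does the real work, but the core measurement is
just pulling the estimated and actual row counts out of each EXPLAIN
ANALYSE plan line. A minimal sketch of that step in Python (the regex
and the sample plan text here are illustrative, not lifted from the
script):

```python
import re

# Estimated rows appear in the "(cost=...)" parentheses, actual rows
# in the "(actual time=...)" parentheses of an EXPLAIN ANALYSE line.
PLAN_RE = re.compile(r"\(cost=.*? rows=(\d+) .*?\) "
                     r"\(actual time=.*? rows=(\d+) ")

def estimated_vs_actual(plan_line):
    """Return (estimated_rows, actual_rows), or None if not a plan line."""
    m = PLAN_RE.search(plan_line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

line = ("Seq Scan on foo  (cost=0.00..169.00 rows=9740 width=4) "
        "(actual time=0.030..2.115 rows=10000 loops=1)")
print(estimated_vs_actual(line))  # -> (9740, 10000)
```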

There's a lot of random variation between tests, but by running a lot
of them, the general pattern can be seen. I ran the test 9 times, with
different MCV cutoff values in the patched code of 10%, 15%, ... 50%,
then collected all the results from the test runs into a single CSV
file to analyse using SQL. The columns in the CSV are:

  test_name - A textual description of the data distribution used in the test.

  mcv_cutoff - The relative standard error cutoff percentage used (10,
15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests
against HEAD.

  stats_tgt - The stats target used (100 or 1000).

  num_mcvs - The size of the resulting MCV list.

  avg_mcv_err - The average percentage difference between estimated
and actual row counts for values in the MCV list. (There's a bit of a
fudge factor in calculating these percentages, to avoid them becoming
too large in cases where the row counts are small, but I think it's
still useful for comparison purposes.)

  max_mcv_err - The maximum percentage difference between estimated
and actual row counts for values in the MCV list.

  avg_non_mcv_err - The average percentage difference between
estimated and actual row counts for a bunch of non-MCV values.

  max_non_mcv_err - The maximum percentage difference between
estimated and actual row counts for a bunch of non-MCV values.

  avg_err - The average percentage difference between estimated and
actual row counts for all values tested.

  max_err - The maximum percentage difference between estimated and
actual row counts for all values tested.
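For reference, the error percentages are computed along these lines (a
simplified sketch; the exact fudge factor used by the attached script
may differ, the value here is purely illustrative):

```python
def err_pct(actual, estimated, fudge=10.0):
    """Percentage difference between estimated and actual row counts.

    The fudge term in the divisor keeps the percentage from blowing up
    when the actual row count is very small.
    """
    return 100.0 * abs(estimated - actual) / (actual + fudge)

print(err_pct(10000, 9740))  # modest error on a large row count
print(err_pct(2, 50))        # small count: damped by the fudge term
```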

From this, it's possible to look for cases where the MCV list is too
large or too small. For example, looking for too-many-mcv cases
(generally the more uniform data distributions):

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 50
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -40 |    41
        -50 |    40
        -25 |    39
        -15 |    38
        -20 |    38
        -30 |    38
        -35 |    38
        -45 |    37
        -10 |    37
         50 |    35
         40 |    35
         45 |    34
         35 |    33
         30 |    33
         25 |    25
         20 |    13
         15 |     6
         10 |     3
(18 rows)

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 100
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -30 |    17
        -40 |    16
        -15 |    15
        -25 |    14
        -10 |    14
        -35 |    14
        -45 |    13
        -50 |    13
        -20 |    12
         35 |    12
         45 |    11
         40 |    10
         30 |    10
         50 |    10
         25 |     6
         10 |     2
(16 rows)
(and implicitly:
         15 |     0
         20 |     0 )

So the patched code is generally better at avoiding the too-many-mcvs
problem, but an mcv_cutoff of 30% or more may be too high.

Looking for too-few-mcv cases:

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt)
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |   120
        -50 |   116
        -30 |   115
        -35 |   114
        -25 |   114
        -10 |   114
        -45 |   113
         10 |   113
        -40 |   113
        -15 |   111
         15 |    98
         20 |    49
         25 |    44
         40 |    39
         30 |    38
         45 |    37
         35 |    37
         50 |    35
(18 rows)

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 100 AND num_mcvs < stats_tgt)
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |   119
        -50 |   116
        -30 |   115
        -40 |   113
        -45 |   112
        -25 |   112
        -10 |   112
        -35 |   112
        -15 |   111
         10 |   106
         15 |    51
         20 |    45
         25 |    43
         30 |    38
         40 |    37
         35 |    36
         45 |    34
         50 |    31
(18 rows)

and looking specifically to see to what extent the problem persists
when the stats target is increased to 1000:

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_non_mcv_err - max_mcv_err > 100
   AND stats_tgt = 1000
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |    69
        -30 |    67
        -50 |    67
        -35 |    66
        -40 |    66
        -10 |    66
        -15 |    65
        -25 |    64
        -45 |    63
         10 |    60
         30 |     8
         20 |     8
         45 |     8
         15 |     8
         35 |     7
         50 |     7
         25 |     7
         40 |     6
(18 rows)

So again, the patched code is generally better at avoiding the
too-few-mcvs problem, but an mcv_cutoff of 10% is almost certainly too
low.

Looking at the effect of changing the stats target from 100 to 1000 in
the tests:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
  FROM results r1, results r2
 WHERE r2.test_name = r1.test_name
   AND r2.mcv_cutoff = r1.mcv_cutoff
   AND r1.stats_tgt = 100
   AND r2.stats_tgt = 1000
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff |  sum
------------+-------
         15 | 88582
         20 | 87124
         35 | 86828
         10 | 86749
         45 | 86632
         40 | 86552
         50 | 86384
         25 | 85567
         30 | 85352
        -25 | 28596
        -15 | 28567
        -35 | 28559
        -40 | 28517
        -20 | 28509
        -30 | 28483
        -45 | 28464
        -50 | 28452
        -10 | 28411
(18 rows)

it's clear that the patched code is much better at responding to
increasing the stats target and producing more MCVs.

Interestingly, while eyeballing the raw test output I noticed that
with HEAD there are quite a few instances where increasing the stats
target actually decreases the size of the MCV list. These are often
cases where the data is distributed more non-uniformly and the MCV
list is too small to start with:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
  FROM results r1, results r2
 WHERE r2.test_name = r1.test_name
   AND r2.mcv_cutoff = r1.mcv_cutoff
   AND r1.stats_tgt = 100
   AND r2.stats_tgt = 1000
   AND r2.num_mcvs < r1.num_mcvs
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff |  sum
------------+-------
         15 |    -4
         10 |   -11
        -35 | -5475
        -20 | -5493
        -50 | -5507
        -40 | -5510
        -45 | -5510
        -30 | -5511
        -25 | -5514
        -10 | -5528
        -15 | -5532
(11 rows)
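One way to see why HEAD responds so weakly to the stats target: its
cutoff of 25% above the average per-value count scales up with the
sample size, so raising the stats target barely changes which values
qualify, whereas the new bound is capped near 25 occurrences
regardless of sample size. A rough comparison in Python (the
ndistinct and table-size figures are made up for illustration;
PostgreSQL samples 300 * stats_target rows):

```python
def head_mincount(samplerows, ndistinct):
    # HEAD: 25% above the average per-value count, with a floor of 2
    return max(1.25 * samplerows / ndistinct, 2.0)

def patched_mincount(samplerows, totalrows):
    # Patch: lower bound derived from a 20% relative-standard-error cutoff
    n, N = float(samplerows), float(totalrows)
    denom = N - n + 0.04 * n * (N - 1)
    return n * (N - n) / denom if denom > 0 else 0.0

# stats target 100 vs 1000 => samples of 30000 vs 300000 rows
for samplerows in (30000, 300000):
    print(samplerows,
          head_mincount(samplerows, ndistinct=1000),
          patched_mincount(samplerows, totalrows=1000000))
```

With a 10x larger sample, HEAD's threshold rises from 37.5 to 375
occurrences (so roughly the same set of values qualifies), while the
patched threshold actually drops slightly, letting more values into
the list.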

I could probably keep going, querying the test result data in all
sorts of other ways (and it's attached, should anyone wish to do so,
although bear in mind that it's subject to quite large random
variations), but my general conclusions are:

- The patched code is better than HEAD at avoiding the too-many-mcvs
problem, unless the mcv_cutoff is set too high (maybe around 30% or
more).

- The patched code is much better at avoiding the too-few-mcvs
problem, unless the mcv_cutoff is set too low (10% seems to be too
low).

- The patched code is much better at producing more MCVs as the stats
target is increased, making it better able to handle more non-uniform
distributions.

- An mcv_cutoff of 20% looks like a good general-purpose value.

Regards,
Dean

Attachment: mcv-tests.tar.bz2

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
new file mode 100644
index 5f21fcb..3d6ce47
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1766,6 +1766,7 @@ static void compute_scalar_stats(VacAttr
 					 double totalrows);
 static int	compare_scalars(const void *a, const void *b, void *arg);
 static int	compare_mcvs(const void *a, const void *b);
+static double get_mincount_for_mcv_list(int samplerows, double totalrows);
 
 
 /*
@@ -2183,10 +2184,9 @@ compute_distinct_stats(VacAttrStatsP sta
 		 * Decide how many values are worth storing as most-common values. If
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
-		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.
+		 * then do so.  Otherwise, store only those values that appear
+		 * sufficiently often in the sample that it is reasonable to
+		 * extrapolate their sample frequencies to the entire table.
 		 *
 		 * Note: the first of these cases is meant to address columns with
 		 * small, fixed sets of possible values, such as boolean or enum
@@ -2196,7 +2196,7 @@ compute_distinct_stats(VacAttrStatsP sta
 		 * the MCV list is not complete, it's generally worth being more
 		 * selective, and not just filling it all the way up to the stats
 		 * target.  So for an incomplete list, we try to take only MCVs that
-		 * are significantly more common than average.
+		 * are common enough to produce reasonably accurate estimates.
 		 */
 		if (track_cnt < track_max && toowide_cnt == 0 &&
 			stats->stadistinct > 0 &&
@@ -2207,19 +2207,11 @@ compute_distinct_stats(VacAttrStatsP sta
 		}
 		else
 		{
-			double		ndistinct_table = stats->stadistinct;
-			double		avgcount,
-						mincount;
+			double		mincount;
+
+			/* Keep only the values common enough to give decent estimates */
+			mincount = get_mincount_for_mcv_list(samplerows, totalrows);
 
-			/* Re-extract estimate of # distinct nonnull values in table */
-			if (ndistinct_table < 0)
-				ndistinct_table = -ndistinct_table * totalrows;
-			/* estimate # occurrences in sample of a typical nonnull value */
-			avgcount = (double) nonnull_cnt / ndistinct_table;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
 			if (num_mcv > track_cnt)
 				num_mcv = track_cnt;
 			for (i = 0; i < num_mcv; i++)
@@ -2557,15 +2549,9 @@ compute_scalar_stats(VacAttrStatsP stats
 		 * Decide how many values are worth storing as most-common values. If
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
-		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.  Also, we won't suppress values
-		 * that have a frequency of at least 1/K where K is the intended
-		 * number of histogram bins; such values might otherwise cause us to
-		 * emit duplicate histogram bin boundaries.  (We might end up with
-		 * duplicate histogram entries anyway, if the distribution is skewed;
-		 * but we prefer to treat such values as MCVs if at all possible.)
+		 * then do so.  Otherwise, store only those values that appear
+		 * sufficiently often in the sample that it is reasonable to
+		 * extrapolate their sample frequencies to the entire table.
 		 *
 		 * Note: the first of these cases is meant to address columns with
 		 * small, fixed sets of possible values, such as boolean or enum
@@ -2575,7 +2561,7 @@ compute_scalar_stats(VacAttrStatsP stats
 		 * the MCV list is not complete, it's generally worth being more
 		 * selective, and not just filling it all the way up to the stats
 		 * target.  So for an incomplete list, we try to take only MCVs that
-		 * are significantly more common than average.
+		 * are common enough to produce reasonably accurate estimates.
 		 */
 		if (track_cnt == ndistinct && toowide_cnt == 0 &&
 			stats->stadistinct > 0 &&
@@ -2586,24 +2572,11 @@ compute_scalar_stats(VacAttrStatsP stats
 		}
 		else
 		{
-			double		ndistinct_table = stats->stadistinct;
-			double		avgcount,
-						mincount,
-						maxmincount;
+			double		mincount;
+
+			/* Keep only the values common enough to give decent estimates */
+			mincount = get_mincount_for_mcv_list(samplerows, totalrows);
 
-			/* Re-extract estimate of # distinct nonnull values in table */
-			if (ndistinct_table < 0)
-				ndistinct_table = -ndistinct_table * totalrows;
-			/* estimate # occurrences in sample of a typical nonnull value */
-			avgcount = (double) nonnull_cnt / ndistinct_table;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
-			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) values_cnt / (double) num_bins;
-			if (mincount > maxmincount)
-				mincount = maxmincount;
 			if (num_mcv > track_cnt)
 				num_mcv = track_cnt;
 			for (i = 0; i < num_mcv; i++)
@@ -2878,3 +2851,67 @@ compare_mcvs(const void *a, const void *
 
 	return da - db;
 }
+
+/*
+ * Determine the minimum number of times a value needs to appear in the sample
+ * for it to be included in the MCV list.
+ */
+static double
+get_mincount_for_mcv_list(int samplerows, double totalrows)
+{
+	/*----------
+	 * We want to keep only values that appear sufficiently often in the
+	 * sample that it is reasonable to extrapolate their sample frequencies to
+	 * the entire table.  We do this by placing an upper bound on the relative
+	 * standard error of the sample frequency, so that any estimates the
+	 * planner generates from the MCV statistics can be expected to be
+	 * reasonably accurate.
+	 *
+	 * Since we are sampling without replacement, the sample frequency of a
+	 * particular value is described by a hypergeometric distribution.  A
+	 * common rule of thumb when estimating errors in this situation is to
+	 * require at least 10 instances of the value in the sample, in which case
+	 * the distribution can be approximated by a normal distribution, and
+	 * standard error analysis techniques can be applied.  Given a sample size
+	 * of n, a population size of N, and a sample frequency of p=cnt/n, the
+	 * standard error of the proportion p is given by
+	 *		SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+	 * where the second term is the finite population correction.  To get
+	 * reasonably accurate planner estimates, we impose an upper bound on the
+	 * relative standard error of 20% -- i.e., SE/p < 0.2.  This 20% relative
+	 * error bound is fairly arbitrary, but has been found empirically to work
+	 * well.  Rearranging this formula gives a lower bound on the number of
+	 * instances of the value seen:
+	 *		cnt > n*(N-n) / (N-n+0.04*n*(N-1))
+	 * This bound is at most 25, and approaches 0 as n approaches 0 or N. The
+	 * case where n approaches 0 cannot happen in practice, since the sample
+	 * size is at least 300.  The case where n approaches N corresponds to
+	 * sampling the whole table, in which case it is reasonable to keep
+	 * the whole MCV list (have no lower bound), so it makes sense to apply
+	 * this formula for all inputs, even though the above derivation is
+	 * technically only valid when the right hand side is at least around 10.
+	 *
+	 * An alternative way to look at this formula is as follows -- assume that
+	 * the number of instances of the value seen scales up to the entire
+	 * table, so that the population count is K=N*cnt/n. Then the distribution
+	 * in the sample is a hypergeometric distribution parameterised by N, n
+	 * and K, and the bound above is mathematically equivalent to demanding
+	 * that the standard deviation of that distribution is less than 20% of
+	 * its mean.  Thus the relative errors in any planner estimates produced
+	 * from the MCV statistics are likely to be not too large.
+	 *----------
+	 */
+	double		n = samplerows;
+	double		N = totalrows;
+	double		numer,
+				denom;
+
+	numer = n * (N - n);
+	denom = N - n + 0.04 * n * (N - 1);
+
+	/* Guard against division by zero (possible if n = N = 1) */
+	if (denom == 0)
+		return 0;
+
+	return numer / denom;
+}
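For anyone who wants to experiment with the new bound without building
the patch, here's the same computation in Python (a direct
transcription of get_mincount_for_mcv_list() above):

```python
def get_mincount_for_mcv_list(samplerows, totalrows):
    """Minimum occurrences in the sample for MCV inclusion (RSE < 20%)."""
    n, N = float(samplerows), float(totalrows)
    numer = n * (N - n)
    denom = N - n + 0.04 * n * (N - 1)
    if denom == 0:          # possible if n = N = 1
        return 0.0
    return numer / denom

# The bound approaches 25 for large tables, and falls to 0 as the
# sample covers the whole table (30000 = sample for stats target 100):
for N in (30000, 100000, 1000000, 100000000):
    print(N, get_mincount_for_mcv_list(30000, N))
```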
