On 7 February 2018 at 15:58, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
> On 7 February 2018 at 15:25, Robert Haas <robertmh...@gmail.com> wrote:
>> Do you plan to press forward with this, then, or what's
>> the next step?
>
> I plan to do more testing
TL;DR: I tested this new algorithm using 9 different relative standard
error cutoffs (10%, 15%, ... 50%) against a variety of different data
distributions, and it looks like a definite improvement over the
current algorithm, for a wide range of cutoff values. Overall, it
isn't that sensitive to the exact cutoff chosen, but it looks like a
value of 20% is a good general-purpose choice.

One of the properties of the new algorithm is that the size of the MCV
list responds better to changing the stats target, so I don't think
it's worth having a separate knob to control the cutoff percentage.
Such a knob would have a similar, but weaker, effect than the stats
target knob, and thus doesn't seem worthwhile.

Attached is an updated patch. I decided to move the code that
determines the minimum number of occurrences for an MCV list item out
to a separate function. It's not much code, but there's a large
comment to explain its statistical justification, and I didn't want to
duplicate that in 2 different places (possibly to become 3, with the
multivariate MCV stats patch).

I think this is ready for commit now, so barring objections, I will do
so in the next day or so.

---

I tested using the attached python script, which tests a large number
of different data distributions, comparing the results with the patch
vs those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates
against actual row counts, both for values in the MCV list and for
non-MCV values, making it possible to tell whether it would have been
better if the MCV list had been bigger or smaller. There's a lot of
random variation between tests, but by running a lot of them, the
general pattern can be seen.

I ran the test 9 times, with different MCV cutoff values in the
patched code of 10%, 15%, ... 50%, then collected all the results from
the test runs into a single CSV file to analyse using SQL. The columns
in the CSV are:

test_name       - A textual description of the data distribution used
                  in the test.
mcv_cutoff      - The relative standard error cutoff percentage used
                  (10, 15, 20, ... 50), or -10, -15, ... -50 for the
                  corresponding tests against HEAD.

stats_tgt       - The stats target used (100 or 1000).

num_mcvs        - The size of the resulting MCV list.

avg_mcv_err     - The average percentage difference between estimated
                  and actual row counts for values in the MCV list.
                  (There's a bit of a fudge factor in calculating these
                  percentages, to avoid them becoming too large in
                  cases where the row counts are small, but I think
                  it's still useful for comparison purposes.)

max_mcv_err     - The maximum percentage difference between estimated
                  and actual row counts for values in the MCV list.

avg_non_mcv_err - The average percentage difference between estimated
                  and actual row counts for a bunch of non-MCV values.

max_non_mcv_err - The maximum percentage difference between estimated
                  and actual row counts for a bunch of non-MCV values.

avg_err         - The average percentage difference between estimated
                  and actual row counts for all values tested.

max_err         - The maximum percentage difference between estimated
                  and actual row counts for all values tested.

From this, it's possible to look for cases where the MCV list is too
large or too small.
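To illustrate the kind of damping meant by "a bit of a fudge factor" above, here is a hypothetical sketch (this is not the attached script's actual code; the function name and the clamping constant are illustrative assumptions only):

```python
def pct_error(estimated, actual, clamp=100):
    """Hypothetical error metric: percentage difference between estimated
    and actual row counts, with the denominator clamped from below so
    that very small actual row counts can't blow the percentage up."""
    return 100.0 * abs(estimated - actual) / max(actual, clamp)
```

With a clamp like this, an estimate of 5 rows against an actual count of 1 scores a few percent rather than 400%, while errors on large row counts are unaffected.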
For example, looking for too-many-mcv cases (generally the more
uniform data distributions):

SELECT mcv_cutoff, count(*) FROM results
 WHERE max_mcv_err - max_non_mcv_err > 50
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -40 |    41
        -50 |    40
        -25 |    39
        -15 |    38
        -20 |    38
        -30 |    38
        -35 |    38
        -45 |    37
        -10 |    37
         50 |    35
         40 |    35
         45 |    34
         35 |    33
         30 |    33
         25 |    25
         20 |    13
         15 |     6
         10 |     3
(18 rows)

SELECT mcv_cutoff, count(*) FROM results
 WHERE max_mcv_err - max_non_mcv_err > 100
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -30 |    17
        -40 |    16
        -15 |    15
        -25 |    14
        -10 |    14
        -35 |    14
        -45 |    13
        -50 |    13
        -20 |    12
         35 |    12
         45 |    11
         40 |    10
         30 |    10
         50 |    10
         25 |     6
         10 |     2
(16 rows)

( and implicitly:
         15 |     0
         20 |     0 )

So the patched code is generally better at avoiding the too-many-mcvs
problem, but an mcv_cutoff of 30% or more may be too high.

Looking for too-few-mcv cases:

SELECT mcv_cutoff, count(*) FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt)
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |   120
        -50 |   116
        -30 |   115
        -35 |   114
        -25 |   114
        -10 |   114
        -45 |   113
         10 |   113
        -40 |   113
        -15 |   111
         15 |    98
         20 |    49
         25 |    44
         40 |    39
         30 |    38
         45 |    37
         35 |    37
         50 |    35
(18 rows)

SELECT mcv_cutoff, count(*) FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 100 AND num_mcvs < stats_tgt)
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |   119
        -50 |   116
        -30 |   115
        -40 |   113
        -45 |   112
        -25 |   112
        -10 |   112
        -35 |   112
        -15 |   111
         10 |   106
         15 |    51
         20 |    45
         25 |    43
         30 |    38
         40 |    37
         35 |    36
         45 |    34
         50 |    31
(18 rows)

and looking specifically to see to what extent the problem persists
when the stats target is increased to 1000:

SELECT mcv_cutoff, count(*) FROM results
 WHERE max_non_mcv_err - max_mcv_err > 100 AND stats_tgt = 1000
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |    69
        -30 |    67
        -50 |    67
        -35 |    66
        -40 |    66
        -10 |    66
        -15 |    65
        -25 |    64
        -45 |    63
         10 |    60
         30 |     8
         20 |     8
         45 |     8
         15 |     8
         35 |     7
         50 |     7
         25 |     7
         40 |     6
(18 rows)

So again, the patched code is generally better at avoiding the
too-few-mcvs problem, but an mcv_cutoff of 10% is almost certainly
too low.

Looking at the effect of changing the stats target from 100 to 1000 in
the tests:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
  FROM results r1, results r2
 WHERE r2.test_name = r1.test_name
   AND r2.mcv_cutoff = r1.mcv_cutoff
   AND r1.stats_tgt = 100 AND r2.stats_tgt = 1000
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff |  sum
------------+-------
         15 | 88582
         20 | 87124
         35 | 86828
         10 | 86749
         45 | 86632
         40 | 86552
         50 | 86384
         25 | 85567
         30 | 85352
        -25 | 28596
        -15 | 28567
        -35 | 28559
        -40 | 28517
        -20 | 28509
        -30 | 28483
        -45 | 28464
        -50 | 28452
        -10 | 28411
(18 rows)

it's clear that the patched code is much better at responding to
increasing the stats target and producing more MCVs.

Interestingly, I noticed while eyeballing the raw test output that
with HEAD there are quite a few instances where increasing the stats
target actually decreases the size of the MCV list, and also these are
often in cases where the data is more non-uniformly distributed, and
the MCV list is too small to start with:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
  FROM results r1, results r2
 WHERE r2.test_name = r1.test_name
   AND r2.mcv_cutoff = r1.mcv_cutoff
   AND r1.stats_tgt = 100 AND r2.stats_tgt = 1000
   AND r2.num_mcvs < r1.num_mcvs
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff |  sum
------------+-------
         15 |    -4
         10 |   -11
        -35 | -5475
        -20 | -5493
        -50 | -5507
        -40 | -5510
        -45 | -5510
        -30 | -5511
        -25 | -5514
        -10 | -5528
        -15 | -5532
(11 rows)

I could probably keep going, querying the test result data in all
sorts of other ways (and it's attached, should anyone wish to do so,
although bear in mind that it's subject to quite large random
variations), but my general conclusions are:

- The patched code is better than HEAD at avoiding the too-many-mcvs
  problem, unless the mcv_cutoff is set too high (maybe around 30% or
  more).
- The patched code is much better at avoiding the too-few-mcvs
  problem, unless the mcv_cutoff is set too low (10% seems to be too
  low).

- The patched code is much better at producing more MCVs as the stats
  target is increased, making it better able to handle more
  non-uniform distributions.

- An mcv_cutoff of 20% looks like a good general-purpose value.

Regards,
Dean
mcv-tests.tar.bz2
Description: application/bzip
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
new file mode 100644
index 5f21fcb..3d6ce47
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1766,6 +1766,7 @@ static void compute_scalar_stats(VacAttr
                     double totalrows);
 static int  compare_scalars(const void *a, const void *b, void *arg);
 static int  compare_mcvs(const void *a, const void *b);
+static double get_mincount_for_mcv_list(int samplerows, double totalrows);
 
 
 /*
@@ -2183,10 +2184,9 @@ compute_distinct_stats(VacAttrStatsP sta
      * Decide how many values are worth storing as most-common values. If
      * we are able to generate a complete MCV list (all the values in the
      * sample will fit, and we think these are all the ones in the table),
-     * then do so. Otherwise, store only those values that are
-     * significantly more common than the (estimated) average. We set the
-     * threshold rather arbitrarily at 25% more than average, with at
-     * least 2 instances in the sample.
+     * then do so. Otherwise, store only those values that appear
+     * sufficiently often in the sample that it is reasonable to
+     * extrapolate their sample frequencies to the entire table.
      *
      * Note: the first of these cases is meant to address columns with
      * small, fixed sets of possible values, such as boolean or enum
@@ -2196,7 +2196,7 @@ compute_distinct_stats(VacAttrStatsP sta
      * the MCV list is not complete, it's generally worth being more
      * selective, and not just filling it all the way up to the stats
      * target. So for an incomplete list, we try to take only MCVs that
-     * are significantly more common than average.
+     * are common enough to produce reasonably accurate estimates.
      */
     if (track_cnt < track_max && toowide_cnt == 0 &&
         stats->stadistinct > 0 &&
@@ -2207,19 +2207,11 @@ compute_distinct_stats(VacAttrStatsP sta
     }
     else
     {
-        double      ndistinct_table = stats->stadistinct;
-        double      avgcount,
-                    mincount;
+        double      mincount;
+
+        /* Keep only the values common enough to give decent estimates */
+        mincount = get_mincount_for_mcv_list(samplerows, totalrows);
 
-        /* Re-extract estimate of # distinct nonnull values in table */
-        if (ndistinct_table < 0)
-            ndistinct_table = -ndistinct_table * totalrows;
-        /* estimate # occurrences in sample of a typical nonnull value */
-        avgcount = (double) nonnull_cnt / ndistinct_table;
-        /* set minimum threshold count to store a value */
-        mincount = avgcount * 1.25;
-        if (mincount < 2)
-            mincount = 2;
         if (num_mcv > track_cnt)
             num_mcv = track_cnt;
         for (i = 0; i < num_mcv; i++)
@@ -2557,15 +2549,9 @@ compute_scalar_stats(VacAttrStatsP stats
      * Decide how many values are worth storing as most-common values. If
      * we are able to generate a complete MCV list (all the values in the
      * sample will fit, and we think these are all the ones in the table),
-     * then do so. Otherwise, store only those values that are
-     * significantly more common than the (estimated) average. We set the
-     * threshold rather arbitrarily at 25% more than average, with at
-     * least 2 instances in the sample. Also, we won't suppress values
-     * that have a frequency of at least 1/K where K is the intended
-     * number of histogram bins; such values might otherwise cause us to
-     * emit duplicate histogram bin boundaries. (We might end up with
-     * duplicate histogram entries anyway, if the distribution is skewed;
-     * but we prefer to treat such values as MCVs if at all possible.)
+     * then do so. Otherwise, store only those values that appear
+     * sufficiently often in the sample that it is reasonable to
+     * extrapolate their sample frequencies to the entire table.
      *
      * Note: the first of these cases is meant to address columns with
      * small, fixed sets of possible values, such as boolean or enum
@@ -2575,7 +2561,7 @@ compute_scalar_stats(VacAttrStatsP stats
      * the MCV list is not complete, it's generally worth being more
      * selective, and not just filling it all the way up to the stats
      * target. So for an incomplete list, we try to take only MCVs that
-     * are significantly more common than average.
+     * are common enough to produce reasonably accurate estimates.
      */
     if (track_cnt == ndistinct && toowide_cnt == 0 &&
         stats->stadistinct > 0 &&
@@ -2586,24 +2572,11 @@ compute_scalar_stats(VacAttrStatsP stats
     }
     else
     {
-        double      ndistinct_table = stats->stadistinct;
-        double      avgcount,
-                    mincount,
-                    maxmincount;
+        double      mincount;
+
+        /* Keep only the values common enough to give decent estimates */
+        mincount = get_mincount_for_mcv_list(samplerows, totalrows);
 
-        /* Re-extract estimate of # distinct nonnull values in table */
-        if (ndistinct_table < 0)
-            ndistinct_table = -ndistinct_table * totalrows;
-        /* estimate # occurrences in sample of a typical nonnull value */
-        avgcount = (double) nonnull_cnt / ndistinct_table;
-        /* set minimum threshold count to store a value */
-        mincount = avgcount * 1.25;
-        if (mincount < 2)
-            mincount = 2;
-        /* don't let threshold exceed 1/K, however */
-        maxmincount = (double) values_cnt / (double) num_bins;
-        if (mincount > maxmincount)
-            mincount = maxmincount;
         if (num_mcv > track_cnt)
             num_mcv = track_cnt;
         for (i = 0; i < num_mcv; i++)
@@ -2878,3 +2851,67 @@ compare_mcvs(const void *a, const void *
 
     return da - db;
 }
+
+/*
+ * Determine the minimum number of times a value needs to appear in the sample
+ * for it to be included in the MCV list.
+ */
+static double
+get_mincount_for_mcv_list(int samplerows, double totalrows)
+{
+    /*----------
+     * We want to keep only values that appear sufficiently often in the
+     * sample that it is reasonable to extrapolate their sample frequencies to
+     * the entire table.  We do this by placing an upper bound on the relative
+     * standard error of the sample frequency, so that any estimates the
+     * planner generates from the MCV statistics can be expected to be
+     * reasonably accurate.
+     *
+     * Since we are sampling without replacement, the sample frequency of a
+     * particular value is described by a hypergeometric distribution.  A
+     * common rule of thumb when estimating errors in this situation is to
+     * require at least 10 instances of the value in the sample, in which case
+     * the distribution can be approximated by a normal distribution, and
+     * standard error analysis techniques can be applied.  Given a sample size
+     * of n, a population size of N, and a sample frequency of p=cnt/n, the
+     * standard error of the proportion p is given by
+     *      SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+     * where the second term is the finite population correction.  To get
+     * reasonably accurate planner estimates, we impose an upper bound on the
+     * relative standard error of 20% -- i.e., SE/p < 0.2.  This 20% relative
+     * error bound is fairly arbitrary, but has been found empirically to work
+     * well.  Rearranging this formula gives a lower bound on the number of
+     * instances of the value seen:
+     *      cnt > n*(N-n) / (N-n+0.04*n*(N-1))
+     * This bound is at most 25, and approaches 0 as n approaches 0 or N.  The
+     * case where n approaches 0 cannot happen in practice, since the sample
+     * size is at least 300.  The case where n approaches N corresponds to
+     * sampling the whole table, in which case it is reasonable to keep
+     * the whole MCV list (have no lower bound), so it makes sense to apply
+     * this formula for all inputs, even though the above derivation is
+     * technically only valid when the right hand side is at least around 10.
+     *
+     * An alternative way to look at this formula is as follows -- assume that
+     * the number of instances of the value seen scales up to the entire
+     * table, so that the population count is K=N*cnt/n.  Then the distribution
+     * in the sample is a hypergeometric distribution parameterised by N, n
+     * and K, and the bound above is mathematically equivalent to demanding
+     * that the standard deviation of that distribution is less than 20% of
+     * its mean.  Thus the relative errors in any planner estimates produced
+     * from the MCV statistics are likely to be not too large.
+     *----------
+     */
+    double      n = samplerows;
+    double      N = totalrows;
+    double      numer,
+                denom;
+
+    numer = n * (N - n);
+    denom = N - n + 0.04 * n * (N - 1);
+
+    /* Guard against division by zero (possible if n = N = 1) */
+    if (denom == 0)
+        return 0;
+
+    return numer / denom;
+}
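For anyone who wants to play with the cutoff formula outside the backend, here's a small Python transcription of the same computation (my own sketch, not part of the patch), together with a sanity check that a value occurring exactly mincount times in the sample sits right on the 20% relative standard error bound:

```python
import math

def mincount_for_mcv_list(samplerows, totalrows):
    # Transcription of get_mincount_for_mcv_list() from the patch:
    # minimum number of occurrences in the sample for a value to be
    # included in the MCV list.
    n = float(samplerows)
    N = float(totalrows)
    numer = n * (N - n)
    denom = N - n + 0.04 * n * (N - 1)
    if denom == 0:                      # possible if n = N = 1
        return 0.0
    return numer / denom

def relative_standard_error(cnt, samplerows, totalrows):
    # SE/p for the sample proportion p = cnt/n, using the normal
    # approximation with the finite population correction, as in the
    # comment in the patch.
    n, N = float(samplerows), float(totalrows)
    p = cnt / n
    se = math.sqrt(p * (1.0 - p) / n) * math.sqrt((N - n) / (N - 1.0))
    return se / p

# A value seen exactly mincount times has SE/p = 0.2, the bound never
# exceeds 25 occurrences, and sampling the whole table gives no bound.
mc = mincount_for_mcv_list(30000, 1000000)
print(relative_standard_error(mc, 30000, 1000000))  # ~0.2
print(mincount_for_mcv_list(30000, 10**9))          # just under 25
print(mincount_for_mcv_list(300, 300))              # 0.0
```

Setting SE/p = 0.2 and solving for cnt reproduces the cnt > n*(N-n) / (N-n+0.04*n*(N-1)) bound exactly, so the two functions agree at the boundary to floating-point precision.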