(Starting a new thread so as not to distract review)

On 1/21/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
> On 21 January 2018 at 07:26, John Naylor <jcnay...@gmail.com> wrote:
>> I spent a few hours hacking on this, and it turns out calculating the
>> right number of MCVs taking into account both uniform and highly
>> non-uniform distributions is too delicate a problem for me to solve
>> right now. The logic suggested by Dean Rasheed in [1] always produces
>> no MCVs for a perfectly uniform distribution (which is good), but very
>> often also for other distributions, which is not good. My efforts to
>> tweak that didn't work, so I didn't get as far as adapting it for the
>> problem Jeff is trying to solve.
>
> Hmm, Tom suggested that the test based on the average frequency over
> all values might be too strict because the estimated number of
> distinct values is often too low, so that might explain what you're
> seeing.
In my test tables, I've noticed that our Ndistinct estimator is most
inaccurate for geometric distributions, so that's certainly possible.
Confusingly, though, it occasionally gave an empty MCV list along with a
histogram in which one boundary value was duplicated 5 times, which I
thought I was guarding against. I suspect my implementation of your
logic is flawed somehow. In case you're curious, I've attached my rough
(compiler warnings and all) test patch.

> It occurs to me that maybe a better test to exclude a value from the
> MCV list would be to demand that its relative standard error not be
> too high. Such a test, in addition to the existing tests, might be
> sufficient to solve the opposite problem of too many values in the MCV
> list, because the real problem there is including a value after having
> seen relatively few occurrences of it in the sample, and thus having a
> wildly inaccurate estimate for it. Setting a bound on the relative
> standard error would mean that we could have a reasonable degree of
> confidence in estimates produced from the sample.

If you don't mind, what would the math look like for that?

-John Naylor
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5f21fcb..da21333 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2318,6 +2318,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 	int			num_mcv = stats->attr->attstattarget;
 	int			num_bins = stats->attr->attstattarget;
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
+	double		N,
+				n;
 
 	values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
 	tupnoLink = (int *) palloc(samplerows * sizeof(int));
@@ -2525,10 +2527,10 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 */
 		int			f1 = ndistinct - nmultiple + toowide_cnt;
 		int			d = f1 + nmultiple;
-		double		n = samplerows - null_cnt;
-		double		N = totalrows * (1.0 - stats->stanullfrac);
 		double		stadistinct;
 
+		n = samplerows - null_cnt;
+		N = totalrows * (1.0 - stats->stanullfrac);
 		/* N == 0 shouldn't happen, but just in case ... */
 		if (N > 0)
 			stadistinct = (n * d) / ((n - f1) + f1 * n / N);
@@ -2558,9 +2560,44 @@ compute_scalar_stats(VacAttrStatsP stats,
	 * we are able to generate a complete MCV list (all the values in the
	 * sample will fit, and we think these are all the ones in the table),
	 * then do so.  Otherwise, store only those values that are
-	 * significantly more common than the (estimated) average.  We set the
-	 * threshold rather arbitrarily at 25% more than average, with at
-	 * least 2 instances in the sample.  Also, we won't suppress values
+	 * significantly more common than the (estimated) average.
+	 *
+	 * Note: For this POC patch, the implementation and comments were
+	 * copied from an email from Dean Rasheed, which contains further
+	 * references:
+	 * https://www.postgresql.org/message-id/CAEZATCVu9zK0N%3Dnd9ufavabbM8YZiyWYJca0oiE8F31GAY%2B_XA%40mail.gmail.com
+	 *
+	 * We calculate the threshold from the table and sample sizes.  The
+	 * initial rule of thumb is that the value should occur at least 10
+	 * times in the sample.
+	 *
+	 * Suppose that N is the population size (total number of rows in
+	 * the table), and n is the sample size, and that some particular
+	 * candidate value appears x times in the sample.  Then the "sample
+	 * proportion" is given by p = x/n.
+	 *
+	 * It is reasonable to treat p as having a normal distribution,
+	 * which then allows the margin of error to be analysed using
+	 * standard techniques.  We calculate the standard error of the
+	 * sample proportion:
+	 *
+	 *     SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+	 *
+	 * The second term is a finite population correction.  There is a
+	 * 95% probability that the total population proportion lies in the
+	 * range
+	 *
+	 *     [ pmin = p-2*SE, pmax = p+2*SE ]
+	 *
+	 * If there are Nd distinct values in the table, so that the average
+	 * frequency of occurrence of any particular value is 1/Nd, then the
+	 * test
+	 *
+	 *     pmin > 1/Nd
+	 *
+	 * would imply that there is a 97.5% probability that the candidate
+	 * value is more common than the average (since the 5% of the
+	 * distribution of p outside the confidence interval is split evenly
+	 * below pmin and above pmax).
+	 *
+	 * Also, we won't suppress values
	 * that have a frequency of at least 1/K where K is the intended
	 * number of histogram bins; such values might otherwise cause us to
	 * emit duplicate histogram bin boundaries.  (We might end up with
@@ -2587,28 +2624,36 @@ compute_scalar_stats(VacAttrStatsP stats,
 			else
 			{
 				double		ndistinct_table = stats->stadistinct;
-				double		avgcount,
-							mincount,
-							maxmincount;
+				double		avg_freq,
+							hist_width,
+							p,
+							pmin,
+							SE;
 
 				/* Re-extract estimate of # distinct nonnull values in table */
 				if (ndistinct_table < 0)
 					ndistinct_table = -ndistinct_table * totalrows;
-				/* estimate # occurrences in sample of a typical nonnull value */
-				avgcount = (double) nonnull_cnt / ndistinct_table;
-				/* set minimum threshold count to store a value */
-				mincount = avgcount * 1.25;
-				if (mincount < 2)
-					mincount = 2;
-				/* don't let threshold exceed 1/K, however */
-				maxmincount = (double) values_cnt / (double) num_bins;
-				if (mincount > maxmincount)
-					mincount = maxmincount;
+				/* Compute average frequency */
+				avg_freq = (double) 1 / (double) ndistinct_table;
+
+				/*
+				 * Compute 1/K.  If a value has a higher frequency and is
+				 * not included in the MCV list, we might emit duplicate
+				 * histogram bin boundaries.
+				 */
+				hist_width = (double) 1 / (double) num_bins;
+
 				if (num_mcv > track_cnt)
 					num_mcv = track_cnt;
+
+				/* Test values for minimum threshold frequency */
 				for (i = 0; i < num_mcv; i++)
 				{
-					if (track[i].count < mincount)
+					p = (double) track[i].count / (double) n;
+					SE = sqrt((double) p * (1 - p) * (N - n) / (double) (n * (N - 1)));
+					pmin = p - 2 * SE;
+
+					if ((pmin < avg_freq && pmin < hist_width)
+					//~ if ((pmin < avg_freq )
+						|| track[i].count < 10)
					{
						num_mcv = i;
						break;