MCV lists for highly skewed distributions

Jeff Janes Wed, 27 Dec 2017 17:47:00 -0800

I want to revive a patch I sent  couple years ago to the performance list,
as I have seen the issue pop up repeatedly since then.

Original thread is here:
https://www.postgresql.org/message-id/CAK_s-G2tc8r0N3AqPr8fW5QsRQMbZNurgAQ%3D_ME1aaf4vOmnnA%40mail.gmail.com

The problem is that if you have a highly skewed distribution, such as
exponential frequencies, it usually stores too few entries in the MCV list.

For example:

create table foo as select floor(-log(random())/log(2))::int as who from
generate_series(1,10000000);

This will usually store 0 to 5 as MCV with the default stat target[1].
Then value 6 is not stored, and its ~75k repetitions get averaged over the
remaining distinct values. This leads to horrible misplanning for rare
values. For example, who=17 has 37 entries, but is estimated to have
20,000. The correct join for 37 values is often not the correct one for
20,000.

If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.

As it is, you can partially overcome the too short MCV list by cranking up
the default statistics target, but this is a weak effect and comes at a
high cost of CPU time. In some of the real distributions I've looked at,
cranking up default statistics target is almost entirely ineffective.

I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one. After all, things
included in the MCV don' get represented in the histogram. When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.

[1] Occasionally it will store a much longer MCV list, because no values
was sampled exactly once, which triggers a different code path in which all
seen values are put in the MCV and the histogram is NULL. This is not
reliable, as whether the least-sample value is present in the sample once
or twice is pretty brittle.

Cheers,

Jeff

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f952b3c..b02ac20 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2575,7 +2575,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * the MCV list is not complete, it's generally worth being more
 		 * selective, and not just filling it all the way up to the stats
 		 * target.  So for an incomplete list, we try to take only MCVs that
-		 * are significantly more common than average.
+		 * are significantly more common than average frequency for values which
+		 * will not be represented in the MCV list.
 		 */
 		if (track_cnt == ndistinct && toowide_cnt == 0 &&
 			stats->stadistinct > 0 &&
@@ -2587,6 +2588,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 		{
 			double		ndistinct_table = stats->stadistinct;
+			double		values_cnt_remaining = (double) values_cnt;
 			double		avgcount,
 						mincount,
 						maxmincount;
@@ -2594,25 +2596,27 @@ compute_scalar_stats(VacAttrStatsP stats,
 			/* Re-extract estimate of # distinct nonnull values in table */
 			if (ndistinct_table < 0)
 				ndistinct_table = -ndistinct_table * totalrows;
-			/* estimate # occurrences in sample of a typical nonnull value */
-			avgcount = (double) nonnull_cnt / ndistinct_table;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
-			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) values_cnt / (double) num_bins;
-			if (mincount > maxmincount)
-				mincount = maxmincount;
-			if (num_mcv > track_cnt)
-				num_mcv = track_cnt;
-			for (i = 0; i < num_mcv; i++)
-			{
+			/* estimate # of occurrences in sample of a typical nonnull value */
+  			for (i = 0; i < num_mcv; i++)
+  			{
+				avgcount = values_cnt_remaining / ndistinct_table;
+				/* set minimum threshold count to store a value */
+				mincount = avgcount * 1.25;
+				if (mincount < 2)
+					mincount = 2;
+				/* don't let threshold exceed 1/K, however */
+				maxmincount = (double) values_cnt / (double) (num_bins);
+				if (mincount > maxmincount)
+					mincount = maxmincount;
+				if (num_mcv > track_cnt)
+					num_mcv = track_cnt;
 				if (track[i].count < mincount)
 				{
 					num_mcv = i;
 					break;
 				}
+				values_cnt_remaining -= track[i].count;
+				ndistinct_table--;
 			}
 		}

MCV lists for highly skewed distributions

Reply via email to