On 12/28/17, Jeff Janes <jeff.ja...@gmail.com> wrote: > I want to revive a patch I sent couple years ago to the performance list, > as I have seen the issue pop up repeatedly since then.
> If we stored just a few more values, their inclusion in the MCV would mean > they are depleted from the residual count, correctly lowering the estimate > we would get for very rare values not included in the sample. > > So instead of having the threshold of 1.25x the average frequency over all > values, I think we should use 1.25x the average frequency of only those > values not already included in the MCV, as in the attached. After reading the thread you mentioned, I think that's a good approach. On master, the numerator of avg_count is nonnull_cnt, but you've (indirectly) set it to values_cnt. You didn't mention it here, but I think you're right and master is wrong, since in this function only sortable values go through the MCV path. If I understand the code correctly, if there are enough too-wide values in the sample, they could bias the average and prevent normal values from being considered for the MCV list. Am I off base here? Anyway, since you're now overwriting ndistinct_table, I would rename it to ndistinct_remaining for clarity's sake. This part doesn't use any loop variables, so should happen before the loop: + if (num_mcv > track_cnt) + num_mcv = track_cnt; I think this comment /* estimate # of occurrences in sample of a typical nonnull value */ belongs just above the calculation of avg_count. > I think that perhaps maxmincount should also use the dynamic > values_cnt_remaining rather than the static one. After all, things > included in the MCV don' get represented in the histogram. When I've seen > planning problems due to skewed distributions I also usually see redundant > values in the histogram boundary list which I think should be in the MCV > list instead. But I have not changed that here, pending discussion. I think this is also a good idea, but I haven't thought it through. If you don't go this route, I would move this section back out of the loop as well. -John Naylor