On Fri, Dec 4, 2015 at 6:48 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
> <oleksandr.shul...@zalando.de> wrote:
> >
> > What I have found is that in a significant percentage of instances, when a
> > duplicate sample value is *not* put into the MCV list, it does produce
> > duplicates in the histogram_bounds, so it looks like the MCV cut-off
> > happens too early, even though we have enough space for more values in
> > the MCV list.
> >
> > In the extreme cases I've found completely empty MCV lists and histograms
> > full of duplicates at the same time, with only about 20% of distinct
> > values in the histogram (as it turns out, this happens due to a high
> > fraction of NULLs in the sample).
>
> Wow, this is very interesting work. Using values_cnt rather than
> samplerows to compute avgcount seems like a clear improvement. It
> doesn't make any sense to raise the threshold for creating an MCV
> based on the presence of additional nulls or too-wide values in the
> table. I bet compute_distinct_stats needs a similar fix.

Yes, and there's also the magic 1.25 multiplier in that code. I think it
would make sense to agree first on what exactly the patch for
compute_scalar_stats() should look like, then port the relevant bits to
compute_distinct_stats().

> But for plan stability considerations, I'd say we should back-patch this
> all the way, but those considerations might mitigate for a more
> restrained approach. Still, maybe we should try to sneak at least this
> much into 9.5 RSN, because I have to think this is going to help people
> with mostly-NULL (or mostly-really-wide) columns.

I'm not sure. People would likely have complained or found this out on
their own if they were seriously affected. What I would be interested in
is people running the queries I've shown on their own data, to see if
there are any interesting/unexpected patterns.

> As far as the rest of the fix, your code seems to remove the handling
> for ndistinct < 0. That seems unlikely to be correct, although it's
> possible that I am missing something.

The difference here is that ndistinct at this scope in the original code
hid a variable from an outer scope. That outer one could be < 0, but in
my code there is no inner-scope ndistinct: we refer to the outer-scope
variable, which cannot be < 0.

> Aside from that, the rest of this seems like a policy change, and I'm
> not totally sure off-hand whether it's the right policy. Having more
> MCVs can increase planning time noticeably, and the point of the
> existing cutoff is to prevent us from choosing MCVs that aren't
> actually "C". I think this change significantly undermines those
> protections. It seems to me that it might be useful to evaluate the
> effects of this part of the patch separately from the samplerows ->
> values_cnt change.

Yes, that's why I was wondering whether a frequency cut-off approach
might be helpful here. I'm going to have a deeper look at the array
typanalyze implementation at the least.

--
Alex
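P.S. The arithmetic behind the values_cnt vs. samplerows point can be
sketched as follows. This is a toy illustration in Python, not the actual
analyze.c code, and all of the numbers (sample size, NULL fraction,
ndistinct) are hypothetical:

```python
# Toy illustration: dividing by samplerows instead of values_cnt inflates
# the MCV admission threshold when a large fraction of the sample is NULL.

def mcv_threshold(rows, ndistinct, magic=1.25):
    # A candidate value must appear more often than 1.25x the average
    # count to be considered "common"; this mirrors the magic multiplier
    # mentioned above.
    return magic * rows / ndistinct

samplerows = 30000                  # total rows sampled
null_cnt = 24000                    # 80% NULLs, as in the extreme cases
values_cnt = samplerows - null_cnt  # non-NULL values actually seen: 6000
ndistinct = 100                     # distinct non-NULL values in the sample

old = mcv_threshold(samplerows, ndistinct)  # counts NULLs in the average
new = mcv_threshold(values_cnt, ndistinct)  # counts only real values

print(old, new)  # -> 375.0 75.0
```

A value seen, say, 200 times is clearly common among the 6000 non-NULL
values, yet it fails the old 375 cut-off while easily passing the new 75
cut-off; that is the mechanism behind the empty MCV lists described
above.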
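P.P.S. To make the frequency cut-off idea concrete, here is a minimal
sketch of the kind of rule I have in mind: admit a value based on its
share of the non-NULL values rather than by comparison with the average
count. It is only loosely inspired by the array typanalyze approach, and
the min_freq value is made up for illustration:

```python
# Hypothetical frequency-based MCV cut-off (the real typanalyze code is
# more involved).  A value is admitted if its share of the non-NULL
# values in the sample reaches min_freq, so extra NULLs in the sample
# cannot dilute the threshold.

def mcv_by_frequency(counts, values_cnt, min_freq=0.01):
    """counts: dict mapping sample value -> occurrences among non-NULL rows."""
    common = [(v, c) for v, c in counts.items() if c / values_cnt >= min_freq]
    common.sort(key=lambda vc: -vc[1])  # most common first, as in the MCV list
    return [v for v, _ in common]

counts = {'a': 200, 'b': 120, 'c': 40, 'd': 3}
values_cnt = 6000  # non-NULL values seen in the sample
print(mcv_by_frequency(counts, values_cnt))  # -> ['a', 'b']
```

With min_freq = 0.01, 'a' (200/6000) and 'b' (120/6000) qualify while
'c' and 'd' do not, regardless of how many NULL rows the sample
contained.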