On Fri, Dec 4, 2015 at 6:48 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
> <oleksandr.shul...@zalando.de> wrote:
> >
> > What I have found is that in a significant percentage of instances, when
> > a duplicate sample value is *not* put into the MCV list, it does produce
> > duplicates in the histogram_bounds, so it looks like the MCV cut-off
> > happens too early, even though we have enough space for more values in
> > the MCV list.
> >
> > In the extreme cases I've found completely empty MCV lists and histograms
> > full of duplicates at the same time, with only about 20% of distinct
> > values in the histogram (as it turns out, this happens due to a high
> > fraction of NULLs in the sample).
>
> Wow, this is very interesting work.  Using values_cnt rather than
> samplerows to compute avgcount seems like a clear improvement.  It
> doesn't make any sense to raise the threshold for creating an MCV
> based on the presence of additional nulls or too-wide values in the
> table.  I bet compute_distinct_stats needs a similar fix.


Yes, and there's also the magic 1.25 multiplier in that code.  I think it
would make sense to agree first on how exactly the patch for
compute_scalar_stats() should look, then port the relevant bits to
compute_distinct_stats().


> I'd say we should back-patch this all the way, but plan stability
> considerations might militate for a more restrained approach.  Still,
> maybe we should try to sneak at least this much into 9.5 RSN, because I
> have to think this is going to help people with mostly-NULL (or
> mostly-really-wide) columns.
>

I'm not sure.  If people were seriously affected, they would likely have
complained or figured this out on their own by now.

What I would be interested in is having people run the queries I've shown
on their own data, to see if there are any interesting or unexpected
patterns.

> As far as the rest of the fix, your code seems to remove the handling
> for ndistinct < 0.  That seems unlikely to be correct, although it's
> possible that I am missing something.


The difference here is that in the original code, ndistinct at this scope
shadowed a variable from an outer scope.  That one could be < 0, but in my
code there is no inner-scope ndistinct: we refer to the outer-scope
variable, which cannot be < 0.


> Aside from that, the rest of
> this seems like a policy change, and I'm not totally sure off-hand
> whether it's the right policy.  Having more MCVs can increase planning
> time noticeably, and the point of the existing cutoff is to prevent us
> from choosing MCVs that aren't actually "C".  I think this change
> significantly undermines those protections.  It seems to me that it
> might be useful to evaluate the effects of this part of the patch
> separately from the samplerows -> values_cnt change.
>

Yes, that's why I was wondering whether a frequency cut-off approach might
be helpful here.  I'm going to take a deeper look at the array typanalyze
implementation, at the least.

--
Alex
