Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
> So I'm not sure I understand what would be the risk with this ... Tom,
> can you elaborate why you dislike the patch?
I've got a couple issues with the patch as presented.

* As you said, it creates discontinuous behavior for stanullfrac = 1.0
versus stanullfrac = 1.0 - epsilon.  That doesn't seem good.

* It's not apparent why, if ANALYZE's sample is all nulls, we wouldn't
conclude stadistinct = 0 and thus arrive at the desired answer that way.
(Since we have a complaint, I'm guessing that ANALYZE might disbelieve
its own result and stick in some larger stadistinct.  But then maybe
that's where to fix this, not here.)

* We generally disbelieve edge-case estimates to begin with.  The most
obvious example is that we don't accept rowcount estimates that are
zero.  There are also some clamps that disbelieve selectivities
approaching 0.0 or 1.0 when estimating from a histogram, and I think we
have a couple other similar rules.  The reason for this is mainly that
taking such estimates at face value creates too much risk of severe
relative error due to imprecise or out-of-date statistics.  So a
special case for stanullfrac = 1.0 seems to go directly against that
mindset.

I agree that there might be some gold to be mined in this area, as we
haven't thought particularly hard about high-stanullfrac situations.
One idea is to figure what stanullfrac says about the number of
non-null rows, and clamp the get_variable_numdistinct result to be not
more than that.  But I still would not want to trust an exact zero
result.

			regards, tom lane
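[Editor's note: to make the clamp idea in the last paragraph concrete, here is
a minimal standalone sketch of the arithmetic being proposed.  It is not the
actual selfuncs.c code; the function name, parameters, and call site are
hypothetical, and where exactly such a clamp would fit inside
get_variable_numdistinct() is left open.]

    #include <stdio.h>

    /*
     * Illustrative sketch only: cap an ndistinct estimate at the number of
     * non-null rows implied by stanullfrac, while still refusing to trust
     * an exact-zero result (per the usual skepticism about edge-case
     * estimates).
     */
    static double
    clamp_ndistinct(double ndistinct, double ntuples, double stanullfrac)
    {
        double      nonnull_rows = ntuples * (1.0 - stanullfrac);

        if (nonnull_rows < 1.0)
            nonnull_rows = 1.0;     /* disbelieve "no non-null rows at all" */
        if (ndistinct > nonnull_rows)
            ndistinct = nonnull_rows;
        if (ndistinct < 1.0)
            ndistinct = 1.0;        /* never return an exact zero */
        return ndistinct;
    }

    int
    main(void)
    {
        /* 1,000,000 rows, 99.99% nulls: ndistinct capped at ~100 */
        printf("%g\n", clamp_ndistinct(200000.0, 1000000.0, 0.9999));
        /* all-null statistics: report 1 distinct value, not 0 */
        printf("%g\n", clamp_ndistinct(0.0, 1000000.0, 1.0));
        return 0;
    }

The second call shows the behavior the sketch assumes for the case the patch
targets: a column that ANALYZE saw as entirely null still yields a floor of
one distinct value rather than zero.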