On 2023/6/17 06:46, Tom Lane wrote:
> Quan Zongliang <quanzongli...@yeah.net> writes:
>> Perhaps we should discard this (dups_cnt > 1) restriction?
>
> That's not going to happen on the basis of one test case that you
> haven't even shown us. The implications of doing it are very unclear.
> In particular, I seem to recall that there are bits of logic that
> depend on the assumption that MCV entries always represent more than
> one row. The nmultiple calculation Tomas referred to may be failing
> because of that, but I'm worried about there being other places.
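For reference, the restriction in question is the dups_cnt > 1 gate in
compute_scalar_stats() (src/backend/commands/analyze.c): a sampled
value is considered for the MCV list only if it occurs more than once
in the sample. A simplified standalone sketch of just that gate (my
illustration, not the actual analyze.c code, which tracks candidates
in a fixed-size array and applies further frequency tests):

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical sorted sample of a nearly-unique column */
        int sample[] = {1, 2, 2, 3, 4, 5, 5, 6, 7, 8};
        int n = sizeof(sample) / sizeof(sample[0]);

        for (int i = 0; i < n; )
        {
            /* count consecutive duplicates in the sorted sample */
            int dups_cnt = 1;
            while (i + dups_cnt < n && sample[i + dups_cnt] == sample[i])
                dups_cnt++;

            if (dups_cnt > 1)   /* the restriction under discussion */
                printf("MCV candidate: %d (count %d)\n",
                       sample[i], dups_cnt);
            i += dups_cnt;
        }
        return 0;
    }

On a nearly-unique column almost nothing passes this gate, so the MCV
list stays empty or very short even when a few values do repeat.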
The statistics for the other table look like this:

    stadistinct | 6
    stanumbers1 | {0.50096667,0.49736667,0.0012}
    stavalues1  | {v22,v23,v5}
The values that appear twice in the small table (v1 and v2) do not
appear here. And stadistinct is wrong: its true value is 18, not 6
(three of the small table's values do not appear here either).
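As a cross-check on the numbers used below: if the small table has 23
rows and its two MCVs (v1 and v2) each account for 2 rows, then the
non-MCV fraction is otherfreq2 = 1 - 4/23 = 19/23 = 0.8260869..., which
is exactly the otherfreq2 value that shows up in the calculation.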
When calculating the selectivity, this branch in eqjoinsel_inner()
(src/backend/utils/adt/selfuncs.c) is reached:

    if (nd2 > sslot2->nvalues)
        totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);
With these inputs:

    totalsel1       = 0 (before this branch)
    nd2             = 21
    sslot2->nvalues = 2
    unmatchfreq1    = 0.99990002016420476
    otherfreq2      = 0.82608695328235626

the result is:

    totalsel1 = 0.99990002016420476 * 0.82608695328235626 / (21 - 2)
              = 0.043473913749706022
    rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800
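To make the failure mode easy to reproduce, here is a minimal
standalone sketch of the arithmetic (mine, not the PostgreSQL source;
the variable names follow eqjoinsel_inner()):

    #include <stdio.h>

    int main(void)
    {
        /* numbers taken from the example above */
        double unmatchfreq1 = 0.99990002016420476; /* unmatched MCV freq, side 1 */
        double otherfreq2   = 0.82608695328235626; /* non-MCV freq, side 2 */
        double nd2          = 21.0;                /* distinct values, side 2 */
        int    nvalues2     = 2;                   /* MCV entries, side 2 */
        double totalsel1    = 0.0;

        /* the branch quoted above: spread side 1's unmatched MCV mass
         * evenly over side 2's non-MCV distinct values */
        if (nd2 > nvalues2)
            totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - nvalues2);

        printf("totalsel1 = %.18f\n", totalsel1);          /* ~0.04347 */
        printf("rows = %.0f\n", totalsel1 * 23 * 2000000); /* ~1999800 */
        return 0;
    }

Nearly all of the large table's MCV mass (unmatchfreq1 ~ 1.0) gets
spread over only nd2 - nvalues2 = 19 distinct values, so the estimate
balloons to about two million rows.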
> Basically, you're proposing a rather fundamental change in the rules
> by which Postgres has gathered statistics for decades. You need to
> bring some pretty substantial evidence to support that. The burden
> of proof is on you, not on the status quo.
>
> regards, tom lane