One more thought about estimating the worst case - I wonder if simply
multiplying the per-tuple cost by 1.5 is the right approach. It does not
seem particularly principled, and it's trivially simple to construct
counter-examples defeating it (imagine columns with 99% of the rows
having the same value, and then many values in exactly one row - that
will defeat the ndistinct-based group size estimates).

Agreed, but right now that is the best we have, and the constant 1.5 could easily be changed to 1 or 10 - it is just an arbitrary value, chosen intuitively. As I mentioned in the v7 patch description, the new estimation function gives a ~10% larger estimate, and this constant should not be very large, because we could easily end up with an overestimate.
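To put rough numbers on that counter-example (my own illustration, not figures from the thread): with N = 1,000,000 rows where 99% share a single value and the remaining 10,000 rows each carry a distinct value, ndistinct is about 10,001, so the ndistinct-based average group size is ~100 rows, while the largest actual group holds 990,000 rows - no small constant multiplier covers that gap.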


The reason why constructing such counter-examples is so simple is that
this relies entirely on the ndistinct stats, ignoring the other stats we
already have. I think this might leverage the per-column MCV lists (and
eventually multi-column MCV lists, if that ever gets committed).

We're estimating the number of tuples in a group (after fixing values in
the preceding columns), because that's what determines the number of
comparisons we need to make at a given stage. The patch does this by
estimating the average group size: it estimates the ndistinct of the
preceding columns, G(i-1), and computes ntuples/G(i-1).
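
A minimal standalone sketch of that per-stage estimate (not code from the patch; ndistinct_prefix stands for G(i-1) and would come from the planner's statistics):

    /*
     * Illustrative sketch only: average group size seen when comparing
     * the i-th sort key, given G(i-1), the estimated number of distinct
     * combinations of the preceding columns.
     */
    static double
    avg_group_size(double ntuples, double ndistinct_prefix)
    {
        /* guard against nonsensical ndistinct estimates */
        if (ndistinct_prefix < 1.0)
            ndistinct_prefix = 1.0;

        /* ntuples / G(i-1): tuples sharing one prefix of i-1 key values */
        return ntuples / ndistinct_prefix;
    }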

But consider that we also have MCV lists for each column - if there is a
very common value, it should be there. So let's say M(i) is the frequency
of the most common value in the i-th column, determined either from the
MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum
of M(i) over the first i columns, the largest possible group of values
when comparing values in the i-th column is

     N * m(i-1)

This seems like a better way to estimate the worst case. I don't know if
this should be used instead of NG(i), or if those two estimates should
be combined in some way.

Agreed, we definitely need an estimate of the largest group size to use in cost_sort. But I don't feel up to doing that myself; please, could you attack the group size computation issue? Thank you.
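
For illustration only, a sketch of the MCV-based bound under the assumptions above (not code from the patch; mcv_freq is a hypothetical array holding M(j) for each column, taken from its MCV list or 1/ndistinct when no MCV list is available):

    /*
     * Illustrative sketch only: the largest group that can reach the
     * comparison on column i is N * m(i-1), where m(i-1) is the minimum
     * of M(j) over the first i-1 columns.
     */
    static double
    worst_case_group_size(double ntuples, const double *mcv_freq, int nprefix)
    {
        double  m = 1.0;

        for (int j = 0; j < nprefix; j++)
        {
            if (mcv_freq[j] < m)
                m = mcv_freq[j];
        }

        return ntuples * m;
    }

Whether such a bound should replace the ndistinct-based average or be combined with it is exactly the open question above.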


I think estimating the largest group we need to sort should be helpful
for the incremental sort patch, so I'm adding Alexander to this thread.

Agreed, I think so. The suggested estimation algorithm should be modified to support leading preordered keys, and in that form it could be used directly in both the GROUP BY and incremental sort patches.
--
Teodor Sigaev                                   E-mail: teo...@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
