On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
The current implementation of multi-column MCV lists (added in this
cycle) uses a fairly simple algorithm to pick combinations to include in
the MCV list. We just compute a minimum number of occurrences, and then
include all entries sampled more often than that. See get_mincount_for_mcv_list().
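(For reference, the cutoff there is derived by bounding the relative standard
error of the sample frequency of each item; if I remember the derivation
correctly, with a sample of n rows from a table of N rows it works out to
roughly

    mincount = n * (N - n) / (N - n + 0.04 * n * (N - 1))

which approaches ~25 occurrences in the sample once the table is much larger
than the sample.)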
[snip]
This however means that by choosing the MCV entries solely based on the
number of occurrences in the sample, we end up with MCV lists where the vast
majority of entries have a NULL street name.
That's why we got such a poor estimate in the example query, despite the
fact that the city/street combination is the most frequent one in the data
set (among those with a non-NULL street name).
I think the fact that they're NULL is a bit of a red herring because
we're treating NULL just like any other value. The same thing would
happen if there were some other very common non-NULL value that
dominated the dataset.
I wasn't really suggesting the NULL is an issue - sorry if that wasn't
clear. It might be any other value, as long as it's very common (and
roughly uniform) in all cities. So yes, I agree with you here.
The other weird thing is that the frequency of NULL street names is fairly
uniform across the whole data set. In total, about 50% of the addresses match that,
and for individual cities it's generally between 25% and 100%, so the
estimate is less than 2x off in those cases.
But for addresses with non-NULL street names, the situation is very
different. Some street names are unique to a single city, etc.
Overall, this means we end up with an MCV list whose entries represent
the mostly-uniform part of the data set, instead of preferring the
entries that are truly skewed.
So I'm wondering if we could/should rethink the algorithm to look at
the frequency and base_frequency, and maybe pick entries based on their
ratio, or something like that.
Hmm, interesting. I think I would like to see a more rigorous
justification for changing the algorithm that decides which values to
keep.
Sure, I'm not going to pretend my proposals were particularly rigorous; they
were more a collection of random ideas.
If I've understood correctly, I think the problem is this: The
mincount calculation is a good way of identifying MCV candidates to
keep, because it ensures that we don't keep values that don't appear
sufficiently often to produce accurate estimates, and ideally we'd
keep everything with count >= mincount. However, in the case where
there are more than stats_target items with count >= mincount, simply
ordering by count and keeping the most commonly seen values isn't
necessarily the best strategy in the case of multivariate statistics.
Yes.
To identify what the best strategy might be, I think you need to
examine the errors that would occur as a result of *not* keeping a
value in the multivariate MCV list. Given a value that appears with
count >= mincount, N*freq ought to be a reasonable estimate for the
actual number of occurrences of that value in the table, and
N*base_freq ought to be a reasonable estimate for the
univariate-stats-based estimate that it would be given if it weren't
kept in the multivariate MCV list. So the absolute error resulting
from not keeping that value would be
N * Abs(freq - base_freq)
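For example (with made-up numbers): take N = 1,000,000 rows and a group with
freq = 0.02 but base_freq = 0.0001. Dropping it from the list means the
planner estimates ~100 rows instead of ~20,000, i.e. an absolute error of

    N * Abs(freq - base_freq) = 1000000 * Abs(0.02 - 0.0001) = 19900

rows.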
Yeah. Considering N is the same for all groups in the sample, this
would mean the same thing as Abs(freq - base_freq).
But then I think we ought to take into account how often we're likely
to get that error. If we're simply picking values at random, the
likelihood of getting that value is just its frequency, so the average
absolute error would be
Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )
which suggests that, if we wanted to reduce the average absolute error
of the estimates, we should order by freq*Abs(freq-base_freq) and keep
the top n in the MCV list.
Interesting idea. But I'm not sure it makes much sense to assume the rows
are picked randomly - OTOH we don't really know anything about how the
data will be queried, so we may just as well do that.
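FWIW the two orderings really can disagree. With made-up frequencies: a group
with freq = 0.30 and base_freq = 0.25 scores

    Abs(freq - base_freq) = 0.05      freq * Abs(freq - base_freq) = 0.015

while a rarer (anti-correlated) group with freq = 0.02 and base_freq = 0.10
scores

    Abs(freq - base_freq) = 0.08      freq * Abs(freq - base_freq) = 0.0016

so the unweighted ordering would keep the rare group first, the weighted one
the common group.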
On the other hand, if we wanted to reduce the average *relative* error
of the estimates, we might instead order by Abs(freq-base_freq).
Hmmm, yeah. I don't know what the right choice is here, TBH.
For example, we might sort the entries by
abs(freq - base_freq) / freq
I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
though, because that would seem likely to put too much weight on the
least commonly occurring values.
But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both
... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
and
... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by six
orders of magnitude. I think the MCV algorithm should reflect this.
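Just to spell it out: the Abs(freq - base_freq)/freq ordering proposed above
does capture this - the first case scores roughly 1000000/1000001 ~= 1.0
while the second scores 1000000/2000000 = 0.5, so the dangerous misestimate
ranks higher even though the absolute difference is identical in both.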
Of course, this is a challenging problem, for a couple of reasons.
Firstly, picking the most frequent groups is very simple, and it
gives us additional information about the largest group (which may be
useful elsewhere, e.g. the incremental sort patch).
Yes, you would have to keep in mind that changing the algorithm would
mean that the MCV list no longer represented all the most common
values. For example, it would no longer be valid to assume that no
value appeared more often than the first value in the MCV list. I'm
not sure that we currently do that though.
We don't, but I've proposed using some of this knowledge in the
incremental sort patch. But I think we might actually extend the MCV list
to store some extra information (e.g. the frequency of the largest groups,
even if we don't store the groups themselves).
Secondly, if the correlated combinations are less frequent, how reliably
can we even estimate the frequency from a sample? The combination in the
example query was ~0.02% of the data, so how likely is it to be sampled?
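(For scale: with the default statistics target of 100 the sample has
300 * 100 = 30000 rows, so a combination covering 0.02% of the table is
expected to show up only about 30000 * 0.0002 = 6 times - well below the
~25-row mincount cutoff for a table much larger than the sample.)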
I think that's OK as long as we keep the mincount filter in the new
algorithm. I think some experimentation is definitely worthwhile here,
but it looks plausible that a decent approach might be:
1). Discard all values with count < mincount.
2). Order by freq*Abs(freq-base_freq) (or possibly just
Abs(freq-base_freq)) and keep the top n, where n is the stats target.
3). Perhaps re-sort by count, so the final list is in frequency order
again? Not sure if that's a desirable property to keep.
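Something like the sketch below, perhaps. It's only meant to illustrate the
three steps - the struct and function names are made up and it's not the
actual mcv.c code:

#include <math.h>
#include <stdlib.h>

/* Hypothetical MCV candidate item -- not the actual mcv.c structures. */
typedef struct McvCandidate
{
    double      freq;       /* sample frequency of the group */
    double      base_freq;  /* product of per-column frequencies */
    int         count;      /* occurrences in the sample */
} McvCandidate;

/* Order by freq * Abs(freq - base_freq), descending. */
static int
cmp_error(const void *a, const void *b)
{
    const McvCandidate *ca = (const McvCandidate *) a;
    const McvCandidate *cb = (const McvCandidate *) b;
    double      ea = ca->freq * fabs(ca->freq - ca->base_freq);
    double      eb = cb->freq * fabs(cb->freq - cb->base_freq);

    if (ea > eb)
        return -1;
    if (ea < eb)
        return 1;
    return 0;
}

/* Order by count, descending, to restore frequency order (step 3). */
static int
cmp_count(const void *a, const void *b)
{
    const McvCandidate *ca = (const McvCandidate *) a;
    const McvCandidate *cb = (const McvCandidate *) b;

    return (cb->count > ca->count) - (cb->count < ca->count);
}

/* Trim the candidate array in place, returning the number of kept items. */
static int
pick_mcv_items(McvCandidate *items, int nitems, double mincount,
               int stats_target)
{
    int         nkeep = 0;
    int         i;

    /* 1) discard everything sampled fewer than mincount times */
    for (i = 0; i < nitems; i++)
    {
        if (items[i].count >= mincount)
            items[nkeep++] = items[i];
    }

    /* 2) order by estimated error, keep the top stats_target items */
    qsort(items, nkeep, sizeof(McvCandidate), cmp_error);
    if (nkeep > stats_target)
        nkeep = stats_target;

    /* 3) re-sort the kept items by count, back to frequency order */
    qsort(items, nkeep, sizeof(McvCandidate), cmp_count);

    return nkeep;
}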
Will try. Thanks for the feedback.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services