On Sun, 13 Jan 2019 at 00:04, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On 1/12/19 8:49 AM, Dean Rasheed wrote:
> > A possible refinement would be to say that if there are more than
> > stats_target items more common than this mincount threshold, rather than
> > excluding the least common ones to get the target number of items,
> > exclude the ones closest to their base frequencies, on the grounds that
> > those are the ones for which the MCV stats will make the least
> > difference. That might complicate the code somewhat though -- I don't
> > have it in front of me, so I can't remember if it even tracks more than
> > stats_target items.
>
> Yes, the patch does limit the number of items to stats_target (a maximum
> of per-attribute stattarget values, to be precise). IIRC that's a piece
> you've added sometime last year ;-)
>
> I've been experimenting with removing items closest to base frequencies
> today, and I came to the conclusion that it's rather tricky for a couple
> of reasons.
>
> 1) How exactly do you measure "closeness" to base frequency? I've tried
> computing the error in different ways, including:
>
>    * Max(freq/base, base/freq)
>    * abs(freq - base)
>
> but this does not seem to affect the behavior very much, TBH.
>
> 2) This necessarily reduces mcv_totalsel, i.e. it increases the part not
> covered by MCV. And estimates on this part are rather crude.
>
> 3) It does nothing for "impossible" items, i.e. combinations that do not
> exist at all. Clearly, those won't be part of the sample, and so can't
> be included in the MCV no matter which error definition we pick. And for
> very rare combinations it might lead to sudden changes, depending on
> whether the group gets sampled or not.
>
> So IMHO it's better to stick to the simple SRE approach for now.
>
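(For the archives, here is a minimal standalone sketch of the pruning you
experimented with -- rank the candidate groups by how far their observed
frequency is from the base frequency, and trim the ones closest to it. It
only shows the Max(freq/base, base/freq) variant, and the struct and names
are simplified stand-ins rather than the actual mcv.c code.)

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for an MCV candidate group (hypothetical struct,
 * not the real MCVItem). */
typedef struct
{
	double		freq;		/* observed frequency in the sample */
	double		base_freq;	/* product of per-column frequencies */
} Item;

/* First error measure from the discussion above: Max(freq/base, base/freq).
 * Larger means the group deviates more from independence, so it is more
 * valuable to keep in the MCV list. */
static double
item_error(const Item *item)
{
	double		r = item->freq / item->base_freq;

	return (r > 1.0 / r) ? r : 1.0 / r;
}

/* qsort comparator: descending by error, so the groups closest to their
 * base frequency end up at the tail and get trimmed away. */
static int
compare_items(const void *a, const void *b)
{
	double		ea = item_error((const Item *) a);
	double		eb = item_error((const Item *) b);

	if (ea > eb)
		return -1;
	if (ea < eb)
		return 1;
	return 0;
}

int
main(void)
{
	Item		items[] = {
		{0.20, 0.04},			/* strongly correlated combination */
		{0.05, 0.049},			/* nearly independent, trimmed first */
		{0.10, 0.02},
		{0.01, 0.03}			/* rarer than independence predicts */
	};
	int			nitems = sizeof(items) / sizeof(items[0]);
	int			stats_target = 3;	/* pretend per-attribute limit */

	qsort(items, nitems, sizeof(Item), compare_items);

	if (nitems > stats_target)
		nitems = stats_target;	/* drop the items closest to base freq */

	for (int i = 0; i < nitems; i++)
		printf("freq=%.3f base=%.3f error=%.3f\n",
			   items[i].freq, items[i].base_freq, item_error(&items[i]));

	return 0;
}

Even with that in hand, your points 2 and 3 still apply: trimming this way
shrinks mcv_totalsel, and it can't do anything for combinations that never
show up in the sample.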
OK, that makes sense.

Regards,
Dean