On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > > Hmmm. So let's consider a simple OR clause with two arguments, both > covered by single statistics object. Something like this: > > CREATE TABLE t (a int, b int); > INSERT INTO t SELECT mod(i, 10), mod(i, 10) > FROM generate_series(1,100000); > CREATE STATISTICS s (mcv) ON a,b FROM t; > > SELECT * FROM t WHERE a = 0 OR b = 0; > > Which is estimated correctly... >
Hmm, the reason that is estimated correctly is that the MCV values cover all values in the table, so mcv_totalsel is 1 (or pretty close to 1), and other_sel is capped to nearly 0, and so the result is basically just mcv_sel -- i.e., in this case the MCV estimates are returned more-or-less as-is, rather than being applied as a correction, and so the result is independent of how many times extended stats are applied. The more interesting (and probably more realistic) case is where the MCV values do not cover the all values in the table, as in the new mcv_lists_partial examples in the regression tests, for example this test case, which produces a significant overestimate: SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0 although actually, I think there's another reason for that (in addition to the extended stats correction being applied twice) -- the way the MCV code makes use of base selectivity doesn't seem really appropriate for OR clauses, because the way base_frequency is computed is really based on the assumption that every column would be matched. I'm not sure that there's any easy answer to that though. I feels like what's needed when computing the match bitmaps in mcv.c, is to produce a bitmap (would it fit in 1 byte?) per value, to indicate which columns match, and base_frequency values per column. That would be significantly more work though, so almost certainly isn't feasible for PG13. > IIUC the problem you have in mind is that we end up calling > statext_mcv_clauselist_selectivity twice, once for the top-level AND > clause with a single argument, and then recursively for the expanded OR > clause. Indeed, that seems to be applying the correction twice. > > > >I'm not sure what's the best way to resolve that. Perhaps > >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause() > >should ignore OR clauses from an AND-list, on the basis that they will > >get processed recursively later. Or perhaps estimatedclauses can > >somehow be used to prevent this, though I'm not sure exactly how that > >would work. > > I don't know. I feel uneasy about just ignoring some of the clauses, > because what happens for complex clauses, where the OR is not directly > in the AND clause, but is negated or something like that? > > Isn't it the case that clauselist_selectivity_simple (and the OR > variant) should ignore extended stats entirely? That is, we'd need to > add a flag (or _simple variant) to clause_selectivity, so that it calls > causelist_selectivity_simple_or. So the calls would look like this: > > clauselist_selectivity > statext_clauselist_selectivity > statext_mcv_clauselist_selectivity > clauselist_selectivity_simple <--- disable extended stats > clause_selectivity > clauselist_selectivity_simple_or > clause_selectivity > clause_selectivity > mcv_clauselist_selectivity > clauselist_selectivity_simple > already estimated > > I've only quickly hacked clause_selectivity, but it does not seems very > invasive (of course, it means disruption to clause_selectivity callers, > but I suppose most call clauselist_selectivity). > Sounds like a reasonable approach, but I think it would be better to preserve the current public API by having clauselist_selectivity() become a thin wrapper around a new function that optionally applies extended stats. Regards, Dean