On Mon, Mar 23, 2020 at 08:21:42AM +0000, Dean Rasheed wrote:
On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.
Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause, which is not great, because it involves
quite a lot of extra work, and also because it can lead to
overestimates because of the way that MCV stats are applied as a delta
correction to simple_sel.
The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.
Hmmm. So let's consider a simple OR clause with two arguments, both
covered by single statistics object. Something like this:
CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000);
CREATE STATISTICS s (mcv) ON a,b FROM t;
SELECT * FROM t WHERE a = 0 OR b = 0;
Which is estimated correctly, but the calls graph looks like this:
clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <---
clauselist_selectivity_simple
clause_selectivity
clauselist_selectivity_or
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <---
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple_or
mcv_clauselist_selectivity
clauselist_selectivity_simple
(already estimated)
IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.
I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.
I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?
Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
causelist_selectivity_simple_or. So the calls would look like this:
clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity
clauselist_selectivity_simple <--- disable extended stats
clause_selectivity
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple
already estimated
I've only quickly hacked clause_selectivity, but it does not seems very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).
BTW do you have an example where this would actually cause an issue?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services