Re: Using multiple extended statistics for estimates

Tomas Vondra Wed, 06 Nov 2019 11:55:30 -0800

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,


PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

  CREATE TABLE t (a int, b int, c int, d int);
  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
  CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the oveall estimate.

All this happens where we'd originally consider applying a single MCV
list, i.e. before even considering the functional dependencies, so
roughly like this:

  while ()
  {
      ... apply another MCV list ...
  }

  ... apply functional dependencies ...


I've both in the loop, but I think that'd be wrong - the MCV list is
expected to contain more information about individual values (compared
to functional deps, which are column-level).


Here is a slightly polished v2 of the patch, the main difference being
that computing clause_attnums was moved to a separate function.

This is a fairly simple patch, and it's not entirely new functionality
(applying multiple statistics was part of the very first patch seris,
although of course in a very different form). So unless there are
objections, I'd like to get this committed sometime next week.

There's room for improvement, of course, for example when handling
overlapping statistics. Consider a table with columns (a,b,c) and two
extended statistics on (a,b) and (b,c), and query with one clause per
column

  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then we
have just a single clause remaining. What we could do is still apply the
(b,c) statistic, using the already-estimated clause on b as a condition.
So essentially we'd compute

   P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the
current patch.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using multiple extended statistics for estimates

Reply via email to