Hi everyone,

thanks for the reviews. Let me sum the feedback so far, and outline my plans for the next patch version that I'd like to submit for CF 2016-11.


1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the syntax is going to be supported until we add support for either join or partial statistics. So something like this:

 CREATE STATISTICS name
   [ WITH (options) ]
   ON (column1, column2 [, ...])
   FROM table

That should be a difficult change.


2) catalog names

I'm not sure what are the best names, so I'm fine with using whatever is the consensus.

That being said, I'm not sure I like extending the catalog to also support non-multivariate statistics (like for example statistics on expressions). While that would be a clearly useful feature, it seems like a slightly different use case and perhaps a separate catalog would be better. So maybe pg_statistic_ext is not the best name.


3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki proposed using something like pg_node_tree, and maybe storing all the statistics in a single value.

I assume the pg_node_tree was meant only as an inspiration how to build pseudo-type on top of a varlena value. I agree that's a good idea, and I plan to do something like that - say adding pg_mcv, pg_histogram, pg_ndistinct and pg_dependencies data types.

Heikki also mentioned that maybe JSONB would be a good way to store the statistics. I don't think so - firstly, it only supports a subset of data types, so we'd be unable to store statistics for some data types (or we'd have to store them as text, which sucks). Also, there's a fair amount of smartness in how the statistics are stored (e.g. how the histogram bucket boundaries are deduplicated, or how the estimation uses the serialized representation directly). We'd lose all of that when using JSONB.

Similarly for storing all the statistics in a single value - I see no reason why keeping the statistics in separate columns would be a bad idea (after all, that's kinda the point of relational databases). Also, there are perfectly valid cases when the caller only needs a particular type of statistic - e.g. when estimating GROUP BY we'll only need the ndistinct coefficients. Why should we force the caller to fetch and detoast everything, and throw away probably 99% of that?

So my plan here is to define pseudo types similar to how pg_node_tree is defined. That does not seem like a tremendous amount of work.


4) functional dependencies

Several people mentioned they don't like how functional dependencies are detected at ANALYZE time, particularly that there's a sudden jump between 0 and 1. Instead, a continuous "dependency degree" between 0 and 1 was proposed.

I'm fine with that, although that makes "clause reduction" (deciding that we don't need to estimate one of the clauses at all, as it's implied by some other clause) impossible. But that's fine, the functional dependencies will still be much less expensive than the other statistics.

I'm wondering how will this interact with transitivity, though. IIRC the current implementation is able to detect transitive dependencies and use that to reduce storage space etc.

In any case, this significantly complicates the functional dependencies, which were meant as a trivial type of statistics, mostly to establish the shared infrastructure. Which brings me to ndistinct.


5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the patch, and the statistic was only built but not used for any sort of estimation. I agree with Dean that perhaps it'd be better to move this to the very beginning, and use it as the simplest statistic to build the infrastructure instead of functional dependencies (which only gets truer due to the changes in functional dependencies, discussed in the preceding section).

I think it's probably a good idea and I plan to do that, so the patch series will probably look like this:

   * 001 - CREATE STATISTICS infrastucture with ndistinct coefficients
   * 002 - use ndistinct coefficients to improve GROUP BY estimates
   * 003 - use ndistinct coefficients in clausesel.c (not sure)
   * 004 - add functional dependencies (build + clausesel.c)
   * 005 - add multivariate MCV (build + clausesel.c)
   * 006 - add multivariate histograms (build + clausesel.c)

I'm not sure about using the ndistinct coefficients in clausesel.c to estimate regular conditions - it's the place for which ndistinct coefficients were originally proposed by Kyotaro-san, but I seem to remember it was non-trivial to choose the best statistics when there were other types of stats available. But I'll look into that.


6) combining statistics

I've decided not to re-submit this part of the patch until the basic functionality gets in. I do think it's a very useful feature (despite having my doubts about the existing implementation), but it clearly distracts people.

Instead, the patch will use some simple selection strategy (e.g. using a single statistics covering most conditions) or perhaps something more advanced (e.g. non-overlapping statistics). But nothing complicated.


7) enriching the query plan

Sadly, none of the reviews provides any sort of feedback on how to enrich the query plan with information about statistics (instead of doing that in clausesel.c in ad-hoc ephemeral manner).

So I'm still a bit stuck on this :-(


8) join statistics

Not directly related to the current patch, but I recommend reading this paper quantifying impact of each part of query optimizer (estimates, cost model, plan enumeration):

    http://www.vldb.org/pvldb/vol9/p204-leis.pdf

The one conclusion that I take from it is we really need to think about improving the join estimates, somehow. Because it's by far the most significant source of issues (and the hardest one to fix).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to