On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
> It appears to me that a lot of people in this thread are confusing
> correlation in the sense of statistical correlation between two
> variables with correlation in the sense of how well physically-ordered
> a column is.
For what it's worth, neither version of correlation was what I had in mind. Statistical correlation between two variables is a single number, is fairly easy to calculate, and probably wouldn't help query plans much at all. I'm interested in gathering something more complex.

The data model I have in mind (which, I note, I have *not* shown to actually help a large number of query plans -- demonstrating that is obviously an important part of what I'd need to do in all this) involves instead a matrix of frequency counts. Right now our "histogram" values are really quantiles; the statistics_target T for a column determines the number of quantiles we keep track of, and we grab values from the column into an ordered list L such that approximately 1/T of the entries in the column fall between values L[n] and L[n+1]. I'm thinking that multicolumn statistics would instead divide the range of each column into T equal-width segments, forming (in the two-column case) a matrix whose entries are frequency counts: each cell holds the number of rows whose value in each column falls within that cell's segment of the column's range. I just realized while writing this that this might not extend to situations where the two columns come from different tables and don't necessarily have the same row count, but I'll have to think about that.

Anyway, the size of such a structure grows exponentially in the number of columns involved (T^k cells for k columns), so cross-column statistics covering just a few columns could easily involve millions of values at fairly normal statistics_targets. That's where the compression ideas come into play. This would obviously need a fair bit of testing, but it's certainly conceivable that modern regression techniques could reduce the frequency matrix to a set of functions with a small number of parameters. Whether the planner could then look up values for a given set of columns without spending more time than it's worth is another question that needs exploring.

I started this thread knowing that past discussions have posed the following questions:

1. What sorts of cross-column data can we really use?
2. Can we collect that information?
3. How do we know which columns to track?

For what it's worth, my original question was whether anyone had concerns beyond these, and I think that has been fairly well answered in this thread.

- Josh / eggyknap
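P.S. To make the bucketing concrete, here's a rough Python sketch of the matrix I have in mind. The column arrays a and b and the target T are made-up stand-ins (nothing from the backend); numpy's histogram2d happens to do exactly this kind of equal-width bucketing:

    import numpy as np

    T = 10  # stand-in for statistics_target
    # a, b: the two columns' values, as parallel arrays (fake data here)
    a = np.random.normal(size=100000)
    b = a + np.random.normal(scale=0.5, size=100000)  # deliberately correlated

    # Split each column's range into T equal-width segments and count the
    # rows landing in each (segment-of-a, segment-of-b) cell.
    counts, a_edges, b_edges = np.histogram2d(a, b, bins=T)

    # A predicate like
    #   a BETWEEN a_edges[i] AND a_edges[i+1]
    #   AND b BETWEEN b_edges[j] AND b_edges[j+1]
    # would get selectivity counts[i, j] / counts.sum().
    print(counts[4, 4] / counts.sum())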
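P.P.S. On the compression side, purely as an illustration of the "small number of parameters" idea (not a claim that it's the right technique), a truncated SVD reduces the T x T matrix to roughly 2kT + k numbers:

    import numpy as np

    # counts: the T x T frequency matrix from the sketch above
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)

    k = 3                                    # components kept
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k reconstruction

    # How much of the matrix do 2*k*T + k parameters preserve?
    err = np.linalg.norm(counts - approx) / np.linalg.norm(counts)
    print("rank-%d relative error: %.3f" % (k, err))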