Re: [HACKERS] proposal : cross-column stats

2010-12-27 Thread Tomas Vondra
Dne 24.12.2010 13:37, Florian Pflug napsal(a): > On Dec24, 2010, at 11:23 , Nicolas Barbier wrote: > >> 2010/12/24 Florian Pflug : >> >>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote: >>> I guess we could use the highest possible value (equal to the number of tuples) - according to

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
Dne 24.12.2010 04:41, Florian Pflug napsal(a): > The filter size could be derived from the table's statistics target, or > be otherwise user-definable. We could also auto-resize once it gets too > full. But still, that all seems awfully complex :-( Using a statistics target is a good idea I think.

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
Dne 24.12.2010 13:37, Florian Pflug napsal(a): > On Dec24, 2010, at 11:23 , Nicolas Barbier wrote: > >> 2010/12/24 Florian Pflug : >> >>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote: >>> I guess we could use the highest possible value (equal to the number of tuples) - according to

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Tomas Vondra
Dne 24.12.2010 13:15, t...@fuzzy.cz napsal(a): >> 2010/12/24 Florian Pflug : >> >>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote: >>> I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with 1%

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Florian Pflug
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote: > 2010/12/24 Florian Pflug : > >> On Dec23, 2010, at 20:39 , Tomas Vondra wrote: >> >>> I guess we could use the highest possible value (equal to the number >>> of tuples) - according to wiki you need about 10 bits per element >>> with 1% e

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread tv
> 2010/12/24 Florian Pflug : > >> On Dec23, 2010, at 20:39 , Tomas Vondra wrote: >> >>>   I guess we could use the highest possible value (equal to the number >>>   of tuples) - according to wiki you need about 10 bits per element >>>   with 1% error, i.e. about 10MB of memory for each million of >

Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Nicolas Barbier
2010/12/24 Florian Pflug : > On Dec23, 2010, at 20:39 , Tomas Vondra wrote: > >>   I guess we could use the highest possible value (equal to the number >>   of tuples) - according to wiki you need about 10 bits per element >>   with 1% error, i.e. about 10MB of memory for each million of >>   elem

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Florian Pflug
On Dec23, 2010, at 20:39 , Tomas Vondra wrote: > Dne 21.12.2010 16:54, Florian Pflug napsal(a): >> I think that maybe it'd be acceptable to scan a large portion of the >> table to estimate dist(A,B), *if* we must only do so only once in a while. >> But even with >> a full scan, how would we arrive

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
Dne 21.12.2010 19:03, Tomas Vondra napsal(a): > My plan for the near future (a few weeks) is to build a simple 'module' > with the ability to estimate the number of rows for a given condition. > This could actually be quite useful as a stand-alone contrib module, as > the users often ask how to get

Re: [HACKERS] proposal : cross-column stats

2010-12-23 Thread Tomas Vondra
Dne 21.12.2010 16:54, Florian Pflug napsal(a): > I think that maybe it'd be acceptable to scan a large portion of the > table to estimate dist(A,B), *if* we must only do so only once in a while. > But even with > a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all > values

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Tomas Vondra
Dne 21.12.2010 16:54, Florian Pflug napsal(a): >> Hmmm, maybe we could give this possibility (to identify two separate >> groups of columns) to the user. So instead of 'buid stats for (A,B,C)' the >> user would say 'build stats for (A,B) and (C)' - this actually represents >> apriori knowledge of d

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
> On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote: This is the reason why they choose to always combine the values (with varying weights). >>> >>> There are no varying weights involved there. What they do is to express >>> P(A=x,B=y) once as >>> >>> ... >>> >>> P(A=x,B=y) ~= P(B=y|A=x)*P(

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 13:25 , t...@fuzzy.cz wrote: > And there's one additional - IMHO very important - requirement. The whole > thing should easily extend to more than two columns. This "IF (F(A,B) > > F(B,A)) THEN ..." probably is not a good solution regarding this. > > For example given 3 columns

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote: >>> This is the reason why they choose to always combine the values (with >>> varying weights). >> >> There are no varying weights involved there. What they do is to express >> P(A=x,B=y) once as >> >> ... >> >> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 +

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
> On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote: >> I doubt there is a way to this decision with just dist(A), dist(B) and >> dist(A,B) values. Well, we could go with a rule >> >> if [dist(A) == dist(A,B)] the [A => B] >> >> but that's very fragile. Think about estimates (we're not going to work

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Florian Pflug
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote: > I doubt there is a way to this decision with just dist(A), dist(B) and > dist(A,B) values. Well, we could go with a rule > > if [dist(A) == dist(A,B)] the [A => B] > > but that's very fragile. Think about estimates (we're not going to work > with

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug wrote: >> You might use that to decide if either A->B or B->a looks function-like >> enough to use the uniform bayesian approach. Or you might even go >> further, >> and decide *with* bayesian formula to use - the paper you cited always >> averages

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread Robert Haas
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug wrote: > I tried to pick up Robert's idea of quantifying "Implicativeness" - > i.e., finding a number between 0 and 1 that describes how close the > (A,B) are to representing a function A -> B. Actually Heikki's idea... > Observe that dist(A),dist(B

Re: [HACKERS] proposal : cross-column stats

2010-12-21 Thread tv
> On Dec18, 2010, at 17:59 , Tomas Vondra wrote: >> It seems to me you're missing one very important thing - this was not >> meant as a new default way to do estimates. It was meant as an option >> when the user (DBA, developer, ...) realizes the current solution gives >> really bad estimates (due

Re: [HACKERS] proposal : cross-column stats

2010-12-20 Thread Florian Pflug
On Dec18, 2010, at 17:59 , Tomas Vondra wrote: > It seems to me you're missing one very important thing - this was not > meant as a new default way to do estimates. It was meant as an option > when the user (DBA, developer, ...) realizes the current solution gives > really bad estimates (due to cor

Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Tomas Vondra
Dne 19.12.2010 21:21, Simon Riggs napsal(a): > On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: >> Robert Haas writes: >>> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote: The proposed solution is based on contingency tables, built for selected groups of columns (not for each poss

Re: [HACKERS] proposal : cross-column stats

2010-12-19 Thread Simon Riggs
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: > Robert Haas writes: > > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote: > >> The proposed solution is based on contingency tables, built for selected > >> groups of columns (not for each possible group). And the contingency > >> table give

Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Tomas Vondra
Dne 18.12.2010 16:59, Florian Pflug napsal(a): > On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote: > >>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: Well, not really - I haven't done any experiments with it. For two columns selectivity equation is (dist(A) * sel(A) + dist(

Re: [HACKERS] proposal : cross-column stats

2010-12-18 Thread Florian Pflug
On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote: >> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: >>> Well, not really - I haven't done any experiments with it. For two >>> columns selectivity equation is >>> >>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) >>> >>> where A and B a

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread tv
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: >> Well, not really - I haven't done any experiments with it. For two >> columns selectivity equation is >> >> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) >> >> where A and B are columns, dist(X) is number of distinct values in >> co

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Florian Pflug
On Dec17, 2010, at 23:12 , Tomas Vondra wrote: > Well, not really - I haven't done any experiments with it. For two > columns selectivity equation is > > (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) > > where A and B are columns, dist(X) is number of distinct values in > column X

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 17.12.2010 22:41, Heikki Linnakangas napsal(a): > On 17.12.2010 23:13, Tomas Vondra wrote: >> Dne 17.12.2010 19:58, Robert Haas napsal(a): >>> I haven't read the paper yet (sorry) but just off the top of my head, >>> one possible problem here is that our n_distinct estimates aren't >>> always v

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 17.12.2010 22:24, Tom Lane napsal(a): > That seems likely to be even more unreliable than our existing > single-column estimates :-( > > regards, tom lane Well, yes :-( I guess this is a place where we could use a multi-column index, if it contains all the interesting c

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Heikki Linnakangas
On 17.12.2010 23:13, Tomas Vondra wrote: Dne 17.12.2010 19:58, Robert Haas napsal(a): I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very accurate, especially for large tables. As we've discussed bef

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tom Lane
Tomas Vondra writes: > AFAIK it will work with reasonably precise estimates, but the point is > you need an estimate of distinct values of the whole group of columns. > So when you want to get an estimate for queries on columns (a,b), you > need the number of distinct value combinations of these t

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 17.12.2010 19:58, Robert Haas napsal(a): > I haven't read the paper yet (sorry) but just off the top of my head, > one possible problem here is that our n_distinct estimates aren't > always very accurate, especially for large tables. As we've discussed > before, making them accurate requires s

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Hi, I've read about 10 papers about estimation this week, and I have gained some insight. I'm not going to describe each of the papers here, I plan to set up a wiki page about cross column statistics http://wiki.postgresql.org/wiki/Cross_Columns_Stats and I'll put the list of papers and some b

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra wrote: > In the end, all they need to compute an estimate is number of distinct > values for each of the columns (we already have that in pg_stats) and a > number of distinct values for the group of columns in a query. They > really don't need any mul

Re: [HACKERS] proposal : cross-column stats

2010-12-17 Thread Tomas Vondra
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a): > On 12.12.2010 15:17, Martijn van Oosterhout wrote: >> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: >> Very cool that you're working on this. > > +1 > >>> Lets talk about one special case - I'll explain how the proposed >>> sol

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Robert Haas
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra wrote: >> I'll suggest again how to decide *which* columns to cross: whichever >> columns are combined in composite indexes.  In version 2, allow the DBA >> to specify combinations. > I think this is a bit early to discuss this, given the fact that we

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Pavel Stehule
2010/12/13 Josh Berkus : > Tomas, > >>   (a) find out what statistics do we need to collect and how to use it >>   (b) implement a really stupid inefficient solution >>   (c) optimize in iterations, i.e. making it faster, consuming less >>       space etc. > > I'll suggest again how to decide *whic

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 22:50, Josh Berkus napsal(a): > Tomas, > >> (a) find out what statistics do we need to collect and how to use it >> (b) implement a really stupid inefficient solution >> (c) optimize in iterations, i.e. making it faster, consuming less >> space etc. > > I'll suggest aga

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Josh Berkus
Tomas, > (a) find out what statistics do we need to collect and how to use it > (b) implement a really stupid inefficient solution > (c) optimize in iterations, i.e. making it faster, consuming less > space etc. I'll suggest again how to decide *which* columns to cross: whichever colu

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 18:59, Joshua Tolley napsal(a): > On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote: >> Another quick note: I think that storing the full contingency table is >> wasteful since the marginals are already stored in the single column >> statistics. Look at copulas [2] ( FWIW

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 16:38, Tom Lane napsal(a): > The reason that this wasn't done years ago is precisely that nobody's > figured out how to do it with a tolerable amount of stats data and a > tolerable amount of processing time (both at ANALYZE time and during > query planning). It's not hard to see wh

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tomas Vondra
Dne 13.12.2010 16:34, Tom Lane napsal(a): > Tomas Vondra writes: >> Well, until this point we've discussed failure cases involving 'AND' >> conditions. What about 'OR' conditions? I think the current optimizer >> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I >> found in backe

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Joshua Tolley
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote: > Another quick note: I think that storing the full contingency table is > wasteful since the marginals are already stored in the single column > statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was > looking at this a cou

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tom Lane
Robert Haas writes: > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote: >> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >>

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Tom Lane
Tomas Vondra writes: > Well, until this point we've discussed failure cases involving 'AND' > conditions. What about 'OR' conditions? I think the current optimizer > computes the selectivity as 's1+s2 - s1*s2' (at least that's what I > found in backend/optimizer/path/clausesel.c:630). If you can

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread tv
> On 2010-12-13 03:28, Robert Haas wrote: >> Well, I'm not real familiar with contingency tables, but it seems like >> you could end up needing to store a huge amount of data to get any >> benefit out of it, in some cases. For example, in the United States, >> there are over 40,000 postal codes, a

Re: [HACKERS] proposal : cross-column stats

2010-12-13 Thread Yeb Havinga
On 2010-12-13 03:28, Robert Haas wrote: Well, I'm not real familiar with contingency tables, but it seems like you could end up needing to store a huge amount of data to get any benefit out of it, in some cases. For example, in the United States, there are over 40,000 postal codes, and some even

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread tv
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote: >> Dne 13.12.2010 03:00, Robert Haas napsal(a): >>> Well, the question is what data you are actually storing.  It's >>> appealing to store a measure of the extent to which a constraint on >>> column X constrains column Y, because you'd only ne

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Nathan Boley
>> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >> compute the selectivity. Or am I missing something? > > Well, I'm not real fa

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote: > Dne 13.12.2010 03:00, Robert Haas napsal(a): >> Well, the question is what data you are actually storing.  It's >> appealing to store a measure of the extent to which a constraint on >> column X constrains column Y, because you'd only need to

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 13.12.2010 03:00, Robert Haas napsal(a): > Well, the question is what data you are actually storing. It's > appealing to store a measure of the extent to which a constraint on > column X constrains column Y, because you'd only need to store > O(ncolumns^2) values, which would be reasonably com

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
> P(A|B) = P(A and B) / P(B). Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions? I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in backend/optimizer/path/clausesel.c:630). Sometimes

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra wrote: > Dne 13.12.2010 01:05, Robert Haas napsal(a): >> This is a good idea, but I guess the question is what you do next.  If >> you know that the "applicability" is 100%, you can disregard the >> restriction clause on the implied column.  And if it

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 13.12.2010 01:05, Robert Haas napsal(a): > This is a good idea, but I guess the question is what you do next. If > you know that the "applicability" is 100%, you can disregard the > restriction clause on the implied column. And if it has no > implicatory power, then you just do what we do now

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Robert Haas
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas wrote: > The way I think of that problem is that once you know the postcode, knowing > the city name doesn't add any information. The postcode implies the city > name. So the selectivity for "postcode = ? AND city = ?" should be the > selectivity

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 12.12.2010 17:33, Florian Pflug napsal(a): > On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote: >> The way I think of that problem is that once you know the postcode, knowing >> the city name doesn't add any information. The postcode implies the city >> name. So the selectivity for "postcod

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Florian Pflug
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote: > The way I think of that problem is that once you know the postcode, knowing > the city name doesn't add any information. The postcode implies the city > name. So the selectivity for "postcode = ? AND city = ?" should be the > selectivity of

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a): >> The classic failure case has always been: postcodes and city names. >> Strongly correlated, but in a way that the computer can't easily see. > > Yeah, and that's actually analogous to the example I used in my > presentation. > > The way I thi

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Tomas Vondra
Hi! Dne 12.12.2010 15:17, Martijn van Oosterhout napsal(a): >> Lets talk about one special case - I'll explain how the proposed >> solution works, and then I'll explain how to make it more general, what >> improvements are possible, what issues are there. Anyway this is by no >> means a perfect or

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Heikki Linnakangas
On 12.12.2010 15:17, Martijn van Oosterhout wrote: On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: Very cool that you're working on this. +1 Lets talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what i

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Martijn van Oosterhout
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: > Hi everyone, > > one of the ssesion I've attended on PgDay last week was Heikki's session > about statistics in PostgreSQL. One of the issues he mentioned (and one > I regularly run into) is the absence of cross-column stats. When the

[HACKERS] proposal : cross-column stats

2010-12-11 Thread Tomas Vondra
Hi everyone, one of the ssesion I've attended on PgDay last week was Heikki's session about statistics in PostgreSQL. One of the issues he mentioned (and one I regularly run into) is the absence of cross-column stats. When the columns are not independent, this usually result in poor estimates (and