On 24.12.2010 13:37, Florian Pflug wrote:
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
>
>> 2010/12/24 Florian Pflug :
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
I guess we could use the highest possible value (equal to the number
of tuples) - according to
On 24.12.2010 04:41, Florian Pflug wrote:
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(
Using a statistics target is a good idea I think.
On 24.12.2010 13:15, t...@fuzzy.cz wrote:
>> 2010/12/24 Florian Pflug :
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
I guess we could use the highest possible value (equal to the number
of tuples) - according to wiki you need about 10 bits per element
with 1%
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
> 2010/12/24 Florian Pflug :
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>> I guess we could use the highest possible value (equal to the number
>>> of tuples) - according to wiki you need about 10 bits per element
>>> with 1% e
> 2010/12/24 Florian Pflug :
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>> I guess we could use the highest possible value (equal to the number
>>> of tuples) - according to wiki you need about 10 bits per element
>>> with 1% error, i.e. about 1.2MB of memory for each million of
>
2010/12/24 Florian Pflug :
> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>
>> I guess we could use the highest possible value (equal to the number
>> of tuples) - according to wiki you need about 10 bits per element
>> with 1% error, i.e. about 1.2MB of memory for each million of
>> elem
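For reference, the standard sizing rule for a Bloom filter holding n elements with false-positive rate p is m = -n * ln(p) / (ln 2)^2 bits. A minimal standalone sketch (not taken from the thread or any patch) that evaluates it for the numbers discussed above:

    #include <math.h>
    #include <stdio.h>

    /* Bits needed for a Bloom filter holding n elements with
     * false-positive rate p: m = -n * ln(p) / (ln 2)^2 */
    static double bloom_bits(double n, double p)
    {
        return -n * log(p) / (log(2.0) * log(2.0));
    }

    int main(void)
    {
        double n = 1e6;                    /* one million distinct values */
        double m = bloom_bits(n, 0.01);    /* 1% false positives */
        printf("%.1f bits per element, %.2f MB total\n",
               m / n, m / 8.0 / (1024.0 * 1024.0));
        /* prints roughly: 9.6 bits per element, 1.14 MB total */
        return 0;
    }

So the 1% figure works out to roughly 9.6 bits per element, about 1.2MB per million distinct values.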
On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
> On 21.12.2010 16:54, Florian Pflug wrote:
>> I think that maybe it'd be acceptable to scan a large portion of the
>> table to estimate dist(A,B), *if* we must do so only once in a while.
>> But even with
>> a full scan, how would we arrive
On 21.12.2010 19:03, Tomas Vondra wrote:
> My plan for the near future (a few weeks) is to build a simple 'module'
> with the ability to estimate the number of rows for a given condition.
> This could actually be quite useful as a stand-alone contrib module, as
> the users often ask how to get
On 21.12.2010 16:54, Florian Pflug wrote:
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must do so only once in a while.
> But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all
> values
On 21.12.2010 16:54, Florian Pflug wrote:
>> Hmmm, maybe we could give this possibility (to identify two separate
>> groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
>> user would say 'build stats for (A,B) and (C)' - this actually represents
>> a priori knowledge of d
> On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
This is the reason why they choose to always combine the values (with
varying weights).
>>>
>>> There are no varying weights involved there. What they do is to express
>>> P(A=x,B=y) once as
>>>
>>> ...
>>>
>>> P(A=x,B=y) ~= P(B=y|A=x)*P(
On Dec21, 2010, at 13:25 , t...@fuzzy.cz wrote:
> And there's one additional - IMHO very important - requirement. The whole
> thing should easily extend to more than two columns. This "IF (F(A,B) >
> F(B,A)) THEN ..." probably is not a good solution in that respect.
>
> For example given 3 columns
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>>
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>>
>> ...
>>
>> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 +
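Both decompositions of the joint probability are exact by the product rule, P(A=x,B=y) = P(B=y|A=x) * P(A=x) = P(A=x|B=y) * P(B=y), so averaging them with equal weights, as the quoted snippet starts to spell out, gives

    P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2

With exact probabilities the two terms are equal and the averaging changes nothing; it only matters once the conditionals are replaced by differently biased estimates.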
> On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
>> I doubt there is a way to make this decision with just dist(A), dist(B) and
>> dist(A,B) values. Well, we could go with a rule
>>
>> if [dist(A) == dist(A,B)] then [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
> I doubt there is a way to make this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
>
> if [dist(A) == dist(A,B)] then [A => B]
>
> but that's very fragile. Think about estimates (we're not going to work
> with
> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug wrote:
>> You might use that to decide if either A->B or B->A looks function-like
>> enough to use the uniform bayesian approach. Or you might even go
>> further,
>> and decide *which* Bayesian formula to use - the paper you cited always
>> averages
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug wrote:
> I tried to pick up Robert's idea of quantifying "Implicativeness" -
> i.e., finding a number between 0 and 1 that describes how close
> (A,B) comes to representing a function A -> B.
Actually Heikki's idea...
> Observe that dist(A),dist(B
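One simple 0-to-1 score in this spirit, sketched here purely as an illustration (it is an assumed ratio, not the measure from the mail), can be built from the same distinct counts discussed earlier:

    /* Illustrative only: a soft version of the "dist(A) == dist(A,B)
     * implies A => B" rule discussed earlier in the thread. Returns a
     * value in (0, 1]; it is exactly 1 when every value of A maps to a
     * single value of B, and shrinks as the dependency weakens.
     * dist_a and dist_ab are (estimated) distinct counts. */
    static double implicativeness(double dist_a, double dist_ab)
    {
        if (dist_ab <= 0.0)
            return 0.0;
        return dist_a / dist_ab;
    }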
> On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
>> It seems to me you're missing one very important thing - this was not
>> meant as a new default way to do estimates. It was meant as an option
>> when the user (DBA, developer, ...) realizes the current solution gives
>> really bad estimates (due
On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
> It seems to me you're missing one very important thing - this was not
> meant as a new default way to do estimates. It was meant as an option
> when the user (DBA, developer, ...) realizes the current solution gives
> really bad estimates (due to cor
On 19.12.2010 21:21, Simon Riggs wrote:
> On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
>> Robert Haas writes:
>>> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
The proposed solution is based on contingency tables, built for selected
groups of columns (not for each poss
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
> Robert Haas writes:
> > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
> >> The proposed solution is based on contingency tables, built for selected
> >> groups of columns (not for each possible group). And the contingency
> >> table give
On 18.12.2010 16:59, Florian Pflug wrote:
> On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
>
>>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
Well, not really - I haven't done any experiments with it. For two
columns selectivity equation is
(dist(A) * sel(A) + dist(
On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>> Well, not really - I haven't done any experiments with it. For two
>>> columns selectivity equation is
>>>
>>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>
>>> where A and B a
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>> Well, not really - I haven't done any experiments with it. For two
>> columns selectivity equation is
>>
>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>
>> where A and B are columns, dist(X) is number of distinct values in
>> co
On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
> Well, not really - I haven't done any experiments with it. For two
> columns selectivity equation is
>
> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>
> where A and B are columns, dist(X) is number of distinct values in
> column X
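Transcribed literally, the estimator reads as follows; this is a standalone sketch for experimenting with the formula, not code from any patch:

    /* Two-column selectivity estimate quoted above:
     *   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
     * sel_a, sel_b   - single-column selectivities of the two clauses
     * dist_a, dist_b - estimated distinct counts of the columns
     * dist_ab        - estimated distinct count of the column pair */
    static double sel_two_columns(double sel_a, double dist_a,
                                  double sel_b, double dist_b,
                                  double dist_ab)
    {
        return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab);
    }

Note that for plain equality clauses with sel(X) = 1/dist(X) the numerator becomes 2, so the estimate collapses to 1/dist(A,B) regardless of how strongly the columns are correlated.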
On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always v
On 17.12.2010 22:24, Tom Lane wrote:
> That seems likely to be even more unreliable than our existing
> single-column estimates :-(
>
> regards, tom lane
Well, yes :-(
I guess this is a place where we could use a multi-column index, if it
contains all the interesting c
On 17.12.2010 23:13, Tomas Vondra wrote:
On 17.12.2010 19:58, Robert Haas wrote:
I haven't read the paper yet (sorry) but just off the top of my head,
one possible problem here is that our n_distinct estimates aren't
always very accurate, especially for large tables. As we've discussed
bef
Tomas Vondra writes:
> AFAIK it will work with reasonably precise estimates, but the point is
> you need an estimate of distinct values of the whole group of columns.
> So when you want to get an estimate for queries on columns (a,b), you
> need the number of distinct value combinations of these t
On 17.12.2010 19:58, Robert Haas wrote:
> I haven't read the paper yet (sorry) but just off the top of my head,
> one possible problem here is that our n_distinct estimates aren't
> always very accurate, especially for large tables. As we've discussed
> before, making them accurate requires s
Hi,
I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here; I plan
to set up a wiki page about cross-column statistics
http://wiki.postgresql.org/wiki/Cross_Columns_Stats
and I'll put the list of papers and some b
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra wrote:
> In the end, all they need to compute an estimate is the number of distinct
> values for each of the columns (we already have that in pg_stats) and the
> number of distinct values for the group of columns in a query. They
> really don't need any mul
On 12.12.2010 15:43, Heikki Linnakangas wrote:
> On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
>> Very cool that you're working on this.
>
> +1
>
>>> Let's talk about one special case - I'll explain how the proposed
>>> sol
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra wrote:
>> I'll suggest again how to decide *which* columns to cross: whichever
>> columns are combined in composite indexes. In version 2, allow the DBA
>> to specify combinations.
> I think this is a bit early to discuss this, given the fact that we
2010/12/13 Josh Berkus :
> Tomas,
>
>> (a) find out what statistics we need to collect and how to use them
>> (b) implement a really stupid inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>> space etc.
>
> I'll suggest again how to decide *whic
On 13.12.2010 22:50, Josh Berkus wrote:
> Tomas,
>
>> (a) find out what statistics we need to collect and how to use them
>> (b) implement a really stupid inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>> space etc.
>
> I'll suggest aga
Tomas,
> (a) find out what statistics we need to collect and how to use them
> (b) implement a really stupid inefficient solution
> (c) optimize in iterations, i.e. making it faster, consuming less
> space etc.
I'll suggest again how to decide *which* columns to cross: whichever
colu
On 13.12.2010 18:59, Joshua Tolley wrote:
> On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
>> Another quick note: I think that storing the full contingency table is
>> wasteful since the marginals are already stored in the single column
>> statistics. Look at copulas [2] ( FWIW
On 13.12.2010 16:38, Tom Lane wrote:
> The reason that this wasn't done years ago is precisely that nobody's
> figured out how to do it with a tolerable amount of stats data and a
> tolerable amount of processing time (both at ANALYZE time and during
> query planning). It's not hard to see wh
On 13.12.2010 16:34, Tom Lane wrote:
> Tomas Vondra writes:
>> Well, until this point we've discussed failure cases involving 'AND'
>> conditions. What about 'OR' conditions? I think the current optimizer
>> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
>> found in backe
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
> Another quick note: I think that storing the full contingency table is
> wasteful since the marginals are already stored in the single column
> statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
> looking at this a cou
Robert Haas writes:
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>>
Tomas Vondra writes:
> Well, until this point we've discussed failure cases involving 'AND'
> conditions. What about 'OR' conditions? I think the current optimizer
> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
> found in backend/optimizer/path/clausesel.c:630).
If you can
> On 2010-12-13 03:28, Robert Haas wrote:
>> Well, I'm not real familiar with contingency tables, but it seems like
>> you could end up needing to store a huge amount of data to get any
>> benefit out of it, in some cases. For example, in the United States,
>> there are over 40,000 postal codes, a
On 2010-12-13 03:28, Robert Haas wrote:
Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases. For example, in the United States,
there are over 40,000 postal codes, and some even
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
>> On 13.12.2010 03:00, Robert Haas wrote:
>>> Well, the question is what data you are actually storing. It's
>>> appealing to store a measure of the extent to which a constraint on
>>> column X constrains column Y, because you'd only ne
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real fa
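As a tiny illustration of what such a table buys you (the numbers below are invented, not from the thread): a contingency table over (city, postcode) stores the fraction of rows for each combination,

                   pc=10000   pc=10001   pc=20000
    city=Praha        0.30       0.20       0.00
    city=Brno         0.00       0.00       0.50

so the selectivity of "city = 'Praha' AND postcode = 10001" is read directly from the cell (0.20), while the row and column sums reproduce the single-column selectivities (0.50 for Praha, 0.20 for postcode 10001). The cost Robert points at is that the number of cells grows with the product of the distinct counts.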
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
> On 13.12.2010 03:00, Robert Haas wrote:
>> Well, the question is what data you are actually storing. It's
>> appealing to store a measure of the extent to which a constraint on
>> column X constrains column Y, because you'd only need to
On 13.12.2010 03:00, Robert Haas wrote:
> Well, the question is what data you are actually storing. It's
> appealing to store a measure of the extent to which a constraint on
> column X constrains column Y, because you'd only need to store
> O(ncolumns^2) values, which would be reasonably com
> P(A|B) = P(A and B) / P(B).
Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).
Sometimes
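The 's1 + s2 - s1*s2' form is just inclusion-exclusion, P(A or B) = P(A) + P(B) - P(A and B), with the independence assumption P(A and B) = s1*s2 plugged into the last term. A small sketch (an illustration, not code from clausesel.c) that makes the shared AND term explicit:

    /* OR selectivity by inclusion-exclusion. Passing sel_and = s1 * s2
     * reproduces the current planner behaviour described above; a
     * cross-column-aware AND estimate could be substituted instead.
     * Sketch only, clamped to the valid range. */
    static double sel_or(double s1, double s2, double sel_and)
    {
        double s = s1 + s2 - sel_and;
        if (s < 0.0) return 0.0;
        if (s > 1.0) return 1.0;
        return s;
    }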
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra wrote:
> On 13.12.2010 01:05, Robert Haas wrote:
>> This is a good idea, but I guess the question is what you do next. If
>> you know that the "applicability" is 100%, you can disregard the
>> restriction clause on the implied column. And if it
On 13.12.2010 01:05, Robert Haas wrote:
> This is a good idea, but I guess the question is what you do next. If
> you know that the "applicability" is 100%, you can disregard the
> restriction clause on the implied column. And if it has no
> implicatory power, then you just do what we do now
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing
> the city name doesn't add any information. The postcode implies the city
> name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity
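To put rough numbers on that, using the figure of over 40,000 US postcodes mentioned later in the thread (the city count below is just an assumed round number): sel(postcode = ?) is about 1/40,000; under the independence assumption the planner would multiply that by sel(city = ?), say 1/20,000, and estimate roughly 1 row in 800 million, i.e. underestimate by a factor of about 20,000. Treating the postcode as implying the city keeps the estimate at 1/40,000.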
On 12.12.2010 17:33, Florian Pflug wrote:
> On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
>> The way I think of that problem is that once you know the postcode, knowing
>> the city name doesn't add any information. The postcode implies the city
>> name. So the selectivity for "postcod
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing
> the city name doesn't add any information. The postcode implies the city
> name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity of
On 12.12.2010 15:43, Heikki Linnakangas wrote:
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I thi
Hi!
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
Very cool that you're working on this.
+1
Let's talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
i
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Hi everyone,
>
> one of the sessions I attended at PgDay last week was Heikki's session
> about statistics in PostgreSQL. One of the issues he mentioned (and one
> I regularly run into) is the absence of cross-column stats. When the
Hi everyone,
one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and