On 24.12.2010 13:37, Florian Pflug wrote:
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
>
>> 2010/12/24 Florian Pflug :
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>> I guess we could use the highest possible value (equal to the number
>>>> of tuples) - according to
On 24.12.2010 04:41, Florian Pflug wrote:
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(
Using a statistics target is a good idea I think.
On 24.12.2010 13:15, t...@fuzzy.cz wrote:
>> 2010/12/24 Florian Pflug :
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>> I guess we could use the highest possible value (equal to the number
>>>> of tuples) - according to wiki you need about 10 bits per element
>>>> with 1%
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
> 2010/12/24 Florian Pflug :
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>> I guess we could use the highest possible value (equal to the number
>>> of tuples) - according to wiki you need about 10 bits per element
>>> with 1% e
> 2010/12/24 Florian Pflug :
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>> I guess we could use the highest possible value (equal to the number
>>> of tuples) - according to wiki you need about 10 bits per element
>>> with 1% error, i.e. about 10MB of memory for each million of
>
2010/12/24 Florian Pflug :
> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>
>> I guess we could use the highest possible value (equal to the number
>> of tuples) - according to wiki you need about 10 bits per element
>> with 1% error, i.e. about 10MB of memory for each million of
>> elem
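For reference, the 10-bits-per-element figure follows from the standard Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2: at a 1% false-positive rate that is roughly 9.6 bits per element, i.e. about 10 megabits (a bit over 1 MB) per million distinct elements. A minimal sketch of the arithmetic in Python, purely illustrative and not proposed patch code:

import math

def bloom_bits(n_elements, fp_rate):
    # Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n_elements * math.log(fp_rate) / math.log(2) ** 2)

def bloom_hashes(n_bits, n_elements):
    # Optimal number of hash functions: k = (m / n) * ln 2
    return max(1, round(n_bits / n_elements * math.log(2)))

n = 1_000_000                    # expected distinct (A,B) combinations
m = bloom_bits(n, 0.01)
print(m / n)                     # ~9.6 bits per element
print(m / 8 / (1024 * 1024))     # ~1.1 MiB for the whole filter
print(bloom_hashes(m, n))        # 7 hash functions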
On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
> On 21.12.2010 16:54, Florian Pflug wrote:
>> I think that maybe it'd be acceptable to scan a large portion of the
>> table to estimate dist(A,B), *if* we must do so only once in a while.
>> But even with
>> a full scan, how would we arrive
On 21.12.2010 19:03, Tomas Vondra wrote:
> My plan for the near future (a few weeks) is to build a simple 'module'
> with the ability to estimate the number of rows for a given condition.
> This could actually be quite useful as a stand-alone contrib module, as
> the users often ask how to get
On 21.12.2010 16:54, Florian Pflug wrote:
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must do so only once in a while.
> But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all
> values
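To make the "keeping all values" cost concrete: the obvious exact way to get dist(A,B) from a scan is to remember every distinct (A,B) pair, and the memory for that grows with the number of distinct combinations. A Bloom filter (or any probabilistic distinct counter) would replace the set in this toy sketch with a fixed-size structure, trading exactness for bounded memory; the rows and values below are made up:

def exact_dist_ab(rows):
    # Exact dist(A,B): remember every distinct (A,B) combination seen
    # during the scan; memory is proportional to dist(A,B) itself.
    seen = set()
    for a, b in rows:
        seen.add((a, b))
    return len(seen)

rows = [(11000, 'Praha'), (11000, 'Praha'), (11001, 'Praha'), (60200, 'Brno')]
print(exact_dist_ab(rows))   # 3 distinct (A,B) combinations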
On 21.12.2010 16:54, Florian Pflug wrote:
>> Hmmm, maybe we could give this possibility (to identify two separate
>> groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
>> user would say 'build stats for (A,B) and (C)' - this actually represents
>> a priori knowledge of d
> On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
>>>> This is the reason why they choose to always combine the values (with
>>>> varying weights).
>>>
>>> There are no varying weights involved there. What they do is to express
>>> P(A=x,B=y) once as
>>>
>>> ...
>>>
>>> P(A=x,B=y) ~= P(B=y|A=x)*P(
On Dec21, 2010, at 13:25 , t...@fuzzy.cz wrote:
> And there's one additional - IMHO very important - requirement. The whole
> thing should easily extend to more than two columns. This "IF (F(A,B) >
> F(B,A)) THEN ..." is probably not a good solution in that respect.
>
> For example given 3 columns
On Dec21, 2010, at 15:51 , t...@fuzzy.cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>>
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>>
>> ...
>>
>> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 +
> On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
>> I doubt there is a way to this decision with just dist(A), dist(B) and
>> dist(A,B) values. Well, we could go with a rule
>>
>> if [dist(A) == dist(A,B)] then [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
On Dec21, 2010, at 11:37 , t...@fuzzy.cz wrote:
> I doubt there is a way to this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
>
> if [dist(A) == dist(A,B)] then [A => B]
>
> but that's very fragile. Think about estimates (we're not going to work
> with
> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug wrote:
>> You might use that to decide if either A->B or B->A looks function-like
>> enough to use the uniform bayesian approach. Or you might even go
>> further,
>> and decide *which* bayesian formula to use - the paper you cited always
>> averages
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug wrote:
> I tried to pick up Robert's idea of quantifying "Implicativeness" -
> i.e., finding a number between 0 and 1 that describes how close the
> (A,B) are to representing a function A -> B.
Actually Heikki's idea...
> Observe that dist(A),dist(B
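The measure itself is cut off above, so the following is only one way such a 0-to-1 number could be built from dist(A), dist(B) and dist(A,B) alone - an assumption for illustration, not necessarily the formula proposed in the thread. It uses the fact that dist(A) <= dist(A,B) <= dist(A) * dist(B) always holds, mapping the functional-dependency end to 1 and the fully independent end to 0:

def implicativeness(dist_a, dist_b, dist_ab):
    # Hypothetical measure of how function-like A -> B is, in [0, 1]:
    #   1.0 when dist(A,B) == dist(A)            (A determines B)
    #   0.0 when dist(A,B) == dist(A) * dist(B)  (every combination occurs)
    span = dist_a * dist_b - dist_a
    if span == 0:
        return 1.0               # dist(B) == 1, so B is trivially determined
    return (dist_a * dist_b - dist_ab) / span

# postcode -> city is almost a function:
print(implicativeness(40_000, 20_000, 41_000))   # ~1.0
# two effectively independent columns:
print(implicativeness(100, 100, 9_500))          # ~0.05

Comparing implicativeness(A, B) against implicativeness(B, A) then gives the directional check discussed above, although, as noted elsewhere in the thread, any such rule still has to cope with noisy n_distinct estimates.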
> On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
>> It seems to me you're missing one very important thing - this was not
>> meant as a new default way to do estimates. It was meant as an option
>> when the user (DBA, developer, ...) realizes the current solution gives
>> really bad estimates (due
On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
> It seems to me you're missing one very important thing - this was not
> meant as a new default way to do estimates. It was meant as an option
> when the user (DBA, developer, ...) realizes the current solution gives
> really bad estimates (due to cor
On 19.12.2010 21:21, Simon Riggs wrote:
> On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
>> Robert Haas writes:
>>> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
>>>> The proposed solution is based on contingency tables, built for selected
>>>> groups of columns (not for each poss
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
> Robert Haas writes:
> > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
> >> The proposed solution is based on contingency tables, built for selected
> >> groups of columns (not for each possible group). And the contingency
> >> table give
On 18.12.2010 16:59, Florian Pflug wrote:
> On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
>
>>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>>> Well, not really - I haven't done any experiments with it. For two
>>>> columns selectivity equation is
>>>> (dist(A) * sel(A) + dist(
On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>> Well, not really - I haven't done any experiments with it. For two
>>> columns selectivity equation is
>>>
>>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>
>>> where A and B a
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>> Well, not really - I haven't done any experiments with it. For two
>> columns selectivity equation is
>>
>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>
>> where A and B are columns, dist(X) is number of distinct values in
>> co
On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
> Well, not really - I haven't done any experiments with it. For two
> columns selectivity equation is
>
> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>
> where A and B are columns, dist(X) is number of distinct values in
> column X
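For what it's worth, that equation is what you get by writing P(A=x, B=y) both as P(B=y|A=x) * P(A=x) and as P(A=x|B=y) * P(B=y), approximating the conditionals as dist(A)/dist(A,B) and dist(B)/dist(A,B) respectively (i.e. assuming each value of A co-occurs with about dist(A,B)/dist(A) values of B, uniformly), and averaging the two directions. A small sketch of the arithmetic, illustrative only:

def combined_selectivity(sel_a, sel_b, dist_a, dist_b, dist_ab):
    # The formula quoted above:
    #   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
    # i.e. the average of the two one-directional estimates
    #   sel(A) * dist(A) / dist(A,B)   and   sel(B) * dist(B) / dist(A,B)
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

# postcode/city example: dist(A,B) ~ dist(A), so the estimate collapses
# to sel(postcode) instead of sel(postcode) * sel(city).
print(combined_selectivity(1 / 40_000, 1 / 5_000,
                           dist_a=40_000, dist_b=5_000, dist_ab=40_000))
# 2.5e-05, i.e. 1/40000; the independence assumption would give 5e-09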
On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always v
On 17.12.2010 22:24, Tom Lane wrote:
> That seems likely to be even more unreliable than our existing
> single-column estimates :-(
>
> regards, tom lane
Well, yes :-(
I guess this is a place where we could use a multi-column index, if it
contains all the interesting c
On 17.12.2010 23:13, Tomas Vondra wrote:
> On 17.12.2010 19:58, Robert Haas wrote:
>> I haven't read the paper yet (sorry) but just off the top of my head,
>> one possible problem here is that our n_distinct estimates aren't
>> always very accurate, especially for large tables. As we've discussed
>> bef
Tomas Vondra writes:
> AFAIK it will work with reasonably precise estimates, but the point is
> you need an estimate of distinct values of the whole group of columns.
> So when you want to get an estimate for queries on columns (a,b), you
> need the number of distinct value combinations of these t
On 17.12.2010 19:58, Robert Haas wrote:
> I haven't read the paper yet (sorry) but just off the top of my head,
> one possible problem here is that our n_distinct estimates aren't
> always very accurate, especially for large tables. As we've discussed
> before, making them accurate requires s
Hi,
I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here, I plan
to set up a wiki page about cross column statistics
http://wiki.postgresql.org/wiki/Cross_Columns_Stats
and I'll put the list of papers and some b
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra wrote:
> In the end, all they need to compute an estimate is number of distinct
> values for each of the columns (we already have that in pg_stats) and a
> number of distinct values for the group of columns in a query. They
> really don't need any mul
On 12.12.2010 15:43, Heikki Linnakangas wrote:
> On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
>> Very cool that you're working on this.
>
> +1
>
>>> Let's talk about one special case - I'll explain how the proposed
>>> sol
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra wrote:
>> I'll suggest again how to decide *which* columns to cross: whichever
>> columns are combined in composite indexes. In version 2, allow the DBA
>> to specify combinations.
> I think this is a bit early to discuss this, given the fact that we
2010/12/13 Josh Berkus :
> Tomas,
>
>> (a) find out what statistics we need to collect and how to use them
>> (b) implement a really stupid inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>> space etc.
>
> I'll suggest again how to decide *whic
On 13.12.2010 22:50, Josh Berkus wrote:
> Tomas,
>
>> (a) find out what statistics we need to collect and how to use them
>> (b) implement a really stupid inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>> space etc.
>
> I'll suggest aga
Tomas,
> (a) find out what statistics we need to collect and how to use them
> (b) implement a really stupid inefficient solution
> (c) optimize in iterations, i.e. making it faster, consuming less
> space etc.
I'll suggest again how to decide *which* columns to cross: whichever
colu
On 13.12.2010 18:59, Joshua Tolley wrote:
> On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
>> Another quick note: I think that storing the full contingency table is
>> wasteful since the marginals are already stored in the single column
>> statistics. Look at copulas [2] ( FWIW
On 13.12.2010 16:38, Tom Lane wrote:
> The reason that this wasn't done years ago is precisely that nobody's
> figured out how to do it with a tolerable amount of stats data and a
> tolerable amount of processing time (both at ANALYZE time and during
> query planning). It's not hard to see wh
On 13.12.2010 16:34, Tom Lane wrote:
> Tomas Vondra writes:
>> Well, until this point we've discussed failure cases involving 'AND'
>> conditions. What about 'OR' conditions? I think the current optimizer
>> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
>> found in backe
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
> Another quick note: I think that storing the full contingency table is
> wasteful since the marginals are already stored in the single column
> statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
> looking at this a cou
Robert Haas writes:
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>>
Tomas Vondra writes:
> Well, until this point we've discussed failure cases involving 'AND'
> conditions. What about 'OR' conditions? I think the current optimizer
> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
> found in backend/optimizer/path/clausesel.c:630).
If you can
> On 2010-12-13 03:28, Robert Haas wrote:
>> Well, I'm not real familiar with contingency tables, but it seems like
>> you could end up needing to store a huge amount of data to get any
>> benefit out of it, in some cases. For example, in the United States,
>> there are over 40,000 postal codes, a
On 2010-12-13 03:28, Robert Haas wrote:
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases. For example, in the United States,
> there are over 40,000 postal codes, and some even
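The storage worry is easy to quantify: a dense contingency table needs one counter per value combination, i.e. up to dist(A) * dist(B) cells. A back-of-the-envelope sketch - only the 40,000 postal codes figure comes from the message above, the city count and counter width are made up:

postal_codes = 40_000        # from the message above
city_names = 20_000          # assumed purely for illustration
cell_bytes = 4               # hypothetical 4-byte counter per cell
print(postal_codes * city_names * cell_bytes / 1024 ** 3)   # ~3 GiB, dense

In practice only the combinations that actually occur would need to be stored, which is one reason the thread keeps coming back to building such stats only for selected groups of columns, or to much smaller summaries such as the distinct-count formulas discussed above.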
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
>> On 13.12.2010 03:00, Robert Haas wrote:
>>> Well, the question is what data you are actually storing. It's
>>> appealing to store a measure of the extent to which a constraint on
>>> column X constrains column Y, because you'd only ne
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real fa
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra wrote:
> On 13.12.2010 03:00, Robert Haas wrote:
>> Well, the question is what data you are actually storing. It's
>> appealing to store a measure of the extent to which a constraint on
>> column X constrains column Y, because you'd only need to
On 13.12.2010 03:00, Robert Haas wrote:
> Well, the question is what data you are actually storing. It's
> appealing to store a measure of the extent to which a constraint on
> column X constrains column Y, because you'd only need to store
> O(ncolumns^2) values, which would be reasonably com
> P(A|B) = P(A and B) / P(B).
Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).
Sometimes
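For the OR case, s1 + s2 - s1*s2 is just inclusion-exclusion with the independence assumption plugged in for the conjunction; any improved cross-column estimate of the AND part slots straight into the same identity. A minimal sketch, assuming such an estimate is available as s_and:

def or_selectivity_current(s1, s2):
    # Current formula: inclusion-exclusion with P(A and B) taken as s1 * s2
    return s1 + s2 - s1 * s2

def or_selectivity_with_joint(s1, s2, s_and):
    # Same identity, P(A or B) = s1 + s2 - P(A and B), but using a
    # cross-column estimate of the conjunction instead of s1 * s2.
    return s1 + s2 - s_and

# strongly correlated clauses, e.g. postcode = X OR city = city-of-X:
s1 = s2 = 0.001
print(or_selectivity_current(s1, s2))            # 0.001999
print(or_selectivity_with_joint(s1, s2, 0.001))  # 0.001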
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra wrote:
> On 13.12.2010 01:05, Robert Haas wrote:
>> This is a good idea, but I guess the question is what you do next. If
>> you know that the "applicability" is 100%, you can disregard the
>> restriction clause on the implied column. And if it
On 13.12.2010 01:05, Robert Haas wrote:
> This is a good idea, but I guess the question is what you do next. If
> you know that the "applicability" is 100%, you can disregard the
> restriction clause on the implied column. And if it has no
> implicatory power, then you just do what we do now
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing
> the city name doesn't add any information. The postcode implies the city
> name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity
On 12.12.2010 17:33, Florian Pflug wrote:
> On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
>> The way I think of that problem is that once you know the postcode, knowing
>> the city name doesn't add any information. The postcode implies the city
>> name. So the selectivity for "postcod
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing
> the city name doesn't add any information. The postcode implies the city
> name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity of
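Spelled out, the rule is simply that once the postcode clause is applied the city clause is redundant, so the combined selectivity should be sel(postcode = ?) rather than the product of the two. A toy sketch of that special case, with made-up numbers:

def postcode_city_selectivity(sel_postcode, sel_city, postcode_implies_city):
    # If the dependency postcode -> city is known to hold, the city clause
    # adds no further restriction; otherwise fall back to independence.
    if postcode_implies_city:
        return sel_postcode
    return sel_postcode * sel_city

print(postcode_city_selectivity(1 / 40_000, 1 / 5_000, True))    # 2.5e-05
print(postcode_city_selectivity(1 / 40_000, 1 / 5_000, False))   # 5e-09

An explicit yes/no dependency flag like this is the limiting case of the softer, distinct-count-based measures discussed elsewhere in the thread.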
On 12.12.2010 15:43, Heikki Linnakangas wrote:
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I thi
Hi!
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Very cool that you're working on this.
+1
>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> i
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Hi everyone,
>
> one of the sessions I attended at PgDay last week was Heikki's session
> about statistics in PostgreSQL. One of the issues he mentioned (and one
> I regularly run into) is the absence of cross-column stats. When the