Re: [HACKERS] Group-count estimation statistics

2005-02-01 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
>> Oh, I see, you want a "max" calculation in there too. Seems reasonable.
>> Any objections?

Yes. :-( What I said is only true in the absence of any WHERE clause
(or join).

Re: [HACKERS] Group-count estimation statistics

2005-02-01 Thread Manfred Koizar
On Mon, 31 Jan 2005 14:40:08 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
>Manfred Koizar <[EMAIL PROTECTED]> writes:
>> That's not what I meant. I tried to say that if we have a GROUP BY
>> several columns and one of these columns alone has more than N/10
>> distinct values, there's no way to get less than that many groups.

Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> That's not what I meant. I tried to say that if we have a GROUP BY
> several columns and one of these columns alone has more than N/10
> distinct values, there's no way to get less than that many groups.

Oh, I see, you want a "max" calculation in there too. Seems reasonable.
Any objections?

Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Manfred Koizar
On Mon, 31 Jan 2005 11:20:31 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
>Already done that way.
>if (relvarcount > 1)
>clamp *= 0.1;

That's not what I meant. I tried to say that if we have a GROUP BY
several columns and one of these columns alone has more than N/10
distinct values, there's no way to get less than that many groups.

Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Tom Lane
Manfred Koizar <[EMAIL PROTECTED]> writes:
> On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
>> we should consider
>> something like "clamp to size of table / 10" instead.
> ... unless a *single* grouping column is estimated to have more than
> N/10 distinct values, which should be easy to check.

Re: [HACKERS] Group-count estimation statistics

2005-01-31 Thread Manfred Koizar
On Fri, 28 Jan 2005 10:53:33 -0500, Tom Lane <[EMAIL PROTECTED]> wrote:
> we should consider
>something like "clamp to size of table / 10" instead.

... unless a *single* grouping column is estimated to have more than
N/10 distinct values, which should be easy to check.

Servus
 Manfred

Re: [HACKERS] Group-count estimation statistics

2005-01-29 Thread Mischa
> From: Sailesh Krishnamurthy <[EMAIL PROTECTED]>
>> "Tom" == Tom Lane <[EMAIL PROTECTED]> writes:
>
> Tom> The only real solution, of course, is to acquire cross-column
> Tom> statistics, but I don't see that happening in the near
> Tom> future.
>
> Another approach is a hybrid hashing scheme where we use a hash table
> until we run out of memory

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes:
> Greg Stark's thought about a power correction seemed interesting too, though
> again far too optimistic to trust without some good math to back it up.

Fwiw, I'm pretty sure good math is not going to back up my off-the-cuff
algorithm. But I did like the answer

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Sailesh Krishnamurthy
> "Tom" == Tom Lane <[EMAIL PROTECTED]> writes:

Tom> The only real solution, of course, is to acquire cross-column
Tom> statistics, but I don't see that happening in the near
Tom> future.

Another approach is a hybrid hashing scheme where we use a hash table
until we run out of memory

Re: [HACKERS] [pgsql-hackers] Group-count estimation statistics

2005-01-28 Thread Tom Lane
Josh Berkus writes:
> Why 10? I'd think we could come up with a slightly less arbitrary number,

Well, it's probably within an order of magnitude of the right thing ;-).
We know we don't want 1, but 100 seems awfully optimistic. If someone
can come up with a more defensible number then I'm all ears.

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Tom Lane
Kris Jurka <[EMAIL PROTECTED]> writes:
> The proposed change biases towards a hash plan which has no provision for
> spilling to disk. Slow is one thing, but excessive memory usage and
> possibly failing is another thing.

Keep in mind that we are replacing 7.4 code that had a serious tendency to

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Kris Jurka
On Fri, 28 Jan 2005, Tom Lane wrote:
> you don't GROUP BY unique combinations of columns over huge
> tables --- or at least, you shouldn't expect great performance if you do.

The proposed change biases towards a hash plan which has no provision for
spilling to disk. Slow is one thing, but excessive memory usage and
possibly failing is another thing.

Re: [HACKERS] [pgsql-hackers] Group-count estimation statistics

2005-01-28 Thread Josh Berkus
Tom,

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

Y'know, that's been on the todo list for a while. Surely someone is
inspired for 8.1/8.2? At least for columns which are indexed together?

> As a short-term

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> So why is it any more reasonable for Postgres to assume 0 correlation than any
> other value. Perhaps Postgres should calculate these cases assuming some
> arbitrary level of correlation.

[ shrug... ] Sure, if you want to do the legwork to develop something

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes:
> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows

Re: [HACKERS] Group-count estimation statistics

2005-01-28 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

That'd be nice, but sounds like a lot of work.

> As a short-term hack, I am thinking that the "clamp to size of table"
> part of the

[HACKERS] Group-count estimation statistics

2005-01-28 Thread Tom Lane
I got a complaint from a fellow Red Hatter that PG 8.0 is way slower than
7.4 on some statistical analysis tasks he was doing. Investigation showed
that the primary problem was selection of Sort/GroupAgg in place of
HashAgg to compute some grouped aggregates. Essentially he was doing sel