Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-22 Thread Tomas Vondra
Hi, On 06/22/2015 07:47 AM, Jeff Janes wrote: On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra mailto:tomas.von...@2ndquadrant.com>> wrote: Hi Tomas, I've lobotomized the sampling a bit to really produce a random set of blocks first, and that produces way better estimates: stati

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-22 Thread Tomas Vondra
On 06/22/2015 07:21 AM, Jeff Janes wrote: On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra mailto:tomas.von...@2ndquadrant.com>> wrote: Hi, On 06/20/2015 03:01 AM, Jeff Janes wrote: I don't think we need to really assume the density to be constant, ma

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-21 Thread Jeff Janes
On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra wrote: Hi Tomas, I've lobotomized the sampling a bit to really produce a random set of > blocks first, and that produces way better estimates: > >statistics target estimate random >--

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-21 Thread Jeff Janes
On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra wrote: > Hi, > > On 06/20/2015 03:01 AM, Jeff Janes wrote: > >> >> >> I don't think we need to really assume the density to be constant, >> maybe we can verify that while collecting the sample? I mean, we're >> already reading the pages, s

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-21 Thread Jim Nasby
On 6/20/15 12:55 PM, Tomas Vondra wrote: Well, actually I think it would be even more appropriate for very large tables. With a 2.5TB table, you don't really care whether analyze collects 5GB or 8GB sample, the difference is rather minor compared to I/O generated by the other queries etc. The cur

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Tomas Vondra
Hi, On 06/20/2015 03:01 AM, Jeff Janes wrote: Hmmm, that's probably true. OTOH correlated columns are not all that uncommon (e.g. table storing time-series data etc.), and this blowup is quite bad ... True, but we don't know how big of a problem the density-skew problem might be (

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Tomas Vondra
Hi, On 06/20/2015 05:29 PM, Feng Tian wrote: I have not read Jeff's patch, but here is how I think hash agg should work, Hash agg scan lineitem table, perform aggregation in memory. Once workmem is exhausted, it write intermediate state to disk, bucket by bucket. When lineitem table is fini

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Alvaro Herrera
Feng Tian wrote: > I have not read Jeff's patch, but here is how I think hash agg should work, I think you should discuss that in Jeff's thread, not here. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sen

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Feng Tian
On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra wrote: > Hi, > > On 06/20/2015 08:54 AM, Feng Tian wrote: > >> >> While better sample/stats is important for choosing a good plan, in >> this query, hash agg is really the right plan. If a sort agg is >> chosen, the performance will be really really b

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Tomas Vondra
On 06/20/2015 04:17 PM, Robert Haas wrote: On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra wrote: I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in >> HashAgg (which somehow illustrates the importance of the >> me

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Tomas Vondra
Hi, On 06/20/2015 08:54 AM, Feng Tian wrote: While better sample/stats is important for choosing a good plan, in this query, hash agg is really the right plan. If a sort agg is chosen, the performance will be really really bad. The patch that Jeff is working on is critical for a decent TPCH num

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-20 Thread Robert Haas
On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra wrote: > I'm currently running some tests on a 3TB TPC-H data set, and I tripped over > a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow > illustrates the importance of the memory-bounded hashagg patch Jeff Davis is > workin

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-19 Thread Feng Tian
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra wrote: > Hi, > > I'm currently running some tests on a 3TB TPC-H data set, and I tripped > over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which > somehow illustrates the importance of the memory-bounded hashagg patch Jeff > Davis

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-19 Thread Jeff Janes
On Fri, Jun 19, 2015 at 1:39 PM, Tomas Vondra wrote: > On 06/19/2015 09:48 PM, Jeff Janes wrote: > >> On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra >> mailto:tomas.von...@2ndquadrant.com>> >> wrote: >> >> But I think you might be on to something, because I manually >> collected a random

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-19 Thread Tomas Vondra
On 06/19/2015 09:48 PM, Jeff Janes wrote: On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra mailto:tomas.von...@2ndquadrant.com>> wrote: But I think you might be on to something, because I manually collected a random sample with 30k rows (by explicitly generating 30k random TIDs), and I

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-19 Thread Jeff Janes
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra wrote: > But I think you might be on to something, because I manually collected a > random sample with 30k rows (by explicitly generating 30k random TIDs), and > I get this: > > tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt fr

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-19 Thread Tomas Vondra
On 06/19/2015 08:32 PM, Jeff Janes wrote: On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra mailto:tomas.von...@2ndquadrant.com>> wrote: Hi, I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in HashAgg

Re: [HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-19 Thread Jeff Janes
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra wrote: > Hi, > > I'm currently running some tests on a 3TB TPC-H data set, and I tripped > over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which > somehow illustrates the importance of the memory-bounded hashagg patch Jeff > Davis