[PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Joost Kraaijeveld
Hi All,

I have a table with 665605 rows (counted, vacuum-ed):
CREATE TABLE unique_words
( filename text NOT NULL,
  filetype text NOT NULL,
  word text NOT NULL,
  count integer,)

The query is:
select f.word , count(f.word) from 
unique_words as f, 
unique_words as s ,
unique_words as n 
where
(f.word = s.word and s.word = n.word)
and
(f.filetype = 'f' and s.filetype = 's' and n.filetype = 'n')
group by f.word

Explain says:
"GroupAggregate  (cost=0.00..67237557.88 rows=1397 width=6)"
"  ->  Nested Loop  (cost=0.00..27856790.31 rows=7876150720 width=6)"
"->  Nested Loop  (cost=0.00..118722.04 rows=14770776 width=12)"
"  ->  Index Scan using idx_unique_words_filetype_word on 
unique_words f  (cost=0.00..19541.47 rows=92098 width=6)"
"Index Cond: (filetype = 'f'::text)"
"  ->  Index Scan using idx_unique_words_filetype_word on 
unique_words s  (cost=0.00..0.91 rows=13 width=6)"
"Index Cond: ((filetype = 's'::text) AND (word = f.word))"
"->  Index Scan using idx_unique_words_filetype_word on unique_words n  
(cost=0.00..1.33 rows=44 width=6)"
"  Index Cond: ((filetype = 'n'::text) AND (word = f.word))"


The right answer should be 3808 different words (according to a Java
program I wrote).

This query takes more than 1 hour (after which I cancelled the query).
My questions are:
- Is this to be expected?
- Especially as the query over just 1 join takes 32 secs? (on f.word =
s.word omitting everything for n )
- Why does explain say it takes "7876150720 rows"? 
- Is there a way to rephrase the query that makes it faster?
- Could another table layout help (f,s,n are all possibilities for
filetype)?
- Anything else?

TIA

-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Joost Kraaijeveld
On Mon, 2011-11-28 at 17:42 +0100, Joost Kraaijeveld wrote:
> - Is there a way to rephrase the query that makes it faster?
This query goes faster (6224 ms, but I am not sure it gives the correct
answer as the result differs from my Java program):

select word, count (word) from unique_words 
where
word in (select word from unique_words where 
 word in ( select word from unique_words where filetype = 'f')
 and
 filetype = 's')
and
filetype = 'n'
group by word


-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Kevin Grittner
Joost Kraaijeveld  wrote:
 
> This query goes faster (6224 ms, but I am not sure it gives the
> correct answer as the result differs from my Java program):
 
It seems clear that you want to see words which appear with all
three types of files, but it's not clear what you want the count to
represent.  The number of times the word appears in filetype 'n'
references (as specified in your second query)?  The number of
permutations of documents which incorporate one 'f' document, one
's' document, and one 'n' document (as specified in your first
query).  Something else, like the total number of times the word
appears?
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Joost Kraaijeveld
On Mon, 2011-11-28 at 11:05 -0600, Kevin Grittner wrote:
> Joost Kraaijeveld  wrote:
>  
> > This query goes faster (6224 ms, but I am not sure it gives the
> > correct answer as the result differs from my Java program):
>  
> It seems clear that you want to see words which appear with all
> three types of files, but it's not clear what you want the count to
> represent.  The number of times the word appears in filetype 'n'
> references (as specified in your second query)?  The number of
> permutations of documents which incorporate one 'f' document, one
> 's' document, and one 'n' document (as specified in your first
> query).  Something else, like the total number of times the word
> appears?
I would like the answer to be "the number of times the word appears in
all three the queries", the intersection of the three queries. 

-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Kevin Grittner
Joost Kraaijeveld  wrote:
 
> I would like the answer to be "the number of times the word
> appears in all three the queries", the intersection of the three
> queries. 
 
That's still not entirely clear to me.  If there are two 'f' rows,
three 's' rows, and four 'n' rows, do you want to see an answer of 2
(which seems like the intersection you request here), 9 (which is
the sum), 24 (which is the product), or something else?
 
If you really want the intersection, perhaps:
 
with x as
  (
select
word,
count(*) as countall,
count(case when filetype = 'f' then 1 else null end)
  as countf,
count(case when filetype = 's' then 1 else null end) as
  as counts,
count(case when filetype = 'n' then 1 else null end) as
  as countn
  from unique_words
  )
select word, least(countf, counts, countn) from x
  where countf > 0 and counts > 0 and countn > 0
  order by word;
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Kevin Grittner
"Kevin Grittner"  wrote:
 
> If you really want the intersection, perhaps:
 
Or maybe closer:
 
with x as
  (
select
word,
count(*) as countall,
count(case when filetype = 'f' then 1 else null end)
  as countf,
count(case when filetype = 's' then 1 else null end)
  as counts,
count(case when filetype = 'n' then 1 else null end)
  as countn
  from unique_words
  group by word
  )
select word, least(countf, counts, countn) from x
  where countf > 0 and counts > 0 and countn > 0
  order by word;
 
Cranked out rather quickly and untested.
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL 9.1 : why is this query slow?

2011-11-28 Thread Tomas Vondra
On 28.11.2011 17:42, Joost Kraaijeveld wrote:
> - Why does explain say it takes "7876150720 rows"? 

Any idea where this number came from? No matter what I do, the nested
loop row estimates are alway very close to the product of the two
estimates (outer rows * inner rows).

Tomas

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Benchmarking tools, methods

2011-11-28 Thread CSS
On Nov 19, 2011, at 11:21 AM, Greg Smith wrote:

> On 11/18/2011 04:55 AM, CSS wrote:
>> I'm also curious about benchmarking using my own data.  I tried something 
>> long ago that at least gave the illusion of working, but didn't seem quite 
>> right to me.  I enabled basic query logging on one of our busier servers, 
>> dumped the db, and let it run for 24 hours.  That gave me the normal random 
>> data from users throughout the day as well as our batch jobs that run 
>> overnight.  I had to grep out and reformat the actual queries from the 
>> logfile, but that was not difficult.   I then loaded the dump into the test 
>> server and basically fed the saved queries into it and timed the result.  I 
>> also hacked together a script to sample cpu and disk stats every 2S and had 
>> that feeding into an rrd database so I could see how "busy" things were.
>> 
>> In theory, this sounded good (to me), but I'm not sure I trust the results.  
>> Any suggestions on the general concept?  Is it sound?  Is there a better way 
>> to do it?  I really like the idea of using (our) real data.
>>   
> 
> The thing that's hard to do here is replay the activity with the right 
> timing.  Some benchmarks, such as pgbench, will hit the database as fast as 
> it will process work.  That's not realistic.  You really need to consider 
> that real applications have pauses in them, and worry about that both in 
> playback speed and in results analysis.
> 
> See http://wiki.postgresql.org/wiki/Statement_Playback for some more info on 
> this.

Thanks so much for this, and thanks to Cédric for also pointing out Tsung 
specifically on that page.  I had no idea any of these tools existed.  I really 
like the idea of "application specific" testing, it makes total sense for the 
kind of things we're trying to measure.

I also wanted to thank everyone else that posted in this thread, all of this 
info is tremendously helpful.  This is a really excellent list, and I really 
appreciate all the people posting here that make their living doing paid 
consulting taking the time to monitor and post on this list.  Yet another way 
for me to validate choosing postgres over that "other" open source db.


>> ps - considering the new PostgreSQL Performance book that Packt has, any 
>> strong feelings about that one way or the other?  Does it go very far beyond 
>> what's on the wiki?
>>   
> 
> Pages 21 through 97 are about general benchmarking and hardware setup; 189 
> through 208 cover just pgbench.  There's almost no overlap between those 
> sections and the wiki, which is mainly focused on PostgreSQL usage issues.  
> Unless you're much smarter than me,  you can expect to spent months to years 
> reinventing wheels described there before reaching new ground in the areas it 
> covers.  From the questions you've been asking, you may not find as much 
> about ZFS tuning and SSDs as you'd like though.

We're grabbing a copy of it for the office.  Packt is running a sale, so we're 
also going to grab the "cookbook", it looks intriguing.

> http://www.2ndquadrant.com/en/talks/ has some updated material about things 
> discovered since the book was published.  The "Bottom-Up Database 
> Benchmarking" there shows the tests I'm running nowadays, which have evolved 
> a bit in the last year.

Looks like good stuff, thanks.

Charles

> -- 
> Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
> 
> 
> -- 
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance