Re: [PERFORM] Sequential scan on FK join

2005-10-17 Thread Richard Huxton

Martin Nickel wrote:

EXPLAIN SELECT m.mailcode, l.lead_id
  FROM mailing m 
 INNER JOIN lead l ON m.mailing_id = l.mailing_id 
 WHERE (m.maildate >= '2005-7-01'::date 
 AND m.maildate < '2005-8-01'::date) 


Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20)
  Hash Cond: ("outer".mailing_id = "inner".mailing_id)
  ->  Seq Scan on lead l  (cost=0.00..1804198.60 rows=34065260 width=8)
  ->  Hash  (cost=61.22..61.22 rows=362 width=20)
->  Index Scan using mailing_maildate_idx on mailing m  
(cost=0.00..61.22 rows=362 width=20)
  Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < 
'2005-08-01'::date))


Well, the reason *why* is that the planner expects 2.71 million rows to 
be matched. If that were the case, then a seq-scan of 34 million rows 
might well make sense. The output from EXPLAIN ANALYSE would show us 
whether that estimate is correct - is it?
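
That is, the same query again with ANALYZE added, something along these lines:

  EXPLAIN ANALYZE
  SELECT m.mailcode, l.lead_id
    FROM mailing m
   INNER JOIN lead l ON m.mailing_id = l.mailing_id
   WHERE m.maildate >= '2005-07-01'::date
     AND m.maildate < '2005-08-01'::date;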


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Bytea poor performance

2005-10-17 Thread Andreas Pflug

NSO wrote:


Well, no. Delphi isn't any better; it takes the same time just to download
the data... But as I said before, if, for example, pgAdminIII is running on
the server machine it is a lot faster. I do not know why. I was monitoring
the network connection between client and server and it is using only up to
2% of full speed. Is it that the server can't send faster, or that the
client isn't accepting the data faster?
 



Only the first number is relevant and subject to network/db/server 
issues. The second is GUI only.
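
If you want to take the GUI out of the picture entirely, you can time the
same query from psql - say, 50 rows from the files table (the column name
below is only a guess):

  \timing
  SELECT data FROM files LIMIT 50;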


Regards,
Andreas


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Sequential scan on FK join

2005-10-17 Thread Martin Nickel
Subject:  Re: Sequential scan on FK join
From: Martin Nickel <[EMAIL PROTECTED]>
Newsgroups:   pgsql.performance
Date: Wed, 12 Oct 2005 15:53:35 -0500

Richard, here's the EXPLAIN ANALYZE.  I see your point re: the 2.7M
expected vs the 2 actual, but I've run ANALYZE on the lead table and it
hasn't changed the plan.  Suggestions?

Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20) (actual time=40.659..244709.315 rows=2  125270 loops=1)
  Hash Cond: ("outer".mailing_id = "inner".mailing_id)
  ->  Seq Scan on lead l  (cost=0.00..1804198.60 rows=34065260 width=8) (actual time=8.621..180281.094 rows=34060373 loops=1)
  ->  Hash  (cost=61.22..61.22 rows=362 width=20) (actual time=28.718..28.718 rows=0 loops=1)
        ->  Index Scan using mailing_maildate_idx on mailing m  (cost=0.00..61.22 rows=362 width=20) (actual time=16.571..27.793 rows=430 loops=1)
              Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date))
Total runtime: 248104.339 ms



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[PERFORM] tsearch2/GIST performance factors?

2005-10-17 Thread Craig A. James

We are indexing about 5 million small documents using tsearch2/GIST.  Each "document" 
contains 2 to 50 words.  This is a "write once, read many" situation.  Write performance 
is unimportant, and the database contents are static.  (We build it offline.)

We're having problems with inconsistent performance, and it's very hard to 
separate the effects of various factors.  Here are the things we think may be 
relevant.

1. Total number of words

Our documents currently contain about 110,000 unique words.  Oleg wrote: "[The limit 
is] 100K, but it's very fuzzy limit." By trial and error, we've learned that 50,000 
works well, and 150,000 works poorly, so Oleg's comment appears to be a good 
rule-of-thumb.  (With SIGLENINT enlarged, see below.)  But there may be other factors 
that affect this conclusion (such as shared memory, total memory, etc.).


2. Total size of the table

5 million documents is not a very big database (each document is a few to a few 
hundred bytes), so we don't think this is relevant.


3. Number of documents per word

There seems to be a VERY strong effect related to "common" words.  When a word 
occurs in more than about 1% of the documents (say 50,000 to 150,000 documents), 
performance goes WAY down.  Not just for that specific query, but it screws up 
tsearch2/GIST completely.

We have a test of 100 queries that return 382,000 documents total.  The first 
time we run it, it's slow, about 20 minutes (as expected).  The second time we 
run it, it's very fast, about 72 seconds -- very fast!!  As long as we avoid 
queries with common words, performance is very good.

But, if we run just one query that contains a common word (a word that's in 
more than about 2% of the documents, roughly 150,000 documents), then the next 
time we run the 100 test queries, it will take 20 minutes again.

We can't simply eliminate these common words.  First of all, they can be very significant.  Second, 
it doesn't seem like 2% is "common".  I can understand that a word like "the", 
which occurs in most documents, shouldn't be indexed.  But a word that occurs in 2% of the database 
seems like a very good word to index, yet it causes us great problems.
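
One way to check how common a given word actually is, is tsearch2's stat()
function (a sketch; "docs" below is a stand-in for our real table name):

  SELECT word, ndoc
    FROM stat('SELECT dockeys FROM docs')
   ORDER BY ndoc DESC
   LIMIT 20;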

I've read a bit about tsearchd, and wonder if it would solve our problem.  For 
our application, consistent performance is VERY important.  If we could lock 
the GIST index into memory, I think it would fix our problem.

I tried copying the GIST indexes (which are in a separate tablespace) to a 1 GB 
RAM disk, and it made the initial query faster, but overall performance seemed 
worse, probably because the RAM disk was using memory that could have been used 
by the file-system cache.


4. Available RAM and Disk drives

Would more RAM help?  How would we tell Postgres to use it effectively?  The 
GIST indexes are currently about 2.6 GB on the disk.

Would more disks help?  I know they would make it faster -- the 20-minute 
initial query would be reduced with a RAID array, etc.  But I'm not concerned 
about the 20-minute initial query, I'm concerned about keeping the system in 
that super-fast state where the GIST indexes are all in memory.
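
A way to check the on-disk index sizes from inside Postgres (a sketch; it
assumes the default 8 KB block size and just lists the largest indexes):

  SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
    FROM pg_class
   WHERE relkind = 'i'
   ORDER BY relpages DESC
   LIMIT 10;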


Hardware:
Dual-CPU Xeon Dell server with 4 GB memory and a single SATA 7200 RPM 150GB 
disk.

tsearch2/gistidx.h
modified as: #define SIGLENINT  120

System configuration:
echo 2147483648 >/proc/sys/kernel/shmmax
echo 4096  >/proc/sys/kernel/shmmni
echo 2097152   >/proc/sys/kernel/shmall

Postgres Configuration:
shared_buffers = 2  
work_mem = 32768
effective_cache_size = 30

I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are 
interacting in ways that I can't predict or analyze.  Thanks very much for any 
comments and advice.

Craig





---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Sequential scan on FK join

2005-10-17 Thread Richard Huxton

Martin Nickel wrote:

Subject:  Re: Sequential scan on FK join
From: Martin Nickel <[EMAIL PROTECTED]>
Newsgroups:   pgsql.performance
Date: Wed, 12 Oct 2005 15:53:35 -0500

Richard, here's the EXPLAIN ANALYZE.  I see your point re: the 2.7M
expected vs the 2 actual, but I've run ANALYZE on the lead table and it
hasn't changed the plan.  Suggestions?

Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20) 
(actual time=40.659..244709.315 rows=2 125270 loops=1)

   ^^^
Hmm - is that not just a formatting gap there? Is it not 2,125,270 rows 
matching, which would suggest PG is getting it more right than wrong?


Try issuing "SET enable_seqscan=false" before running the explain 
analyse - that will force the planner to use any indexes it can find and 
should show us whether the index would help.
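
In other words, something along these lines:

  SET enable_seqscan = false;
  EXPLAIN ANALYZE
  SELECT m.mailcode, l.lead_id
    FROM mailing m
   INNER JOIN lead l ON m.mailing_id = l.mailing_id
   WHERE m.maildate >= '2005-07-01'::date
     AND m.maildate < '2005-08-01'::date;
  SET enable_seqscan = true;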

--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] tsearch2/GIST performance factors?

2005-10-17 Thread Oleg Bartunov

On Sat, 15 Oct 2005, Craig A. James wrote:

We are indexing about 5 million small documents using tsearch2/GIST.  Each 
"document" contains 2 to 50 words.  This is a "write once, read many" 
situation.  Write performance is unimportant, and the database contents are 
static.  (We build it offline.)


We're having problems with inconsistent performance, and it's very hard to 
separate the effects of various factors.  Here are the things we think may be 
relevant.


1. Total number of words

Our documents currently contain about 110,000 unique words.  Oleg wrote: 
"[The limit is] 100K, but it's very fuzzy limit." By trial and error, we've 
learned that 50,000 works well, and 150,000 works poorly, so Oleg's comment 
appears to be a good rule-of-thumb.  (With SIGLENINT enlarged, see below.) 
But there may be other factors that affect this conclusion (such as shared 
memory, total memory, etc.).




Did you consider *decreasing* SIGLENINT? The size of the index will diminish
and performance could increase. In my current project I use SIGLENINT=15.



2. Total size of the table

5 million documents is not a very big database (each document is a few to a 
few hundred bytes), so we don't think this is relevant.



3. Number of documents per word

There seems to be a VERY strong effect related to "common" words.  When a 
word occurs in more than about 1% of the documents (say 50,000 to 150,000 
documents), performance goes WAY down.  Not just for that specific query, but 
it screws up tsearch2/GIST completely.


We have a test of 100 queries that return 382,000 documents total.  The first 
time we run it, it's slow, about 20 minutes (as expected).  The second time 
we run it, it's very fast, about 72 seconds -- very fast!!  As long as we 
avoid queries with common words, performance is very good.


But, if we run just one query that contains a common word (a word that's in 
more than about 2% of the documents, roughly 150,000 documents), then the 
next time we run the 100 test queries, it will take 20 minutes again.




We can't simply eliminate these common words.  First of all, they can be very 
significant.  Second, it doesn't seem like 2% is "common".  I can understand 
that a word like "the", which occurs in most documents, shouldn't be indexed. 
But a word that occurs in 2% of the database seems like a very good word to 
index, yet it causes us great problems.




tsearch2's index is a lossy index (read 
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals), 
so search results should be rechecked!


I've read a bit about tsearchd, and wonder if it would solve our problem. 
For our application, consistent performance is VERY important.  If we could 
lock the GIST index into memory, I think it would fix our problem.


I think so; tsearchd was designed with static contents in mind, and its
index doesn't require rechecking!



I tried copying the GIST indexes (which are in a separate tablespace) to a 1 
GB RAM disk, and it made the initial query faster, but overall performance 
seemed worse, probably because the RAM disk was using memory that could have 
been used by the file-system cache.



4. Available RAM and Disk drives

Would more RAM help?  How would we tell Postgres to use it effectively?  The 
GIST indexes are currently about 2.6 GB on the disk.


try to decrease signature size, say, 
#define SIGLENINT  15



I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are 
interacting in ways that I can't predict or analyze.  Thanks very much for 
any comments and advice.


We have our TODO at http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorships for the fts project for the 8.2 release.
Unfortunately, I haven't found spare time to package tsearchd for you;
it should certainly help you.


Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Sequential scan on FK join

2005-10-17 Thread Martin Nickel
When I turn off seqscan it does use the index - and it runs 20 to 30%
longer.  Based on that, the planner is correctly choosing a sequential
scan - but that's just hard for me to comprehend.  I'm joining on an int4
key, 2048 per index page - I guess that's a lot of index reads - then the
data-page reads.  Still, the 8-minute query time seems excessive.

On Mon, 17 Oct 2005 18:45:38 +0100, Richard Huxton wrote:

> Martin Nickel wrote:
>> Subject:  Re: Sequential scan on FK join
>> From: Martin Nickel <[EMAIL PROTECTED]>
>> Newsgroups:   pgsql.performance
>> Date: Wed, 12 Oct 2005 15:53:35 -0500
>> 
>> Richard, here's the EXPLAIN ANALYZE.  I see your point re: the 2.7M
>> expected vs the 2 actual, but I've run ANALYZE on the lead table and it
>> hasn't changed the plan.  Suggestions?
>> 
>> Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20) (actual
>> time=40.659..244709.315 rows=2 125270 loops=1)
> ^^^
> Hmm - is that not just a formatting gap there? Is it not 2,125,270 rows
> matching, which would suggest PG is getting it more right than wrong?
> 
> Try issuing "SET enable_seqscan=false" before running the explain analyse
> - that will force the planner to use any indexes it can find and should
> show us whether the index would help. --
>Richard Huxton
>Archonet Ltd
> 
> ---(end of broadcast)---
> TIP 4: Have you searched our list archives?
> 
>http://archives.postgresql.org


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Bytea poor performance

2005-10-17 Thread Mark Kirkwood

NSO wrote:


Well, no. Delphi isn't any better; it takes the same time just to download
the data... But as I said before, if, for example, pgAdminIII is running on
the server machine it is a lot faster. I do not know why. I was monitoring
the network connection between client and server and it is using only up to
2% of full speed. Is it that the server can't send faster, or that the
client isn't accepting the data faster?

 


That difference is suspiciously high - you need to get one of your 
network boys to check that the NIC in your client box is operating at 
full speed (and/or does not clash with whatever network device it is 
plugged into). The other thing to check is that your client box is 
reasonably spec'ed: e.g. not running out of RAM or disk in particular, 
or suffering from a massively fragmented disk (the latter if it's win32).


With respect to the Delphi, you can probably narrow where it has issues 
by running test versions of your app that have bits of functionality 
removed:


- retrieves the bytea but does not display it
- retrieves the bytea but displays it unformatted, or truncated
- does not retrieve the bytea at all

The difference between these should tell you where your issue is!
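
You can make a similar split on the SQL side, e.g. fetching only the size
of the bytea rather than its contents (the column name "data" is just a
guess here):

  -- transfers the full bytea contents
  SELECT data FROM files LIMIT 50;
  -- transfers only the size, not the contents
  SELECT length(data) FROM files LIMIT 50;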

By way of comparison, I have a PHP page (no Delphi, sorry) that 
essentially shows 50 rows from your files table over a 100Mbit network. 
Some experiments with that show:


- takes 2 seconds to display in Firefox
- takes 0.2 seconds to complete a request (i.e. "display") using httperf

This indicates that (in my case) most of the 2 seconds is being used by 
Firefox, which is not very good at formatting the wide output for display.


The figure of about 2-5 seconds seems about right, so your 20-30 seconds 
certainly seems high!



cheers

Mark

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] tsearch2/GIST performance factors?

2005-10-17 Thread Craig A. James

Oleg wrote:

Did you consider *decreasing* SIGLENINT? The size of the index will diminish
and performance could increase. In my current project I use SIGLENINT=15.


The default value for SIGLENINT actually didn't work at all.  It was only by 
increasing it that I got any performance at all.  An examination of the GIST 
indexes showed that most of the first level and many of the second level 
bitmaps were saturated.

tsearch2's index is a lossy index, read 
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

so search results  should be rechecked !


Yes, thanks.  We do indeed recheck the actual results.  The tests I'm running are just on 
the raw index performance - how long does it take to "select ... where dockeys @@ 
to_tsquery(...)".
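
i.e. roughly something like this (the word and the table name "docs" are
just placeholders):

  \timing
  SELECT * FROM docs WHERE dockeys @@ to_tsquery('someword');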


We have our TODO at http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorships for the fts project for the 8.2 release.
Unfortunately, I haven't found spare time to package tsearchd for you;
it should certainly help you.


At this point we may not have time to try tsearchd, and unfortunately we're not 
in a position to sponsor anything yet.

My original question is still bothering me.  Is it normal for a keyword that 
occurs in more than about 2% of the documents to cause such inconsistent 
performance?  Is there any single thing I might look at that would help improve 
performance (like, do I need more memory?  More shared memory?  Different 
config parameters?)

Thanks,
Craig

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] tsearch2/GIST performance factors?

2005-10-17 Thread Oleg Bartunov

Craig,

could you prepare an excerpt from your db (if possible), so I could
play with it myself?

Oleg
On Mon, 17 Oct 2005, Craig A. James wrote:


Oleg wrote:

Did you consider *decreasing* SIGLENINT? The size of the index will diminish
and performance could increase. In my current project I use SIGLENINT=15.


The default value for SIGLENINT actually didn't work at all.  It was only by 
increasing it that I got any performance at all.  An examination of the GIST 
indexes showed that most of the first level and many of the second level 
bitmaps were saturated.


tsearch2's index is a lossy index, read 
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

so search results  should be rechecked !


Yes, thanks.  We do indeed recheck the actual results.  The tests I'm running 
are just on the raw index performance - how long does it take to "select ... 
where dockeys @@ to_tsquery(...)".



We have our TODO at http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorships for the fts project for the 8.2 release.
Unfortunately, I haven't found spare time to package tsearchd for you;
it should certainly help you.


At this point we may not have time to try tsearchd, and unfortunately we're 
not in a position to sponsor anything yet.


My original question is still bothering me.  Is it normal for a keyword that 
occurs in more than about 2% of the documents to cause such inconsistent 
performance?  Is there any single thing I might look at that would help 
improve performance (like, do I need more memory?  More shared memory? 
Different config parameters?)


Thanks,
Craig



Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend