[PERFORM] Slow performance with no apparent reason

2003-10-26 Thread Yonatan Goraly




I am in the process of  adding PostgreSQL support for an application,
in addition to Oracle and MS SQL.
I am using PostgreSQL version 7.3.2, Red Hat 9.0 on Intel Pentium III
board.

I have a query that generally looks like this:

SELECT t1.col1, t2.col1 FROM t1, t2 WHERE t1.x=t2.y AND t2.p='string' AND t2.q=1

This query is strikingly slow (about 100 sec when both t1 and t2 have
about 1,200 records, compared with less than 4 sec with MS SQL and
Oracle).

The strange thing is that if I remove either of the last 2 conditions
(it doesn't matter which one), I get the same performance as with the
other databases.
Since in this particular case both conditions (t2.p='string', t2.q=1)
are not required, I can't understand why having both makes the query so
slow.
A query on table t2 alone is fast with or without the 2 conditions.

I tried several alternatives; this one works pretty well:

SELECT t1.col1, t2.col1 FROM t1, t2 WHERE t1.x=t2.y AND
 EXISTS ( 
             SELECT * FROM t2 t2a WHERE t2a.p='string' AND t2a.q=1 AND t2a.y=t2.y )

Since the first query is simpler than the second, it seems to me like a
bug.

Please advise

Yonatan




Re: [PERFORM] Slow performance with no apparent reason

2003-10-26 Thread Yonatan Goraly
I ran ANALYZE and the problem is resolved.
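
For the record, the fix amounted to refreshing the planner statistics on the
two tables from the query, roughly (a minimal sketch, using the table names
from the original post):

ANALYZE t1;
ANALYZE t2;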

Thanks

> Yonatan Goraly wrote on Sunday, 26.10.2003 at 00:25:
> > I am in the process of adding PostgreSQL support for an application,
> > in addition to Oracle and MS SQL.
> > I am using PostgreSQL version 7.3.2, Red Hat 9.0 on Intel Pentium III
> > board.
> > I have a query that generally looks like this:
> >
> > SELECT t1.col1, t2.col1 FROM t1, t2 WHERE t1.x=t2.y AND t2.p='string'
> > AND t2.q=1
> >
> > This query is strikingly slow (about 100 sec when both t1 and t2 have
> > about 1,200 records, compared with less than 4 sec with MS SQL and
> > Oracle)
>
> always send results of EXPLAIN ANALYZE if you ask for help on [PERFORM]
>
> knowing which indexes you have would also help.
>
> and you should have run ANALYZE too.
>
> -
> Hannu





[PERFORM] Various performance questions

2003-10-26 Thread Dror Matalon
Hi,

We're in the process of setting up a new database server. The
application is an online rss aggregator which you can see at
www.fastbuzz.com (still running with the old hardware).

The new machine is a dual Xeon with 2 Gigs of ram 

The OS is freebsd 4.9. 

shared_buffers = 1
sort_mem = 32768
effective_cache_size = 25520  -- freebsd formula: vfs.hibufspace / 8192

1. While it seems to work correctly, I'm unclear on why this number is
correct.  25520*8 = 204160 or 200 Megs. On a machine with 2 Gigs it
seems like the number should be more like 1 - 1.5 Gigs.
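
For what it's worth, the arithmetic can be checked directly. A rough sketch,
assuming effective_cache_size is measured in the default 8 KB pages:

SHOW effective_cache_size;            -- 25520, in 8 KB pages
SELECT 25520 * 8192;                  -- 209059840 bytes, roughly 200 MB
SELECT 1536 * 1024 * 1024 / 8192;     -- ~1.5 GB expressed in pages: 196608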

2. The main performance challenges are with the items table, which has around
five million rows and grows at a rate of more than 100,000 rows a day.

If I do a select count(*) from the items table it takes 55 - 60 seconds
to execute. I find it interesting that it takes that long whether it is
run for the first time, fetching the pages from disk, or on subsequent
requests, where it fetches the pages from memory.
I know that it's not touching the disks because I'm running iostat in
a different window. Here's the explain analyze:

explain analyze select count(*) from items;
QUERY PLAN
--
 Aggregate  (cost=245377.53..245377.53 rows=1 width=0) (actual 
time=55246.035..55246.040 rows=1 loops=1)
   ->  Seq Scan on items  (cost=0.00..233100.62 rows=4910762 width=0)
(actual time=0.054..30220.641 rows=4910762 loops=1)
 Total runtime: 55246.129 ms
(3 rows)

and the number of pages:

select relpages from pg_class where relname = 'items';
 relpages
--
   183993


So does it make sense that it would take close to a minute to count the 5 million rows
even if all pages are in memory? 

3. Relpages is 183993, so the file size should be 183993*8192 = 1507270656,
roughly 1.5 GB. The actual disk size is 1073741824, or roughly 1 GB.
Why the difference?
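
One way to cross-check those two figures (a sketch; relpages is only refreshed
by VACUUM/ANALYZE, so it can lag behind the file on disk):

VACUUM ANALYZE items;
SELECT relpages, reltuples FROM pg_class WHERE relname = 'items';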



4. If I put a certain filter/condition on the query it tells me that it's doing
a sequential scan, and yet it takes less time than a full sequential
scan:

explain analyze select count(*) from items where channel < 5000;
QUERY PLAN
--
 Aggregate  (cost=249141.54..249141.54 rows=1 width=0) (actual 
time=26224.603..26224.608 rows=1 loops=1)
   ->  Seq Scan on items  (cost=0.00..245377.52 rows=1505605 width=0) (actual 
time=7.599..17686.869 rows=1632057 loops=1)
 Filter: (channel < 5000)
 Total runtime: 26224.703 ms


How can it do a sequential scan and apply a filter to it in less time
than the full sequential scan? Is it actually using an index without
really telling me? 


Here's the structure of the items table

    Column     |           Type           | Modifiers
---------------+--------------------------+-----------
 articlenumber | integer                  | not null
 channel       | integer                  | not null
 title         | character varying        |
 link          | character varying        |
 description   | character varying        |
 comments      | character varying(500)   |
 dtstamp       | timestamp with time zone |
 signature     | character varying(32)    |
 pubdate       | timestamp with time zone |
Indexes:
"item_channel_link" btree (channel, link)
"item_created" btree (dtstamp)
"item_signature" btree (signature)
"items_channel_article" btree (channel, articlenumber)
"items_channel_tstamp" btree (channel, dtstamp)


5. Any other comments/suggestions on the above setup?

Thanks,

Dror

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.zapatec.com



Re: [PERFORM] Various performance questions

2003-10-26 Thread Greg Stark
Dror Matalon <[EMAIL PROTECTED]> writes:

> explain analyze select count(*) from items where channel < 5000;
> QUERY PLAN
> --
>  Aggregate  (cost=249141.54..249141.54 rows=1 width=0) (actual 
> time=26224.603..26224.608 rows=1 loops=1)
>->  Seq Scan on items  (cost=0.00..245377.52 rows=1505605 width=0) (actual 
> time=7.599..17686.869 rows=1632057 loops=1)
>  Filter: (channel < 5000)
>  Total runtime: 26224.703 ms
> 
> 
> How can it do a sequential scan and apply a filter to it in less time
> than the full sequential scan? Is it actually using an index without
> really telling me? 

It's not using the index and not telling you. 

It's possible the count(*) operator itself is taking some time. Postgres
doesn't have to call it on the rows that don't match the where clause. How
long does "explain analyze select 1 from items" with and without the where
clause take?
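
Concretely, the comparison would be something like this (a sketch, using the
filter from the plan above):

EXPLAIN ANALYZE SELECT 1 FROM items;
EXPLAIN ANALYZE SELECT 1 FROM items WHERE channel < 5000;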

What version of postgres is this? In 7.4 (and maybe 7.3?) count() uses an
int8 to store its count so it's not limited to 4 billion records.
Unfortunately int8 is somewhat inefficient as it has to be dynamically
allocated repeatedly. It's possible it's making a noticeable difference,
especially with all the pages in cache, though I'm a bit surprised. There's
some thought about optimizing this in 7.5.

-- 
greg




[PERFORM] Duplicate in pg_user table

2003-10-26 Thread CHEWTC
Hi

Currently we are running PostgreSQL v7.3.2 on Red Hat Linux 9.0. We have
Windows 2000 client machines inserting records into the PostgreSQL tables
via the PostgreSQL ODBC driver v7.3.0100.

After a few weeks of usage, when we do a \d at the SQL prompt, there is a
duplicate object name in the same schema, i.e. it can be a duplicate row for
an index or a table.

When we do a \d table_name, it shows duplicated column names
inside the table.

We discovered that the schema in the pg_user table was also duplicated,
which causes pg_dump to fail.
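
For anyone chasing the same symptom, a sketch of catalog queries that would
surface such duplicates (pg_user and pg_class are the system catalogs
involved; this is only a diagnostic idea, not a fix):

SELECT usename, count(*) FROM pg_user
 GROUP BY usename HAVING count(*) > 1;
SELECT relname, relnamespace, count(*) FROM pg_class
 GROUP BY relname, relnamespace HAVING count(*) > 1;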

Thank you,
Regards.







Re: [PERFORM] Various performance questions

2003-10-26 Thread Dror Matalon
On Sun, Oct 26, 2003 at 10:49:29PM -0500, Greg Stark wrote:
> Dror Matalon <[EMAIL PROTECTED]> writes:
> 
> > explain analyze select count(*) from items where channel < 5000;
> > QUERY PLAN
> > --
> >  Aggregate  (cost=249141.54..249141.54 rows=1 width=0) (actual 
> > time=26224.603..26224.608 rows=1 loops=1)
> >->  Seq Scan on items  (cost=0.00..245377.52 rows=1505605 width=0) (actual 
> > time=7.599..17686.869 rows=1632057 loops=1)
> >  Filter: (channel < 5000)
> >  Total runtime: 26224.703 ms
> > 
> > 
> > How can it do a sequential scan and apply a filter to it in less time
> > than the full sequential scan? Is it actually using an index without
> > really telling me? 
> 
> It's not using the index and not telling you. 
> 
> It's possible the count(*) operator itself is taking some time. Postgres

I find it hard to believe that the actual counting would take a
significant amount of time.

> doesn't have to call it on the rows that don't match the where clause. How
> long does "explain analyze select 1 from items" with and without the where
> clause take?

Same as count(*). Around 55 secs with no where clause, around 25 secs
with.

> 
> What version of postgres is this? In 7.4 (and maybe 7.3?) count() uses an

This is 7.4.

> int8 to store its count so it's not limited to 4 billion records.
> Unfortunately int8 is somewhat inefficient as it has to be dynamically
> allocated repeatedly. It's possible it's making a noticeable difference,
> especially with all the pages in cache, though I'm a bit surprised. There's
> some thought about optimizing this in 7.5.
> 
> -- 
> greg
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.zapatec.com



Re: [PERFORM] Various performance questions

2003-10-26 Thread Christopher Browne
[EMAIL PROTECTED] (Dror Matalon) wrote:
> On Sun, Oct 26, 2003 at 10:49:29PM -0500, Greg Stark wrote:
>> Dror Matalon <[EMAIL PROTECTED]> writes:
>> 
>> > explain analyze select count(*) from items where channel < 5000;
>> > QUERY PLAN
>> > --
>> >  Aggregate  (cost=249141.54..249141.54 rows=1 width=0) (actual 
>> > time=26224.603..26224.608 rows=1 loops=1)
>> >->  Seq Scan on items  (cost=0.00..245377.52 rows=1505605 width=0) (actual 
>> > time=7.599..17686.869 rows=1632057 loops=1)
>> >  Filter: (channel < 5000)
>> >  Total runtime: 26224.703 ms
>> > 
>> > 
>> > How can it do a sequential scan and apply a filter to it in less time
>> > than the full sequential scan? Is it actually using an index without
>> > really telling me? 
>> 
>> It's not using the index and not telling you. 
>> 
>> It's possible the count(*) operator itself is taking some time. Postgres
>
> I find it hard to believe that the actual counting would take a
> significant amount of time.

Most of the time involves:

 a) Reading each page of the table, and
 b) Figuring out which records on those pages are still "live."

What work were you thinking was involved in doing the counting?

>> doesn't have to call it on the rows that don't match the where clause. How
>> long does "explain analyze select 1 from items" with and without the where
>> clause take?
>
> Same as count(*). Around 55 secs with no where clause, around 25 secs
> with.

Good; at least that's consistent...
-- 
(format nil "[EMAIL PROTECTED]" "cbbrowne" "acm.org")
http://www3.sympatico.ca/cbbrowne/postgresql.html
Signs of a Klingon  Programmer #2: "You  question the worthiness of my
code? I should kill you where you stand!"



Re: [PERFORM] slow select

2003-10-26 Thread Shridhar Daithankar
Vivek Khera wrote:
"JB" == Josh Berkus <[EMAIL PROTECTED]> writes:
JB> Actually, what OS's can't use all idle ram for kernel cache?  I
JB> should note that in my performance docs 
FreeBSD.  Limited by the value of "sysctl vfs.hibufspace" from what I
understand.  This value is set at boot based on available RAM and some
other tuning parameters.
Actually, I have wanted to ask this question for a long time. Can we have
guidelines about how to set effective_cache_size for various OSs?

Linux is pretty simple: everything free is buffer cache. FreeBSD is not so
straightforward, but there is a sysctl...

How about HP-UX, Solaris and AIX? Other BSDs? And, most importantly, Windows?

That could add much value to the tuning guide, couldn't it?

 Shridhar



Re: [PERFORM] Slow performance with no apparent reason

2003-10-26 Thread Hannu Krosing
Yonatan Goraly wrote on Sunday, 26.10.2003 at 00:25:
> I am in the process of  adding PostgreSQL support for an application,
> in addition to Oracle and MS SQL.
> I am using PostgreSQL version 7.3.2, Red Hat 9.0 on Intel Pentium III
> board.
> 
> I have a query that generally looks like this:
> 
> SELECT t1.col1, t2.col1 FROM t1, t2 WHERE t1.x=t2.y AND t2.p='string'
> AND t2.q=1
> 
> This query is strikingly slow (about 100 sec when both t1 and t2 have
> about 1,200 records, compared with less than 4 sec with MS SQL and
> Oracle)

always send results of EXPLAIN ANALYZE if you ask for help on [PERFORM] 

knowing which indexes you have would also help.

and you should have run ANALYZE too.
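
Concretely, that request amounts to something like the following (a sketch;
the query is the one quoted above, and \d is the psql command that shows a
table's columns and indexes):

EXPLAIN ANALYZE
SELECT t1.col1, t2.col1 FROM t1, t2
 WHERE t1.x=t2.y AND t2.p='string' AND t2.q=1;

\d t1
\d t2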

-
Hannu




[PERFORM] explicit casting required for index use

2003-10-26 Thread Reece Hart




Here's the basic issue: PostgreSQL doesn't use indexes unless a query
criterion is of exactly the same type as the index type. This occurs even when
a cast would enable the use of an index and greatly improve performance. I
understand that casting is needed to use an index and will therefore affect
performance -- the part I don't understand is why PostgreSQL doesn't
automatically cast query arguments to the column type, thereby enabling
indexes on that column.


I have a table that looks like this (extra cols, indexes, and fk constraints removed):

[EMAIL PROTECTED] \d paprospect2
    Table "unison.paprospect2"
   Column   |  Type   | Modifiers
------------+---------+-----------
 pseq_id    | integer | not null
 run_id     | integer | not null
 pmodel_id  | integer | not null
 svm        | real    |
Indexes: paprospect2_search1 btree (pmodel_id, run_id, svm),


I often search for pseq_ids based on all of pmodel_id, run_id, and svm threshold as below, hence the multi-column index.

Without an explicit cast of the svm criterion:

[EMAIL PROTECTED] explain select pseq_id from paprospect2 where pmodel_id=8210 and run_id=1 and svm>=11;
 Index Scan using paprospect2_search2 on paprospect2  (cost=0.00..43268.93 rows=2 width=4)
   Index Cond: ((pmodel_id = 8210) AND (run_id = 1))
   Filter: (svm >= 11::double precision)

And with an explicit cast to real (the same as the column type and indexed type):

[EMAIL PROTECTED] explain select pseq_id from paprospect2 where pmodel_id=8210 and run_id=1 and svm>=11::real;
 Index Scan using paprospect2_search1 on paprospect2  (cost=0.00..6.34 rows=2 width=4)
   Index Cond: ((pmodel_id = 8210) AND (run_id = 1) AND (svm >= 11::real))


Note two things above: 1) The explicit cast greatly reduces the predicted (and actual) cost. 2) The uncasted query eventually casts svm to double precision, which seems odd since the column itself is real (that is, it eventually does cast, but to the "wrong" type).

For small queries (returning ~10 rows), this is worth 100x in speed (9ms v. 990ms... in absolute terms, no big deal). For larger result sets (~200 rows), I've seen more like 1000x speed increases by using an explicit cast. For the larger queries, this can mean seconds versus many minutes.

Having to explicitly cast criteria is very non-intuitive. Moreover, it seems
quite straightforward that PostgreSQL might incorporate casts (and perhaps
even function calls like upper() for functional indexes) into its query
strategy optimization. (I suppose functional indexes would apply only to
immutable functions, but that's fine.)
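
As an illustration of the functional-index case mentioned above, a
hypothetical example (table t and column name are made up; the index is only
considered when the query repeats the same expression):

CREATE INDEX t_upper_name ON t (upper(name));
SELECT * FROM t WHERE upper(name) = 'FOO';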

Thanks,
Reece




-- 
Reece Hart, Ph.D.   [EMAIL PROTECTED], http://www.gene.com/
Genentech, Inc. 650/225-6133 (voice), -5389 (fax)
Bioinformatics and Protein Engineering
1 DNA Way, MS-93http://www.in-machina.com/~reece/
South San Francisco, CA  94080-4990 [EMAIL PROTECTED], GPG: 0x25EC91A0