[PERFORM] Index Backward Scan fast / Index Scan slow !

2004-04-12 Thread Pailloncy Jean-Gérard
Hi,

I have tested the following query many times.

dps=# explain analyze select next_index_time from url order by next_index_time desc limit 1;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.62 rows=1 width=4) (actual time=56.615..56.616 rows=1 loops=1)
   ->  Index Scan Backward using url_next_index_time on url  (cost=0.00..768529.55 rows=293588 width=4) (actual time=56.610..56.610 rows=1 loops=1)
 Total runtime: 56.669 ms
(3 rows)

dps=# explain analyze select next_index_time from url order by next_index_time asc limit 1;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.62 rows=1 width=4) (actual time=94879.636..94879.637 rows=1 loops=1)
   ->  Index Scan using url_next_index_time on url  (cost=0.00..768529.55 rows=293588 width=4) (actual time=94879.631..94879.631 rows=1 loops=1)
 Total runtime: 94879.688 ms
(3 rows)

How can I optimize the last query? (It is ~2000 times slower than the first one.)
I suppose there is some odd distribution of data in the index?
Is the solution to reindex the data?

Regards,
Jean-Gérard Pailloncy


[PERFORM] Effect of too many columns

2004-04-12 Thread Rajesh Kumar Mallah
Greetings,

Is there any performance penalty to having too many columns in
a table, in terms of read and write speeds?
In order to keep operational queries simple (avoid joins) we plan to
add columns to the main customer dimension table.
Adding more columns also means increased concurrency on the table,
as more and more applications will access the same table.
Any ideas whether it is better to split the table by application, or is it OK as is?
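
For concreteness, here is a minimal sketch of the two layouts we are weighing
(table and column names are made up for illustration, not our actual schema):

    -- Option A: one wide customer dimension, no joins at query time
    CREATE TABLE customer_dim (
        customer_id   integer PRIMARY KEY,
        name          text,
        billing_plan  text,   -- used mainly by the billing application
        support_tier  text    -- used mainly by the support application
        -- ... plus many more per-application columns ...
    );

    -- Option B: a narrow core table plus per-application tables, joined as needed
    CREATE TABLE customer_core (
        customer_id integer PRIMARY KEY,
        name        text
    );
    CREATE TABLE customer_billing (
        customer_id  integer PRIMARY KEY REFERENCES customer_core,
        billing_plan text
    );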



Regds
mallah.


Re: [PERFORM] Effect of too many columns

2004-04-12 Thread Bruno Wolff III
On Mon, Apr 12, 2004 at 17:24:17 +0530,
  Rajesh Kumar Mallah <[EMAIL PROTECTED]> wrote:
> 
> Greetings,
> 
> Is there any performance penalty of having too many columns in
> a table in terms of read and write speeds.
> 
> To order to keep operational queries simple (avoid joins) we plan to
> add columns in the main customer dimension table.
> 
> Adding more columns also means increase in concurrency in the table
> as more and more applications will access the same table.
> 
> Any ideas if its better to split the table application wise or is it ok?

This is normally a bad idea. If you properly implement constraints in
what is effectively a materialized view, you might end up with a slower
system, depending on your mix of queries. (Generally updating will take
more resources.) So you probably want to test your new design under a
simulated normal load to see if it actually speeds things up in your
case before making the change.



Re: [PERFORM] Index Backward Scan fast / Index Scan slow !

2004-04-12 Thread Tom Lane
Pailloncy Jean-Gérard <[EMAIL PROTECTED]> writes:
> How to optimize the last query ? (~ 2000 times slower than the first  
> one)
> I suppose there is some odd distribution of data in the index ?

Looks to me like a whole lot of dead rows at the left end of the index.
Have you VACUUMed this table lately?  It would be interesting to see
what VACUUM VERBOSE has to say about it.
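For the table shown in your plans, that would just be something like:

    dps=# VACUUM VERBOSE url;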

> Is the solution to reindex data ?

In 7.4 a VACUUM should be sufficient ... or at least, if it isn't
I'd like to know why not before you destroy the evidence by reindexing.

regards, tom lane



[PERFORM] Deleting certain duplicates

2004-04-12 Thread Shea,Dan [CIS]
We have a large database which recently increased dramatically in size due to a
change in our insert program that allowed all entries through.
PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where relname = 'forecastelement';
     relname     | relfilenode |  reltuples
-----------------+-------------+-------------
 forecastelement |   361747866 | 4.70567e+08

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       |
 origin         | character varying(10)       |
 timezone       | character varying(99)       |
 region_id      | character varying(20)       |
 wx_element     | character varying(99)       |
 value          | character varying(99)       |
 flag           | character(3)                |
 units          | character varying(99)       |
 valid_time     | timestamp without time zone |
 issue_time     | timestamp without time zone |
 next_forecast  | timestamp without time zone |
 reception_time | timestamp without time zone |

The program is supposed to check, using a select statement, that all fields but
reception_time are unique, and if so, insert the row.
Due to an error in a change, reception_time was included in the select that checks
for duplicates.  The reception_time is set by the program that creates the dat
file to insert.
Essentially this let all the duplicates be inserted.

I tried the delete query below.
PWFPM_DEV=# delete from forecastelement where oid not in (select min(oid)
from forecastelement group by
version,origin,timezone,region_id,wx_element,value,flag,units,valid_time,issue_time,next_forecast);
It ran for 3 days, creating what I assume is an index in pgsql_tmp for the
group by, and the query ended up failing with "dateERROR:write failed".
Well, the long weekend is over and we do not have the luxury of trying this
again.
So I was thinking of doing the deletion in chunks, perhaps based on
reception_time.
Are there any suggestions for a better way to do this, or for using multiple
queries to delete selectively, a week at a time, based on the reception_time?
I would say there are a lot of duplicate entries between mid-March and the
first week of April.
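
Something along these lines is what I had in mind (untested; the dates are just
an example).  It chunks on valid_time rather than reception_time, on the
assumption that valid_time is part of the grouping key, so a duplicate group
cannot straddle two chunks, whereas duplicates by definition have different
reception_times:

    delete from forecastelement
    where valid_time >= '2004-03-15' and valid_time < '2004-03-22'
      and oid not in (
            select min(oid)
            from forecastelement
            where valid_time >= '2004-03-15' and valid_time < '2004-03-22'
            group by version, origin, timezone, region_id, wx_element, value,
                     flag, units, valid_time, issue_time, next_forecast
      );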





Re: [PERFORM] Deleting certain duplicates

2004-04-12 Thread Shea,Dan [CIS]
The index is
Indexes:
"forecastelement_rwv_idx" btree (region_id, wx_element, valid_time)



[PERFORM] index v. seqscan for certain values

2004-04-12 Thread Jeremy Dunn
I've searched the archives and can't find an answer to this seemingly simple
question.  Apologies if it's too common.

The table in question has ~1.3M rows.  It has 85 columns, 5 of which have
single-column indexes.

The column in question (CID) has 183 distinct values.  For these values, the
largest has ~38,000 rows, and the smallest has 1 row.  About 30 values have
< 100 rows, and about 10 values have > 20,000 rows.

The database is 7.2.3 running on RedHat 7.1 (we are in the process of
upgrading to PG 7.4.2).  All of the query plan options are enabled, and the
cpu costs are set to the default values (cpu_tuple_cost is 0.01,
cpu_index_tuple_cost is 0.001).  The database is VACUUM'd every night.

The problem:
A simple query:
    select count(*) from xxx where CID=<value>
where <value> is a CID value which has relatively few rows, returns a plan
using the index on that column.

   explain analyze select count(*) from xxx where cid=869366;
   Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual time=78.49..78.49 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21 rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
   Total runtime: 78.69 msec

The same plan holds for values which have up to about 20,000 rows:

   explain analyze select count(*) from xxx where cid=6223341;
   Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual time=11614.89..11614.89 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26 rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
   Total runtime: 11615.05 msec

However for the values that have > 20,000 rows, the plan changes to a
sequential scan, which is proportionately much slower.

   explain analyze select count(*) from xxx where cid=7191032;
   Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual time=46427.81..46427.82 rows=1 loops=1)
     ->  Seq Scan on xxx  (cost=0.00..97230.62 rows=50792 width=0) (actual time=9104.45..46370.27 rows=37765 loops=1)
   Total runtime: 46428.00 msec

The question: why does the planner consider a sequential scan to be better
for these top 10 values?  In terms of elapsed time it is more than twice as
slow, proportionate to an index scan for the same number of rows.

What I tried:

A) alter table xxx alter column cid set statistics 500;
   analyze xxx;
   This does not affect the results.

B) dropped/rebuilt the index, with no improvement.

C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no success.

D) force an index scan for the larger values by using a very high value for
   cpu_tuple_cost (e.g. .5), but this doesn't seem like a wise thing to do.

Your thoughts appreciated in advance!

- Jeremy

7+ years experience in Oracle performance-tuning
relatively new to postgresql




Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Bill Moran
Quick bit of input, since you didn't mention it.

How often do you run ANALYZE?  I found it interesting that a database I
was doing tests on sped up by a factor of 20 after ANALYZE.  If your
data changes a lot, you should probably schedule ANALYZE to run with
VACUUM.
Jeremy Dunn wrote:
I've searched the archives and can't find an answer to this seemingly 
simple question.  Apologies if it's too common.
 
The table in question has ~1.3M rows.  It has 85 columns, 5 of which 
have single-column indexes.
 
The column in question (CID) has 183 distinct values.  For these values, 
the largest has ~38,000 rows, and the smallest has 1 row.  About 30 
values have < 100 rows, and about 10 values have > 20,000 rows.
 
The database is 7.2.3 running on RedHat 7.1. (we are in process of 
upgrading to PG 7.4.2)All of the query plan options are enabled, and 
the cpu costs are set to the default values. ( cpu_tuple_cost is 0.01, 
cpu_index_tuple_cost is 0.001).  The database is VACUUM'd every night.
 
The problem:
A simply query:
select count(*) from xxx where CID=
where  is a CID value which has relatively few rows, returns a 
plan using the index on that column.
 
   explain analyze select count(*) from xxx where cid=869366;
   Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual 
time=78.49..78.49 rows=1 loops=1)
 ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21 
rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
   Total runtime: 78.69 msec
 
The same plan is true for values which have up to about 20,000 rows:
 
   explain analyze select count(*) from xxx where cid=6223341;
   Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual 
time=11614.89..11614.89 rows=1 loops=1)
 ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26 
rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
   Total runtime: 11615.05 msec
However for the values that have > 20,000 rows, the plan changes to a 
sequential scan, which is proportionately much slower.
 
   explain analyze select count(*) from xxx where cid=7191032;
   Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual 
time=46427.81..46427.82 rows=1 loops=1)
->   Seq Scan on xxx (cost=0.00..97230.62 rows=50792 width=0) 
(actual time=9104.45..46370.27 rows=37765 loops=1)
Total runtime: 46428.00 msec
 
 
The question: why does the planner consider a sequential scan to be 
better for these top 10 values?  In terms of elapsed time it is more 
than twice as slow, proportionate to an index scan for the same number 
of rows.
 
What I tried:
 
A) alter table xxx alter column cid set statistics 500;   
analyze xxx;
This does not affect the results.
 
B)  dropped/rebuilt the index, with no improvement.
 
C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no 
success
 
D) force an index scan for the larger values by using a very high value 
for cpu_tuple_cost (e.g. .5) but this doesn't seem like a wise thing to do.
 
Your thoughts appreciated in advance!
 
- Jeremy 
 
7+ years experience in Oracle performance-tuning
relatively new to postgresql


--
Bill Moran
Potential Technologies
http://www.potentialtech.com


Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Jeremy Dunn
Sorry I should have written that we do VACUUM VERBOSE ANALYZE every
night.

- Jeremy

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Bill Moran
Sent: Monday, April 12, 2004 12:09 PM
To: [EMAIL PROTECTED]
Cc: Postgresql Performance
Subject: Re: [PERFORM] index v. seqscan for certain values


Quick bit of input, since you didn't mention it.

How often do you run ANALYZE?  I found it interesting that a database I
was doing tests on sped up by a factor of 20 after ANALYZE.  If your
data changes a lot, you should probably schedule ANALYZE to run with
VACUUM.






Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Stephan Szabo

On Mon, 12 Apr 2004, Jeremy Dunn wrote:

>explain analyze select count(*) from xxx where cid=6223341;
>Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual
> time=11614.89..11614.89 rows=1 loops=1)
>  ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26
> rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
>Total runtime: 11615.05 msec
>
> However for the values that have > 20,000 rows, the plan changes to a
> sequential scan, which is proportionately much slower.
>
>explain analyze select count(*) from xxx where cid=7191032;
>Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual
> time=46427.81..46427.82 rows=1 loops=1)
> ->   Seq Scan on xxx (cost=0.00..97230.62 rows=50792 width=0)
> (actual time=9104.45..46370.27 rows=37765 loops=1)
> Total runtime: 46428.00 msec
>
> The question: why does the planner consider a sequential scan to be
> better for these top 10 values?  In terms of elapsed time it is more
> than twice as slow, proportionate to an index scan for the same number
> of rows.

One thing to do is to set enable_seqscan=off and run the above and compare
the estimated and real costs.  It may be possible to lower
random_page_cost to a still reasonable number in order to move the point
of the switchover to seqscan.
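
Something like the following, for example (the cid value is taken from the plans
you posted; the random_page_cost value is only illustrative):

    set enable_seqscan = off;
    explain analyze select count(*) from xxx where cid=7191032;
    set enable_seqscan = on;

    -- if the forced index scan is cheaper in reality than the planner thought,
    -- try a lower (but still reasonable) random_page_cost and re-check the plan
    set random_page_cost = 3;
    explain analyze select count(*) from xxx where cid=7191032;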



Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Tom Lane
"Jeremy Dunn" <[EMAIL PROTECTED]> writes:
> The question: why does the planner consider a sequential scan to be
> better for these top 10 values?

At some point a seqscan *will* be better.  In the limit, if the key
being sought is common enough to occur on every page of the table,
it's certain that a seqscan will require less I/O than an indexscan
(because reading the index isn't actually saving you any heap fetches).
In practice the breakeven point is less than that because Unix kernels
are better at handling sequential than random access.

Your gripe appears to be basically that the planner's idea of the
breakeven point is off a bit.  It looks to me like it's within about
a factor of 2 of being right, though, which is not all that bad when
it's using generic cost parameters.

> A) alter table xxx alter column cid set statistics 500;
> analyze xxx;
> This does not affect the results.

It probably improved the accuracy of the row count estimates, no?
The estimate you show for cid=7191032 is off by more than 25% (37765 vs
50792), which seems like a lot of error for one of the most common
values in the table.  (I hope that was with default stats target and
not 500.)  That leads directly to a 25% overestimate of the cost of
an indexscan, while having IIRC no impact on the cost of a seqscan.
Since the cost ratio was more than 25%, this didn't change the selected
plan, but you want to fix that error as best you can before you move
on to tweaking cost parameters.

> C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no
> success

Wrong thing.  You should be tweaking random_page_cost.  Looks to me like
a value near 2 might be appropriate for your setup.  Also it is likely
appropriate to increase effective_cache_size, which is awfully small in
the default configuration.  I'd set that to something related to your
available RAM before trying to home in on a suitable random_page_cost.
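
For instance (numbers purely illustrative; effective_cache_size is measured in
8K disk pages, and both settings can be tried per-session with SET before you
commit anything to postgresql.conf):

    set effective_cache_size = 65536;   -- roughly 512MB worth of 8K pages
    set random_page_cost = 2;
    explain analyze select count(*) from xxx where cid=7191032;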

AFAIK hardly anyone bothers with changing the cpu_xxx costs ...

regards, tom lane



Re: [PERFORM] Index Backward Scan fast / Index Scan slow !

2004-04-12 Thread Pailloncy Jean-Gérard
Hi,

>> How to optimize the last query ? (~ 2000 times slower than the first one)
>> I suppose there is some odd distribution of data in the index ?
> Looks to me like a whole lot of dead rows at the left end of the index.
> Have you VACUUMed this table lately?
From pg_autovacuum:
[2004-04-10 05:45:39 AM] Performing: ANALYZE "public"."url"
[2004-04-10 11:13:25 AM] Performing: ANALYZE "public"."url"
[2004-04-10 03:12:14 PM] Performing: VACUUM ANALYZE "public"."url"
[2004-04-11 04:58:29 AM] Performing: ANALYZE "public"."url"
[2004-04-11 03:48:25 PM] Performing: ANALYZE "public"."url"
[2004-04-11 09:21:31 PM] Performing: ANALYZE "public"."url"
[2004-04-12 03:24:06 AM] Performing: ANALYZE "public"."url"
[2004-04-12 07:20:08 AM] Performing: VACUUM ANALYZE "public"."url"
> It would be interesting to see what VACUUM VERBOSE has to say about it.
dps=# VACUUM VERBOSE url;
INFO:  vacuuming "public.url"
INFO:  index "url_pkey" now contains 348972 row versions in 2344 pages
DETAIL:  229515 index row versions were removed.
41 index pages have been deleted, 41 are currently reusable.
CPU 0.32s/1.40u sec elapsed 70.66 sec.
INFO:  index "url_crc" now contains 215141 row versions in 497 pages
DETAIL:  108343 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.06s/0.96u sec elapsed 9.13 sec.
INFO:  index "url_seed" now contains 348458 row versions in 2987 pages
DETAIL:  229515 index row versions were removed.
345 index pages have been deleted, 345 are currently reusable.
CPU 0.40s/2.38u sec elapsed 74.26 sec.
INFO:  index "url_referrer" now contains 349509 row versions in 1964 
pages
DETAIL:  229515 index row versions were removed.
65 index pages have been deleted, 65 are currently reusable.
CPU 0.34s/1.53u sec elapsed 127.37 sec.
INFO:  index "url_next_index_time" now contains 349519 row versions in 
3534 pages
DETAIL:  229515 index row versions were removed.
3071 index pages have been deleted, 2864 are currently reusable.
CPU 0.32s/0.67u sec elapsed 76.25 sec.
INFO:  index "url_status" now contains 349520 row versions in 3465 pages
DETAIL:  229515 index row versions were removed.
2383 index pages have been deleted, 2256 are currently reusable.
CPU 0.35s/0.85u sec elapsed 89.25 sec.
INFO:  index "url_bad_since_time" now contains 349521 row versions in 
2017 pages
DETAIL:  229515 index row versions were removed.
38 index pages have been deleted, 38 are currently reusable.
CPU 0.54s/1.46u sec elapsed 83.77 sec.
INFO:  index "url_hops" now contains 349620 row versions in 3558 pages
DETAIL:  229515 index row versions were removed.
1366 index pages have been deleted, 1356 are currently reusable.
CPU 0.43s/0.91u sec elapsed 132.14 sec.
INFO:  index "url_siteid" now contains 350551 row versions in 3409 pages
DETAIL:  229515 index row versions were removed.
2310 index pages have been deleted, 2185 are currently reusable.
CPU 0.35s/1.01u sec elapsed 85.08 sec.
INFO:  index "url_serverid" now contains 350552 row versions in 3469 
pages
DETAIL:  229515 index row versions were removed.
1014 index pages have been deleted, 1009 are currently reusable.
CPU 0.54s/1.01u sec elapsed 120.40 sec.
INFO:  index "url_url" now contains 346563 row versions in 6494 pages
DETAIL:  213608 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 1.35s/2.07u sec elapsed 285.05 sec.
INFO:  index "url_last_mod_time" now contains 346734 row versions in 
1106 pages
DETAIL:  213608 index row versions were removed.
27 index pages have been deleted, 17 are currently reusable.
CPU 0.17s/0.95u sec elapsed 17.92 sec.
INFO:  "url": removed 229515 row versions in 4844 pages
DETAIL:  CPU 0.53s/1.26u sec elapsed 375.64 sec.
INFO:  "url": found 229515 removable, 310913 nonremovable row versions 
in 26488 pages
DETAIL:  29063 dead row versions cannot be removed yet.
There were 3907007 unused item pointers.
192 pages are entirely empty.
CPU 7.78s/17.09u sec elapsed 3672.29 sec.
INFO:  vacuuming "pg_toast.pg_toast_127397204"
INFO:  index "pg_toast_127397204_index" now contains 0 row versions in 
1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.06 sec.
INFO:  "pg_toast_127397204": found 0 removable, 0 nonremovable row 
versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.07 sec.
VACUUM

>> Is the solution to reindex data ?
> In 7.4 a VACUUM should be sufficient ... or at least, if it isn't
> I'd like to know why not before you destroy the evidence by reindexing.
Yes, of course.

Regards,
Jean-Gérard Pailloncy


Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Jeremy Dunn
> "Jeremy Dunn" <[EMAIL PROTECTED]> writes:
> > The question: why does the planner consider a sequential scan to be 
> > better for these top 10 values?
> 
> At some point a seqscan *will* be better.  In the limit, if 
> the key being sought is common enough to occur on every page 
> of the table, it's certain that a seqscan will require less 
> I/O than an indexscan (because reading the index isn't 
> actually saving you any heap fetches). In practice the 
> breakeven point is less than that because Unix kernels are 
> better at handling sequential than random access.
> 
> Your gripe appears to be basically that the planner's idea of 
> the breakeven point is off a bit.  It looks to me like it's 
> within about a factor of 2 of being right, though, which is 
> not all that bad when it's using generic cost parameters.

Agreed.  However, given that count(*) is a question that can be answered
_solely_ using the index (without reference to the actual data blocks),
I'd expect that the break-even point would be considerably higher than
the < 3% (~38,000 / ~1.3M) I'm currently getting.  Does PG not use
solely the index in this situation??

> > A) alter table xxx alter column cid set statistics 500;
> > analyze xxx;
> > This does not affect the results.
> 
> It probably improved the accuracy of the row count estimates, 
> no? The estimate you show for cid=7191032 is off by more than 
> 25% (37765 vs 50792), which seems like a lot of error for one 
> of the most common values in the table.  (I hope that was 
> with default stats target and not 500.)  That leads directly 
> to a 25% overestimate of the cost of an indexscan, while 
> having IIRC no impact on the cost of a seqscan. Since the 
> cost ratio was more than 25%, this didn't change the selected 
> plan, but you want to fix that error as best you can before 
> you move on to tweaking cost parameters.

Actually it made them worse!  Yes, this was with the default statistics target (10).
When I just tried it again with a value of 300, ran analyze, and then ran the
query, I got a *worse* row estimate.  I don't understand this.

   alter table xxx alter column cid set statistics 300;
   analyze emailrcpts;
   set random_page_cost to 2;
   explain analyze select count(*) from xxx where cid=7191032;

   Aggregate  (cost=20563.28..20563.28 rows=1 width=0) (actual time=7653.90..7653.90 rows=1 loops=1)
     ->  Index Scan using xxx_cid on xxx  (cost=0.00..20535.82 rows=10983 width=0) (actual time=72.24..7602.38 rows=37765 loops=1)
   Total runtime: 7654.14 msec

Now it estimates I have only 10,983 rows (~3x too low) instead of the old
estimate of 50,792 (1.3x too high).  Why is that?

Anyway, a workable solution seems to be using a lower value for
Random_Page_Cost.  Thanks to everyone who replied with this answer. 

> Also it is likely appropriate to increase 
> effective_cache_size, which is awfully small in the default 
> configuration.  I'd set that to something related to your 
> available RAM before trying to home in on a suitable random_page_cost.

We have ours set to the default value of 1000, which does seem low for a
system with 1GB of RAM.  We'll up this once we figure out what's
available.  Then tweak the Random_Page_Cost appropriately at that point.

I'd still like to understand the strangeness above, if anyone can shed
light.

- Jeremy




Re: [PERFORM] Index Backward Scan fast / Index Scan slow ! (Modified by Pailloncy Jean-Gérard)

2004-04-12 Thread Pailloncy Jean-Gérard
Hi,

> In 7.4 a VACUUM should be sufficient ... or at least, if it isn't
After VACUUM:
dps=# explain analyze select next_index_time from url order by next_index_time desc limit 1;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.62 rows=1 width=4) (actual time=0.098..0.099 rows=1 loops=1)
   ->  Index Scan Backward using url_next_index_time on url  (cost=0.00..814591.03 rows=310913 width=4) (actual time=0.096..0.096 rows=1 loops=1)
 Total runtime: 0.195 ms
(3 rows)

dps=# explain analyze select next_index_time from url order by next_index_time asc limit 1;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.62 rows=1 width=4) (actual time=13504.105..13504.106 rows=1 loops=1)
   ->  Index Scan using url_next_index_time on url  (cost=0.00..814591.03 rows=310913 width=4) (actual time=13504.099..13504.099 rows=1 loops=1)
 Total runtime: 13504.158 ms
(3 rows)

Better, but..

Regards,
Jean-Gérard Pailloncy


Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Bruno Wolff III
On Mon, Apr 12, 2004 at 15:05:02 -0400,
  Jeremy Dunn <[EMAIL PROTECTED]> wrote:
> 
> Agreed.  However, given that count(*) is a question that can be answered
> _solely_ using the index (without reference to the actual data blocks),
> I'd expect that the break-even point would be considerably higher than
> the < 3% (~38,000 / ~1.3M) I'm currently getting.  Does PG not use
> solely the index in this situation??

That isn't true. In order to check visibility you need to look at the
data blocks.



Re: [PERFORM] index v. seqscan for certain values

2004-04-12 Thread Tom Lane
"Jeremy Dunn" <[EMAIL PROTECTED]> writes:
> Agreed.  However, given that count(*) is a question that can be answered
> _solely_ using the index (without reference to the actual data blocks),

As Bruno noted, that is not the case in Postgres; we must visit the
table rows anyway.

> When I just tried it again with a value of 300, analyze, then run the
> query, I get a *worse* result for an estimate.  I don't understand this.

That's annoying.  How repeatable are these results --- if you do ANALYZE
over again several times, how much does the row count estimate change
each time?  (It should change somewhat, since ANALYZE is taking a random
sample, but one would like to think not a whole lot.)  Is the variance
more or less at the higher stats target?  Take a look at a few different
CID values to get a sense of the accuracy, don't look at just one ...
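
Something along these lines would show the variance (sketch; substitute a few
of your own CID values):

    analyze xxx;
    explain select count(*) from xxx where cid=7191032;
    -- repeat the ANALYZE/EXPLAIN cycle a few times and watch how much the
    -- estimated row count moves around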

(Actually, you might find it more profitable to look at the pg_stats
entry for the CID column rather than reverse-engineering the stats via
ANALYZE.  Look at how well the most-common-values list and associated
frequency numbers track reality.)
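
That is, something like this (sketch):

    select n_distinct, most_common_vals, most_common_freqs
    from pg_stats
    where tablename = 'xxx' and attname = 'cid';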

Also, can you think of any reason for the distribution of CID values
to be nonuniform within the table?  For instance, do rows get inserted
in order of increasing CID, or is there any clustering of rows with the
same CID?

regards, tom lane



Re: [PERFORM] Index Backward Scan fast / Index Scan slow !

2004-04-12 Thread Tom Lane
[ Ah, I just got to your message with the VACUUM VERBOSE results ... ]

Pailloncy Jean-Gérard <[EMAIL PROTECTED]> writes:
> dps=# VACUUM VERBOSE url;
> INFO:  index "url_next_index_time" now contains 349519 row versions in 
> 3534 pages
> DETAIL:  229515 index row versions were removed.
> 3071 index pages have been deleted, 2864 are currently reusable.
> CPU 0.32s/0.67u sec elapsed 76.25 sec.

Hm, this is odd.  That says you've got 349519 live index entries in only
463 actively-used index pages, or an average of 754 per page, which
AFAICS could not fit in an 8K page.  Are you using a nondefault value of
BLCKSZ?  If so what?
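
Rough arithmetic, using the numbers above and assuming the default 8K block:
a btree leaf entry for a single timestamp key takes somewhere around 20 bytes
including item overhead, so a page tops out on the order of 400 entries.

    select 349519.0 / (3534 - 3071) as live_entries_per_active_page;
    -- ~755 entries per active page, which cannot fit in an 8K page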

If you *are* using default BLCKSZ then this index must be corrupt, and
what you probably need to do is REINDEX it.  But before you do that,
could you send me a copy of the index file?

regards, tom lane



Re: Re: [PERFORM] Index Backward Scan fast / Index Scan slow ! (Modified by Pailloncy Jean-Gérard)

2004-04-12 Thread Tom Lane
Pailloncy Jean-Gérard <[EMAIL PROTECTED]> writes:
>> In 7.4 a VACUUM should be sufficient ... or at least, if it isn't
> After VACUUM:
> Better, but..

... but not much :-(.  Okay, could we see VACUUM VERBOSE results for
this table?

regards, tom lane



[PERFORM] Tracking down performance issue

2004-04-12 Thread Ken Geis
I'm running 7.4.2 on an older Linux box (450 MHz AMD K6-III, 450 MB RAM)
running kernel 2.6.5.  My client is a Java/JDBC program on Windows.

I'm having trouble seeing where the bottleneck in my performance is. 
The client uses about 30% CPU.  The server uses 70% CPU plus 1.5% I/O 
wait.  The I/O wait is very low because I'm doing a PK index scan where 
the index and data are on different disks and the table is clustered on 
the PK index.  The network is 100Mb, and it's at 7% of capacity.

I tried making the client operate on two threads on two database 
connections.  That bumped the server utilization to 80% and barely 
changed the I/O wait.  The throughput decreased by a third.

The only thing I can think of is memory bandwidth.  Does anyone have 
tips on how I can investigate more?
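
For instance, would comparing server-side EXPLAIN ANALYZE timings against what
the JDBC client sees be a sensible next step?  (EXPLAIN ANALYZE runs the query
but doesn't ship the rows to the client, so a big gap would point at output
conversion, the network, or the driver; the table name below is just a
placeholder for the real one.)

    explain analyze select * from some_big_table order by id;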
