[PERFORM] Question about difference in performance of 2 queries on large table

2003-12-29 Thread Sean Shanny
To all,

The facts:

PostgreSQL 7.4.0 running on BSD 5.1 on Dell 2650 with 4GB RAM, 5 SCSI 
drives in hardware RAID 0 configuration.  Database size with indexes is 
currently 122GB.  Schema for the table in question is at the end of this 
email.  The DB has been vacuumed full and analyzed.  Between  2 and 3 
million records are added to the table in question each night.  An 
analyze is run on the entire DB after the data has been loaded each 
night.  There are no updates or deletes of records during the nightly 
load, only insertions.

I am trying to understand why the performance between the two queries 
below is so different. I am trying to find the count of all pages that 
have a 'valid' content_key.  -1 is our 'we don't have any content' key.  
The first plan below has horrendous performance.  we only get about 2% 
CPU usage and iostat shows 3-5 MB/sec IO.  The second plan runs at 30% 
cpu and 15-30MB.sec IO. 

Could someone shed some light on why the huge difference in 
performance?  Both are doing index scans plus a filter.  We have no 
content_keys below -1 at this time so the queries return the same results.

Thanks.

--sean

explain select count (distinct (persistent_cookie_key) ) from 
f_pageviews where date_key between 305 and 334 and content_key > -1;
QUERY PLAN

Aggregate  (cost=688770.29..688770.29 rows=1 width=4)
  ->  Index Scan using idx_pageviews_content on f_pageviews  
(cost=0.00..645971.34 rows=17119580 width=4)
Index Cond: (content_key > -1)
Filter: ((date_key >= 305) AND (date_key <= 334))
(4 rows)

explain select count (distinct (persistent_cookie_key) ) from 
f_pageviews where date_key between 305 and 334 and content_key <> -1;
   QUERY PLAN
--
Aggregate  (cost=1365419.12..1365419.12 rows=1 width=4)
  ->  Index Scan using idx_pageviews_date_nov_2003 on f_pageviews  
(cost=0.00..1322615.91 rows=17121284 width=4)
Index Cond: ((date_key >= 305) AND (date_key <= 334))
Filter: (content_key <> -1)
(4 rows)

\d f_pageviews
  Table "public.f_pageviews"
Column |  Type   |  Modifiers
+-+-
id | integer | not null default 
nextval('public.f_pageviews_id_seq'::text)
date_key   | integer | not null
time_key   | integer | not null
content_key| integer | not null
location_key   | integer | not null
session_key| integer | not null
subscriber_key | text| not null
persistent_cookie_key  | integer | not null
ip_key | integer | not null
referral_key   | integer | not null
servlet_key| integer | not null
tracking_key   | integer | not null
provider_key   | text| not null
marketing_campaign_key | integer | not null
orig_airport   | text| not null
dest_airport   | text| not null
commerce_page  | boolean | not null default false
job_control_number | integer | not null
sequenceid | integer | not null default 0
url_key| integer | not null
useragent_key  | integer | not null
web_server_name| text| not null default 'Not Available'::text
cpc| integer | not null default 0
referring_servlet_key  | integer | default 1
first_page_key | integer | default 1
newsletterid_key   | text| not null default 'Not Available'::text
Indexes:
   "f_pageviews_pkey" primary key, btree (id)
   "idx_pageviews_content" btree (content_key)
   "idx_pageviews_date_dec_2003" btree (date_key) WHERE ((date_key >= 
335) AND (date_key <= 365))
   "idx_pageviews_date_nov_2003" btree (date_key) WHERE ((date_key >= 
304) AND (date_key <= 334))
   "idx_pageviews_referring_servlet" btree (referring_servlet_key)
   "idx_pageviews_servlet" btree (servlet_key)
   "idx_pageviews_session" btree (session_key)



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Question about difference in performance of 2 queries on large table

2003-12-29 Thread Tom Lane
Please show EXPLAIN ANALYZE output for your queries, not just EXPLAIN.
Also it would be useful to see the pg_stats rows for the date_key and
content_key columns.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-29 Thread Dennis Bjorklund
On Mon, 29 Dec 2003, Sean Shanny wrote:

> The first plan below has horrendous performance.  we only get about 2% 
> CPU usage and iostat shows 3-5 MB/sec IO.  The second plan runs at 30% 
> cpu and 15-30MB.sec IO. 
> 
> Could someone shed some light on why the huge difference in 
> performance?  Both are doing index scans plus a filter.  We have no 
> content_keys below -1 at this time so the queries return the same results.

EXPLAIN ANALYZE gives more information then EXPLAIN, and is prefered.

It uses different indexes in the two queries, and one seems to be 
faster then the other. Why, I can't tell yet.

I would assume that you would get the fastet result if you had an index 

   (content_key, date_key)

I don't know if pg will even use an index to speed up a <> operation. When 
you had > then it could use the idx_pageviews_content index. Why it choose 
that when the other would be faster I don't know. Maybe explain analyze 
will give some hint.

-- 
/Dennis


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-29 Thread Sean Shanny
I am running explain analyze now and will post results as they finish.

Thanks.

--sean

Tom Lane wrote:

Please show EXPLAIN ANALYZE output for your queries, not just EXPLAIN.
Also it would be useful to see the pg_stats rows for the date_key and
content_key columns.
			regards, tom lane

 



---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-29 Thread Sean Shanny
Here is the pg_stats data.  The explain analyze queries are still running.

select * from pg_stats where tablename = 'f_pageviews' and attname = 
'date_key';
schemaname |  tablename  | attname  | null_frac | avg_width | 
n_distinct | most_common_vals  
| 
most_common_freqs 
|   histogram_bounds| correlation
+-+--+---+---++---+---+---+-
public | f_pageviews | date_key | 0 | 4 | 
60 | {335,307,309,336,308,321,314,342,322,316} | 
{0.028,0.024,0.024,0.024,0.024,0.023,0.0226667,0.0226667,0.022,0.0216667} 
| {304,311,318,325,329,334,341,346,351,356,363} |0.345026
(1 row)

select * from pg_stats where tablename = 'f_pageviews' and attname = 
'content_key';
schemaname |  tablename  |   attname   | null_frac | avg_width | 
n_distinct | most_common_vals |   most_common_freqs   
|  
histogram_bounds   | correlation
+-+-+---+---++--+---+-+-
public | f_pageviews | content_key | 0 | 4 |
983 | {-1,1528483} | {0.749333,0.0017} | 
{38966,323835,590676,717061,919148,1091875,1208244,1299702,1375366,1434079,1528910} 
|0.103399
(1 row)

Thanks.

--sean

Tom Lane wrote:

Please show EXPLAIN ANALYZE output for your queries, not just EXPLAIN.
Also it would be useful to see the pg_stats rows for the date_key and
content_key columns.
			regards, tom lane

 



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-29 Thread Sean Shanny
Here is one of the explain analyzes.  This is the from the faster 
query.  Ignore the total runtime as we are currently doing other queries 
on this machine so it is slightly loaded.

Thanks.

--sean

explain analyze select count (distinct (persistent_cookie_key) ) from 
f_pageviews where date_key between 305 and 334 and content_key <> -1;
   
QUERY PLAN
   
--
Aggregate  (cost=1384925.95..1384925.95 rows=1 width=4) (actual 
time=4541462.030..4541462.034 rows=1 loops=1)
  ->  Index Scan using idx_pageviews_date_nov_2003 on f_pageviews  
(cost=0.00..1343566.52 rows=16543772 width=4) (actual 
time=83.267..4286664.678 rows=15710722 loops=1)
Index Cond: ((date_key >= 305) AND (date_key <= 334))
Filter: (content_key <> -1)
Total runtime: 4541550.832 ms
(5 rows)

Tom Lane wrote:

Please show EXPLAIN ANALYZE output for your queries, not just EXPLAIN.
Also it would be useful to see the pg_stats rows for the date_key and
content_key columns.
			regards, tom lane

 



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-29 Thread Tom Lane
Sean Shanny <[EMAIL PROTECTED]> writes:
> Here is the pg_stats data.  The explain analyze queries are still running.

> select * from pg_stats where tablename = 'f_pageviews' and attname = 
> 'content_key';
>  schemaname |  tablename  |   attname   | null_frac | avg_width | 
> n_distinct | most_common_vals |   most_common_freqs   
> |  
> histogram_bounds   | correlation
> +-+-+---+---++--+---+-+-
>  public | f_pageviews | content_key | 0 | 4 |
> 983 | {-1,1528483} | {0.749333,0.0017} | 

Oh-ho, I see the problem: about 75% of your table has content_key = -1.

Why is that a problem, you ask?  Well, the planner realizes that
"content_key > -1" is a pretty good restriction condition (better than
the date condition, apparently) and so it tries to use that as the index
scan condition.  The problem is that in 7.4 and before, the btree index
code implements a "> -1" scan starting boundary by finding the first -1
and then advancing to the first key that's not -1.  So you end up
scanning through 75% of the index before anything useful happens :-(

I just fixed this poor behavior in CVS tip a couple weeks ago:
http://archives.postgresql.org/pgsql-committers/2003-12/msg00220.php
but the patch seems too large and unproven to risk back-patching into
7.4.*.

If you expect that a pretty large fraction of your data will always have
dummy content_key, it'd probably be worth changing the index to not
index -1's at all --- that is, make it a partial index with the
condition "WHERE content_key > -1".  Another workaround is to leave the
index as-is but phrase the query WHERE condition as "content_key >= 0"
instead of "> -1".

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-29 Thread Sean Shanny
Tom,

Thanks.  I will make the changes you suggest concerning the indexes.  I 
am finding partial indexes to be very handy.  :-)

I canceled the explain analyze on the other query as we have found the 
problem and who knows how long it would take to complete.

Thanks again.

--sean

Tom Lane wrote:

Sean Shanny <[EMAIL PROTECTED]> writes:
 

Here is the pg_stats data.  The explain analyze queries are still running.
   

 

select * from pg_stats where tablename = 'f_pageviews' and attname = 
'content_key';
schemaname |  tablename  |   attname   | null_frac | avg_width | 
n_distinct | most_common_vals |   most_common_freqs   
|  
histogram_bounds   | correlation
+-+-+---+---++--+---+-+-
public | f_pageviews | content_key | 0 | 4 |
983 | {-1,1528483} | {0.749333,0.0017} | 
   

Oh-ho, I see the problem: about 75% of your table has content_key = -1.

Why is that a problem, you ask?  Well, the planner realizes that
"content_key > -1" is a pretty good restriction condition (better than
the date condition, apparently) and so it tries to use that as the index
scan condition.  The problem is that in 7.4 and before, the btree index
code implements a "> -1" scan starting boundary by finding the first -1
and then advancing to the first key that's not -1.  So you end up
scanning through 75% of the index before anything useful happens :-(
I just fixed this poor behavior in CVS tip a couple weeks ago:
http://archives.postgresql.org/pgsql-committers/2003-12/msg00220.php
but the patch seems too large and unproven to risk back-patching into
7.4.*.
If you expect that a pretty large fraction of your data will always have
dummy content_key, it'd probably be worth changing the index to not
index -1's at all --- that is, make it a partial index with the
condition "WHERE content_key > -1".  Another workaround is to leave the
index as-is but phrase the query WHERE condition as "content_key >= 0"
instead of "> -1".
			regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html

 



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[PERFORM] deferred foreign keys

2003-12-29 Thread Vivek Khera
I'm observing that when I have many processes doing some work on my
system that the transactions run along almost in lockstep.  It appears
from messages posted here that the foreign keys are acquiring and
holding locks during the transactions, which seems like it would cause
this behavior.

I'd like to experiment with deferred foreign key checks so that the
lock is only held during the commit when the checks are done.

My questions are:

1) can I, and if so, how do I convert my existing FK's to deferrable
   without drop/create of the keys.  Some of the keys take a long time
   to create and I'd like to avoid the hit.

2) do I increase the liklihood of deadlocks when the FK locks are
   being acquired or is it just as likeley as with the current
   non-deferred checking?

I'm running 7.4 (soon to be 7.4.1)


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.Khera Communications, Inc.
Internet: [EMAIL PROTECTED]   Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match