Re: [PERFORM] Question about difference in performance of 2 queries

2003-12-30 Thread Nicholas Shanny
Tom,

I understand the problem and your solution makes sense although I am  
still puzzled by the machine under-utilization. If you run the original  
query and monitor the IO/CPU usage you find that it is minimal.

here is the output from iostat 1 for a brief portion of the query. I am  
very curious to understand why when scanning the index the IO/CPU  
utilization is seemingly low.

Cheers
Nick Shanny
TripAdvisor, Inc.
  0   77 32.00 106  3.31   0.00   0  0.00   0.00   0  0.00   0  0  2  0  
98
   0   76 32.00 125  3.90   0.00   0  0.00   0.00   0  0.00   0  0  2   
0 97
   0   76 32.00 125  3.90   0.00   0  0.00   0.00   0  0.00   0  0  1   
1 98
   0   76 32.75 127  4.05   0.00   0  0.00   0.00   0  0.00   0  0  1   
0 99
  tty   aacd0 acd0  fd0  
cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy  
in id
   0   76 32.00 127  3.96   0.00   0  0.00   0.00   0  0.00   0  0  3   
0 97
   0  229 32.24 135  4.24   0.00   0  0.00   0.00   0  0.00   0  0  4   
0 95
   0   76 32.00 129  4.02   0.00   0  0.00   0.00   0  0.00   0  0  2   
0 97
   0   76 32.00 123  3.84   0.00   0  0.00   0.00   0  0.00   0  0  2   
0 98
   0   76 31.72 115  3.56   0.00   0  0.00   0.00   0  0.00   0  0  2   
0 98
   0   76 32.50 126  3.99   0.00   0  0.00   0.00   0  0.00   0  0  3   
1 96
   0   76 32.00 123  3.84   0.00   0  0.00   0.00   0  0.00   0  0  3   
0 97
   0   76 32.00 122  3.81   0.00   0  0.00   0.00   0  0.00   1  0  2   
0 97
   0   76 32.00 135  4.21   0.00   0  0.00   0.00   0  0.00   0  0  2   
1 97
   0   76 32.00  97  3.03   0.00   0  0.00   0.00   0  0.00   0  0  3   
0 97

On Dec 29, 2003, at 2:39 PM, Tom Lane wrote:

Sean Shanny <[EMAIL PROTECTED]> writes:
Here is the pg_stats data.  The explain analyze queries are still  
running.

select * from pg_stats where tablename = 'f_pageviews' and attname =
'content_key';
 schemaname |  tablename  |   attname   | null_frac | avg_width |
n_distinct | most_common_vals |   most_common_freqs
|
histogram_bounds   | correlation
+-+-+---+--- 
++--+--- 
+- 
+-
 public | f_pageviews | content_key | 0 | 4 |
983 | {-1,1528483} | {0.749333,0.0017} |
Oh-ho, I see the problem: about 75% of your table has content_key = -1.

Why is that a problem, you ask?  Well, the planner realizes that
"content_key > -1" is a pretty good restriction condition (better than
the date condition, apparently) and so it tries to use that as the  
index
scan condition.  The problem is that in 7.4 and before, the btree index
code implements a "> -1" scan starting boundary by finding the first -1
and then advancing to the first key that's not -1.  So you end up
scanning through 75% of the index before anything useful happens :-(

I just fixed this poor behavior in CVS tip a couple weeks ago:
http://archives.postgresql.org/pgsql-committers/2003-12/msg00220.php
but the patch seems too large and unproven to risk back-patching into
7.4.*.
If you expect that a pretty large fraction of your data will always  
have
dummy content_key, it'd probably be worth changing the index to not
index -1's at all --- that is, make it a partial index with the
condition "WHERE content_key > -1".  Another workaround is to leave the
index as-is but phrase the query WHERE condition as "content_key >= 0"
instead of "> -1".

			regards, tom lane

---(end of  
broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


[PERFORM] DELETE ... WHERE ctid IN (...) vs. Iteration

2003-12-30 Thread Jonathan Gardner
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I guess this may have come up before, but now that 7.4 has the IN with 
improved performance, it may be time to revisit this topic.

Compare these two algorithms (in plpgsql):

(a)
DELETE FROM foo WHERE ctid IN (
SELECT foo.ctid
FROM ... WHERE ...
);

(b)
FOR result IN SELECT foo.ctid FROM ... WHERE ... LOOP
DELETE FROM foo WHERE ctid = result;
END LOOP;

My poor understanding of how the IN operator works leaves me to believe 
that for a large set of data in the IN group, a hash is used and a 
tablescan done on foo.  However, for a small set of data in the IN group, 
no tablescan is performed.

I assume that (a) works at O(ln(N)) for large N, and O(N) for small N, 
while (b) works at O(N) universally. Therefore, (a) is the superior 
algorithm. I welcome criticism and correction.

- -- 
Jonathan Gardner
[EMAIL PROTECTED]
Live Free, Use Linux!
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/8aipWgwF3QvpWNwRAk8GAJoDWISjxG7LMB1FdCFmwlOafsmZTwCePx18
lyHLNBJ8nP0RHzv6WfRzQ+M=
=FPdW
-END PGP SIGNATURE-

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster