[GENERAL] bug in query planning?

2003-12-21 Thread Steven D.Arnold
I have a query that does not use indexes it should use.  I have
discovered some interesting behavior in Postgres that may indicate a
bug in the database's query planning.

Take a look at the queries below.  There is a btree index on both
m.account_id and a.account_id.  Query (1) does not use the index on the
messages table, instead opting for a full table scan, which kills
performance.  The messages table can hold hundreds of thousands or even
millions of rows; even at 50,000 rows, it's murder.
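
For reference, here is roughly what the relevant schema looks like.
This is a sketch only: the index names come from the EXPLAIN output
below, but the exact column types and any columns beyond these are
guesses.

create table accounts (
    account_id  bigint primary key,   -- backs the accounts_pkey index
    email       text
);
create index accounts_email on accounts (email);

create table messages (
    message_id  bigint,
    account_id  bigint                -- refers to accounts.account_id
);
create index messages_account_id on messages (account_id);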

Query (2) below is the same query with the tables in the opposite
order.  It's obviously not quite the same query semantically, even
though in my case it should always produce the same result.  It is
interesting to note that it does use the indexes, though.

Finally, query (3) below uses traditional (non-ANSI) join syntax, and
the indexes are used correctly there.  The implication is that Postgres
does not correctly analyze queries written with ANSI joins: indexes are
sometimes skipped when they should be used.  This seems like a bug in
Postgres.  I'm using Postgres version 7.3.4.

Thanks in advance for any comments...
steve
Query (1)
=
defender=# explain analyze
defender-# select count(message_id)
defender-# from   messages m
defender-# left join accounts a
defender-# on     m.account_id::bigint = a.account_id::bigint
defender-# where  a.email = '[EMAIL PROTECTED]';
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20461.10..20461.10 rows=1 width=47) (actual time=1420.09..1420.09 rows=1 loops=1)
   ->  Hash Join  (cost=30.77..20334.38 rows=50687 width=47) (actual time=0.51..1319.69 rows=51419 loops=1)
         Hash Cond: ("outer".account_id = "inner".account_id)
         Filter: ("inner".email = '[EMAIL PROTECTED]'::text)
         ->  Seq Scan on messages m  (cost=0.00..19289.87 rows=50687 width=16) (actual time=0.06..703.89 rows=52541 loops=1)
         ->  Hash  (cost=30.76..30.76 rows=3 width=31) (actual time=0.40..0.40 rows=0 loops=1)
               ->  Index Scan using accounts_pkey on accounts a  (cost=0.00..30.76 rows=3 width=31) (actual time=0.17..0.38 rows=3 loops=1)
 Total runtime: 1420.25 msec
(8 rows)

Query (2)
=
defender=# explain analyze
defender-# select count(message_id)
defender-# from   accounts a
defender-# left join messages m
defender-# on     a.account_id::bigint = m.account_id::bigint
defender-# where  a.email = '[EMAIL PROTECTED]';

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6806.54..6806.54 rows=1 width=24) (actual time=792.14..792.14 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..6764.30 rows=16896 width=24) (actual time=0.38..718.12 rows=51419 loops=1)
         ->  Index Scan using accounts_email on accounts a  (cost=0.00..8.98 rows=1 width=8) (actual time=0.22..0.25 rows=1 loops=1)
               Index Cond: (email = '[EMAIL PROTECTED]'::text)
         ->  Index Scan using messages_account_id on messages m  (cost=0.00..6544.13 rows=16896 width=16) (actual time=0.15..593.15 rows=51419 loops=1)
               Index Cond: ("outer".account_id = m.account_id)
 Total runtime: 792.33 msec
(7 rows)

Query (3)
=
defender=# explain analyze
defender-# select count(message_id)
defender-# from   messages m, accounts a
defender-# where  m.account_id::bigint = a.account_id::bigint
defender-# and    a.email = '[EMAIL PROTECTED]';

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6806.54..6806.54 rows=1 width=24) (actual time=782.30..782.30 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..6764.30 rows=16896 width=24) (actual time=0.33..708.52 rows=51422 loops=1)
         ->  Index Scan using accounts_email on accounts a  (cost=0.00..8.98 rows=1 width=8) (actual time=0.15..0.18 rows=1 loops=1)
               Index Cond: (email = '[EMAIL PROTECTED]'::text)
         ->  Index Scan using messages_account_id on messages m  (cost=0.00..6544.13 rows=16896 width=16) (actual time=0.15..578.23 rows=51422 loops=1)
               Index Cond: (m.account_id = "outer".account_id)
 Total runtime: 782.46 msec
(7 rows)



Re: [GENERAL] bug in query planning?

2003-12-22 Thread Steven D.Arnold
On Dec 21, 2003, at 11:47 PM, Tom Lane wrote:

> "Steven D.Arnold" <[EMAIL PROTECTED]> writes:
>> Query (2) below is the same query, but we reverse the order of the
>> tables.  It's obviously not quite the same query semantically, even
>> though in my case it should always produce the same result.
>
> Since it is in fact not the same query, I'm unclear on why you expect
> it to produce the same plan.

What I expect is for both queries to use the index on the messages
table!  Why is it not doing that?

> FWIW, I believe that 7.4 will recognize that (1) and (3) are
> semantically equivalent.
I will try 7.4 and report back.
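
(A note for the archives on why (1) and (3) can be treated as
equivalent: the WHERE clause rejects the NULL-extended rows that the
LEFT JOIN would otherwise add, since a NULL email can never equal a
constant, so the outer join degenerates into a plain inner join.  A
sketch of the reduction, using the same tables as above:

-- Rows where the accounts side is all NULLs fail the WHERE test,
-- so this LEFT JOIN ...
select count(message_id)
from   messages m
left join accounts a on m.account_id = a.account_id
where  a.email = '[EMAIL PROTECTED]';

-- ... returns exactly the same rows as this inner join, which is the
-- rewrite 7.4 apparently makes on its own:
select count(message_id)
from   messages m
join   accounts a on m.account_id = a.account_id
where  a.email = '[EMAIL PROTECTED]';
)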

steve



Re: [GENERAL] bug in query planning?

2003-12-26 Thread Steven D.Arnold
Thanks to all for the detailed replies.  I just wanted to let everyone
know -- for future Google searches as much as anything else -- that
dumping the database, upgrading to 7.4.1, and reloading did solve the
problem.  All the queries I mentioned now use the available indexes,
except in understandable cases, such as when a table has very few rows.

Thanks for the tip, and thanks for the improvements in 7.4.1 that fixed
this problem.

steve



[GENERAL] MVCC for massively parallel inserts

2004-01-05 Thread Steven D.Arnold
How good is Postgres' performance for massive simultaneous insertions 
into the same heavily-indexed table?  Are there any studies or 
benchmarks I can look at for that?

I understand Postgres uses MVCC rather than conventional locking, which
makes parallel inserts easier.  In my environment, I will have so many
inserts that it is unworkable to have one machine do all the inserting
-- it would max out the CPU of even a very powerful machine, and in any
case I'd like to avoid spending that much money.  One option is to use
a cluster of commodity Intel machines running Linux or one of the BSDs.
In many database environments, that wouldn't buy me much, because only
one machine could do inserts while all the others handled selects.  But
I'm going to have tons of inserts and few selects, so I really need
many of the machines in the cluster to be able to insert
simultaneously.  Is this practicable in a clustered environment for
Postgres?
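
To make the workload concrete, here is a minimal sketch of what each
node would be doing.  The events table and its columns are made up for
illustration; the question is how Postgres behaves when many
connections run loops like this against the same indexed table at once.

create table events (
    event_id  bigint,
    node_id   integer,
    payload   text
);
create index events_node_id on events (node_id);

-- Each cluster node runs many of these concurrently.  Under MVCC the
-- inserts don't block one another, and batching them in a single
-- transaction amortizes the per-commit overhead:
begin;
insert into events (event_id, node_id, payload) values (1, 42, '...');
insert into events (event_id, node_id, payload) values (2, 42, '...');
-- ... hundreds more per batch ...
commit;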

Thanks in advance for any insight or references,
steve