Re: [PERFORM] Help with tuning this query (with explain analyze finally)

2005-03-03 Thread Ken Egervari
2) Force PG to drop the merge join via SET ENABLE_MERGEJOIN = FALSE;
Actually, it was 312 milliseconds, so it got worse.
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] name search query speed

2005-03-03 Thread Jeremiah Jahn
I have about 5M names stored in my DB. Currently the searches are very
quick unless they are on a very common last name, i.e. SMITH. The index
is always used, but I still hit 10-20 seconds on a SMITH or JONES
search. I average about 6 searches a second and max out at about
30/s. Any suggestions on how I could arrange things to make this search
quicker? I have 4GB of memory on a RAID 5 w/ 3 drives. I'm hoping that I
can increase this speed w/o a HW upgrade.

thanx,
-jj-



-- 
You probably wouldn't worry about what people think of you if you could
know how seldom they do.
-- Olin Miller.




Re: [PERFORM] name search query speed

2005-03-03 Thread Ken Egervari
I'm not sure what the answer is but maybe I can help?  Would clustering the 
name index make this faster? I thought that would bunch up the pages so the 
names were more or less in order, which would improve search time.  Just a 
guess though.

Ken
- Original Message - 
From: "Jeremiah Jahn" <[EMAIL PROTECTED]>
To: "postgres performance" 
Sent: Thursday, March 03, 2005 11:38 AM
Subject: [PERFORM] name search query speed


I have about 5M names stored on my DB. Currently the searches are very
quick unless, they are on a very common last name ie. SMITH. The Index
is always used, but I still hit 10-20 seconds on a SMITH or Jones
search, and I average about 6 searches a second and max out at about
30/s. Any suggestions on how I could arrange things to make this search
quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
can increase this speed w/o a HW upgrade.
thanx,
-jj-




Re: [PERFORM] name search query speed

2005-03-03 Thread Jeremiah Jahn
Yes, it does. I forgot to mention that I also have clustering on that
table by my name_field index. My bad.

On Thu, 2005-03-03 at 12:00 -0500, Ken Egervari wrote:
> I'm not sure what the answer is but maybe I can help?  Would clustering the 
> name index make this faster? I thought that would bunch up the pages so the 
> names were more or less in order, which would improve search time.  Just a 
> guess though.
> 
> Ken
> 
> - Original Message - 
> From: "Jeremiah Jahn" <[EMAIL PROTECTED]>
> To: "postgres performance" 
> Sent: Thursday, March 03, 2005 11:38 AM
> Subject: [PERFORM] name search query speed
> 
> 
> >I have about 5M names stored on my DB. Currently the searches are very
> > quick unless, they are on a very common last name ie. SMITH. The Index
> > is always used, but I still hit 10-20 seconds on a SMITH or Jones
> > search, and I average about 6 searches a second and max out at about
> > 30/s. Any suggestions on how I could arrange things to make this search
> > quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
> > can increase this speed w/o a HW upgrade.
> >
> > thanx,
> > -jj-
-- 
You probably wouldn't worry about what people think of you if you could
know how seldom they do.
-- Olin Miller.




Re: [PERFORM] Help with tuning this query (with explain analyze finally)

2005-03-03 Thread Josh Berkus
Ken,

Well, I'm a bit stumped on troubleshooting the actual query, since Windows' 
poor time resolution makes it impossible to trust the actual execution times.  
Obviously this is something we need to look into for the Win32 port for 
8.1.

> shared_buffers = 1000

This may be slowing up that merge join.  Try resetting it to 6000.  I'm not 
sure what system settings you might have to do on Windows to get it to 
support higher shared buffers; see the docs.

> #work_mem = 1024

Up this to 4096 for testing purposes; your production value will vary 
depending on several factors; see link below.

> #effective_cache_size = 1000

Increase this to the actual amount of RAM you have available, about 750MB (you 
do the math)

> #random_page_cost = 4

Leave this for now.   
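Taken together, Josh's suggestions amount to a postgresql.conf sketch like the 
following. These are the testing starting points given above, not tuned 
production values; the effective_cache_size figure assumes the default 8 KB 
page size (750 MB / 8 KB = 96000 pages):

```sql
shared_buffers = 6000          -- up from 1000; Windows may need OS tweaks
work_mem = 4096                -- in KB: 4 MB for testing, tune per workload
effective_cache_size = 96000   -- in 8 KB pages: roughly 750 MB of RAM
#random_page_cost = 4          -- leave at the default for now
```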

See www.powerpostgresql.com/PerfList for more information.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] name search query speed

2005-03-03 Thread John A Meinel
Jeremiah Jahn wrote:
I have about 5M names stored on my DB. Currently the searches are very
quick unless, they are on a very common last name ie. SMITH. The Index
is always used, but I still hit 10-20 seconds on a SMITH or Jones
search, and I average about 6 searches a second and max out at about
30/s. Any suggestions on how I could arrange things to make this search
quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
can increase this speed w/o a HW upgrade.
thanx,
-jj-

 

It sounds like the problem is just that you have a lot of rows that need 
to be returned. Can you just put a limit on the query? And then change 
the client app to recognize when the limit is reached, and either give a 
link to more results, refine the query, or something like that.
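A sketch of that idea, against a hypothetical names table: fetch one row more 
than the page size, so the client can tell whether a "more results" link is 
needed without counting everything.

```sql
-- Display 40 rows but fetch 41: if a 41st row comes back,
-- the client knows to render a "more results" link.
SELECT *
FROM names
WHERE last_name = 'SMITH'
ORDER BY last_name, first_name
LIMIT 41;
```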

John
=:->




Re: [PERFORM] name search query speed

2005-03-03 Thread Josh Berkus
Jeremiah,

> I have about 5M names stored on my DB. Currently the searches are very
> quick unless, they are on a very common last name ie. SMITH. The Index
> is always used, but I still hit 10-20 seconds on a SMITH or Jones
> search, and I average about 6 searches a second and max out at about
> 30/s. Any suggestions on how I could arrange things to make this search
> quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
> can increase this speed w/o a HW upgrade.

First off, see http://www.powerpostgresql.com/PerfList about your 
configuration settings.

The problem you're running into with SMITH is that, if your query is going to 
return a substantial number of rows (variable, but generally anything over 5% 
of the table and 1000 rows), PostgreSQL is not able to make effective use of 
an index.  This makes it fall back on a sequential scan, and based on your 
execution time, I'd guess that the table is a bit too large to fit in memory.

AFTER you've made the configuration changes above, AND run VACUUM ANALYZE on 
your database, if you're still having problems post an EXPLAIN ANALYZE of the 
query to this list.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] name search query speed

2005-03-03 Thread Markus Schaber
Hi, Jeremiah,

Jeremiah Jahn schrieb:
> yes, it does. I forgot to mention, that I also have clustering on that
> table by my name_field index. My Bad.

Fine. Did you run ANALYZE and CLUSTER on the table after every large
bunch of insertions / updates?
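For reference, that maintenance looks like this in 8.0. The index name is the 
one Jeremiah mentions; the table name here is hypothetical:

```sql
-- After a large batch of inserts/updates: re-sort the table along the
-- clustered index, then refresh the planner's statistics.
CLUSTER name_field ON names;  -- "names" is a hypothetical table name
ANALYZE names;
```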

Markus


-- 
Markus Schaber | Dipl. Informatiker | Software Development GIS

Fight against software patents in EU! http://ffii.org/
  http://nosoftwarepatents.org/



Re: [PERFORM] name search query speed

2005-03-03 Thread Jeremiah Jahn
On Thu, 2005-03-03 at 11:46 -0600, John A Meinel wrote:
> Jeremiah Jahn wrote:
> 
> >I have about 5M names stored on my DB. Currently the searches are very
> >quick unless, they are on a very common last name ie. SMITH. The Index
> >is always used, but I still hit 10-20 seconds on a SMITH or Jones
> >search, and I average about 6 searches a second and max out at about
> >30/s. Any suggestions on how I could arrange things to make this search
> >quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
> >can increase this speed w/o a HW upgrade.
> >
> >thanx,
> >-jj-
> It sounds like the problem is just that you have a lot of rows that need 
> to be returned. Can you just put a limit on the query? And then change 
> the client app to recognize when the limit is reached, and either give a 
> link to more results, or refine query, or something like that.
Not really. About 2% of the returned rows are thrown away for security
reasons, based on the current user, the security groups they belong to, and
different flags in the data itself. So the count for this is generated
on the fly, needed for pagination in the app, which reports the total
number of finds but only displays 40 of them. If anyone knows a way to
determine the total number of matches without needing to iterate through
them using JDBC, I'm all ears, as this would save me huge amounts of time
and limit/offset would become an option. 

> 
> John
> =:->
> 
-- 
"A power so great, it can only be used for Good or Evil!"
-- Firesign Theatre, "The Giant Rat of Summatra"




Re: [PERFORM] name search query speed

2005-03-03 Thread Jeremiah Jahn
On Thu, 2005-03-03 at 09:44 -0800, Josh Berkus wrote:
> Jeremiah,
> 
> > I have about 5M names stored on my DB. Currently the searches are very
> > quick unless, they are on a very common last name ie. SMITH. The Index
> > is always used, but I still hit 10-20 seconds on a SMITH or Jones
> > search, and I average about 6 searches a second and max out at about
> > 30/s. Any suggestions on how I could arrange things to make this search
> > quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
> > can increase this speed w/o a HW upgrade.
> 
> First off, see http://www.powerpostgresql.com/PerfList about your 
> configuration settings.
> 
> The problem you're running into with SMITH is that, if your query is going to 
> return a substantial number of rows (variable, but generally anything over 5% 
> of the table and 1000 rows) is not able to make effective use of an index.
> This makes it fall back on a sequential scan, and based on you execution 
> time, I'd guess that the table is a bit too large to fit in memory.
> 
> AFTER you've made the configuration changes above, AND run VACUUM ANALYZE on 
> your database, if you're still having problems post an EXPLAIN ANALYZE of the 
> query to this list.
> 

I.e. throw more hardware at it. All of the other things on the list,
except for effective_cache_size, have always been done. I bumped it up
from the default to 260. Will see if that makes a difference.

thanx,
-jj-


-- 
"A power so great, it can only be used for Good or Evil!"
-- Firesign Theatre, "The Giant Rat of Summatra"




Re: [PERFORM] name search query speed

2005-03-03 Thread Dave Held
> -Original Message-
> From: Jeremiah Jahn [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 03, 2005 2:15 PM
> To: John A Meinel
> Cc: postgres performance
> Subject: Re: [PERFORM] name search query speed
> 
> [...]
> So the count for this is generated on the fly needed for
> pagination in the app which expresses the total number of
> finds, but only displays 40 of them. If any one knows a way
> to determine the total number of matches without needing to 
> iterate through them using jdbc, I'm all ears as this would
> save me huge amounts of time and limit/offset would become
> an option. 

Is there a reason you can't do a count(field) query first?  If
so, you can get the number of records returned by calling
absolute(-1) on the JDBC ResultSet and getting the row number.
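Dave's first option, as a plain SQL sketch (table and column names are 
hypothetical): one cheap count up front, then the page query with 
limit/offset.

```sql
-- Total for pagination; the page itself is fetched separately
-- with LIMIT/OFFSET.
SELECT count(last_name)
FROM names
WHERE last_name = 'SMITH';
```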

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129



Re: [PERFORM] name search query speed

2005-03-03 Thread Jeremiah Jahn
Doesn't that cause two queries? I used to do it that way and cut my time
substantially by counting in-line. Even though the results were cached,
it still took more time. Also, since the table is constantly being updated,
the returned total would not always match the number of results on the
second query.

On Thu, 2005-03-03 at 14:26 -0600, Dave Held wrote:
> > -Original Message-
> > From: Jeremiah Jahn [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, March 03, 2005 2:15 PM
> > To: John A Meinel
> > Cc: postgres performance
> > Subject: Re: [PERFORM] name search query speed
> > 
> > [...]
> > So the count for this is generated on the fly needed for
> > pagination in the app which expresses the total number of
> > finds, but only displays 40 of them. If any one knows a way
> > to determine the total number of matches without needing to 
> > iterate through them using jdbc, I'm all ears as this would
> > save me huge amounts of time and limit/offset would become
> > an option. 
> 
> Is there a reason you can't do a count(field) query first?  If
> so, you can get the number of records returned by setting
> absolute(-1) and getting the row number.
-- 
"A power so great, it can only be used for Good or Evil!"
-- Firesign Theatre, "The Giant Rat of Summatra"




Re: [PERFORM] name search query speed

2005-03-03 Thread John Arbash Meinel
Jeremiah Jahn wrote:
On Thu, 2005-03-03 at 11:46 -0600, John A Meinel wrote:

...
Not really, about 2% of the returned rows are thrown away for security
reasons based on the current user, security groups they belong to and
different flags in the data itself. So the count for this is generated
on the fly needed for pagination in the app which expresses the total
number of finds, but only displays 40 of them. If any one knows a way to
determine the total number of matches without needing to iterate through
them using jdbc, I'm all ears as this would save me huge amounts of time
and limit/offset would become an option.

Well, what is wrong with "select count(*) from "?
Are you saying 2% are thrown away, or only 2% are kept?

Is this being done at the client side? Is there a way to incorporate the
security info into the database, so that the query actually only returns
the rows you care about? That seems like it would be a decent way to
speed it up, if you can restrict the number of rows that it needs to
look at.

There are other alternatives, such as materialized views, or temp
tables, where you select into the temp table the rows that the user
would request, and then you generate limit/offset from that. The first
query would be a little slow, since it would get all the rows, but all
subsequent accesses for that user could be really fast.

The other possibility is to do "limit 200", and then in your list of
pages, you could have:

1, 2, 3, 4, 5, ...

This means that you don't have to worry about getting 10,000 entries,
which probably isn't really useful for the user anyway, and you can
still break things into 40 entry pages, just 200 entries at a time.
John
=:->
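The temp-table approach John describes can be sketched like this (table and 
column names are hypothetical; the app's security filters would live in the 
WHERE clause):

```sql
-- Materialize the current user's full, security-filtered result set once.
CREATE TEMP TABLE user_results AS
SELECT n.*
FROM names n
WHERE n.last_name = 'SMITH';

-- The total for pagination...
SELECT count(*) FROM user_results;

-- ...and any page, cheaply, from the same snapshot of results.
SELECT * FROM user_results ORDER BY first_name LIMIT 40 OFFSET 0;
```

The first query pays the full cost; every page after that reads only the temp 
table.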






Re: [PERFORM] name search query speed

2005-03-03 Thread Markus Schaber
Hi, Jeremiah,

Jeremiah Jahn schrieb:
> doesn't that cause two queries? I used to do it that way and cut my time
> substantially by counting in-line. Even though the results were cached
> it still took more time.

This sounds rather strange.

> Also since the tables is constantly be updated
> the returned total would not always match the number of results on the
> second query.

Did you run both queries in the same transaction, with transaction
isolation level set to serializable? If yes, you found a serious bug in
PostgreSQL transaction engine.
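Markus's point, sketched: run the count and the page query in one serializable 
transaction so both see the same snapshot (table name hypothetical).

```sql
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Both statements see the same snapshot, so the count cannot drift
-- from the page query even under concurrent updates.
SELECT count(*) FROM names WHERE last_name = 'SMITH';
SELECT * FROM names WHERE last_name = 'SMITH' LIMIT 40 OFFSET 0;
COMMIT;
```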

Markus

-- 
Markus Schaber | Dipl. Informatiker | Software Development GIS

Fight against software patents in EU! http://ffii.org/
  http://nosoftwarepatents.org/



Re: [PERFORM] Help with tuning this query (with explain analyze finally)

2005-03-03 Thread Ken Egervari
Josh,
I did everything you said and my query does perform a bit better.  I've been 
getting speeds from 203 to 219 to 234 milliseconds now.  I tried increasing 
the work mem and the effective cache size from the values you provided, but 
I didn't see any more improvement.  I've tried looking into setting the 
shared buffers for Windows XP, but I'm not sure how to do it.  I'm looking 
in the manual at:
http://www.postgresql.org/docs/8.0/interactive/kernel-resources.html#SYSVIPC-PARAMETERS

It doesn't mention Windows at all.  Does anyone have any ideas on how to 
fix this?

Here is the new explain analyze.
Sort  (cost=1996.21..1996.26 rows=17 width=165) (actual time=297.000..297.000 rows=39 loops=1)
  Sort Key: ss.date
  ->  Merge Right Join  (cost=1951.26..1995.87 rows=17 width=165) (actual time=297.000..297.000 rows=39 loops=1)
        Merge Cond: ("outer".id = "inner".driver_id)
        ->  Index Scan using driver_pkey on driver d  (cost=0.00..42.16 rows=922 width=43) (actual time=0.000..0.000 rows=922 loops=1)
        ->  Sort  (cost=1951.26..1951.30 rows=17 width=122) (actual time=297.000..297.000 rows=39 loops=1)
              Sort Key: s.driver_id
              ->  Hash Join  (cost=586.48..1950.91 rows=17 width=122) (actual time=219.000..297.000 rows=39 loops=1)
                    Hash Cond: ("outer".carrier_code_id = "inner".id)
                    ->  Merge Join  (cost=571.97..1931.95 rows=830 width=87) (actual time=219.000..297.000 rows=310 loops=1)
                          Merge Cond: ("outer".current_status_id = "inner".id)
                          ->  Index Scan using shipment_current_status_id_idx on shipment s  (cost=0.00..2701.26 rows=60307 width=66) (actual time=0.000..62.000 rows=27711 loops=1)
                                Filter: (is_purged = false)
                          ->  Sort  (cost=571.97..576.38 rows=1766 width=21) (actual time=125.000..156.000 rows=6902 loops=1)
                                Sort Key: ss.id
                                ->  Hash Join  (cost=1.11..476.72 rows=1766 width=21) (actual time=0.000..93.000 rows=6902 loops=1)
                                      Hash Cond: ("outer".release_code_id = "inner".id)
                                      ->  Index Scan Backward using current_status_date_idx on shipment_status ss  (cost=0.00..387.35 rows=14122 width=21) (actual time=0.000..16.000 rows=14925 loops=1)
                                            Index Cond: (date >= (('now'::text)::date - 31))
                                      ->  Hash  (cost=1.10..1.10 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
                                            ->  Seq Scan on release_code rc  (cost=0.00..1.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
                                                  Filter: ((number)::text = '9'::text)
                    ->  Hash  (cost=14.51..14.51 rows=2 width=35) (actual time=0.000..0.000 rows=0 loops=1)
                          ->  Nested Loop  (cost=4.92..14.51 rows=2 width=35) (actual time=0.000..0.000 rows=2 loops=1)
                                ->  Index Scan using person_pkey on person p  (cost=0.00..5.73 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
                                      Index Cond: (id = 355)
                                ->  Hash Join  (cost=4.92..8.75 rows=2 width=39) (actual time=0.000..0.000 rows=2 loops=1)
                                      Hash Cond: ("outer".id = "inner".carrier_id)
                                      ->  Seq Scan on carrier c  (cost=0.00..3.54 rows=54 width=4) (actual time=0.000..0.000 rows=54 loops=1)
                                      ->  Hash  (cost=4.92..4.92 rows=2 width=43) (actual time=0.000..0.000 rows=0 loops=1)
                                            ->  Hash Join  (cost=3.04..4.92 rows=2 width=43) (actual time=0.000..0.000 rows=2 loops=1)
                                                  Hash Cond: ("outer".carrier_id = "inner".carrier_id)
                                                  ->  Seq Scan on carrier_code cc  (cost=0.00..1.57 rows=57 width=35) (actual time=0.000..0.000 rows=57 loops=1)
                                                  ->  Hash  (cost=3.04..3.04 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=1)
                                                        ->  Index Scan using carrier_to_person_person_id_idx on carrier_to_person ctp  (cost=0.00..3.04 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=1)
                                                              Index Cond: (355 = person_id)
Total runtime: 297.000 ms



Re: [PERFORM] Help with tuning this query (with explain analyze finally)

2005-03-03 Thread John Arbash Meinel
Ken Egervari wrote:
Josh,
I did everything you said and my query does perform a bit better.
I've been getting speeds from 203 to 219 to 234 milliseconds now.   I
tried increasing the work mem and the effective cache size from the
values you provided, but I didn't see any more improvement.  I've
tried to looking into setting the shared buffers for Windows XP, but
I'm not sure how to do it.  I'm looking in the manual at:
http://www.postgresql.org/docs/8.0/interactive/kernel-resources.html#SYSVIPC-PARAMETERS

You probably don't need to change anything for Windows. If you set
shared_buffers too high, then postgres won't start. If it is starting,
then you don't need to modify the OS to get more shared buffers. (For
instance, on my Mac, I can't get shared_buffers > 500 without changing
things, but on windows I run with 3000 and no modification).

It doesn't mention windows at all.  Does anyone have any ideas on have
to fix this?

Do you need the interior sort? It's taking ~93ms to get 7k rows from
shipment_status, and then another 30ms to sort them. This isn't a lot,
so it might be fine.

Also, did you ever try CLUSTER current_status_date_idx ON shipment_status.
This groups the rows in shipment_status by their status date, which
helps put items with the same date next to each other. This may affect
other portions of the query, or other queries. Also, if you are
inserting sequentially, it would seem that the items would already be
naturally near each other based on date.

The next big cost is having to merge the 28k rows with the fast hash
plan, which takes about 80ms.

I guess the biggest issue is that you are doing a lot of work, and it
takes time to do it. Also, I've noticed that this query is being run
with exactly the same data. Which is good to compare two methods. But
remember to test on multiple potential values. You might be better off
one way with this query, but much worse for a different dataset. I
noticed that this seems to have fewer rows than what postgres thinks the
*average* number would be. (It predicts 60k and you only get 28k rows).

If this query is performed a lot, and you can be okay with a slight
delay in updating, you could always switch to some sort of lazy
materialized view.

You could also always throw more hardware at it. :) If the
shipment_status is one of the bottlenecks, create a 4-disk raid10 and
move the table over. I don't remember what your hardware is, but I
don't remember it being a quad opteron with 16GB ram, and 20 15k SCSI
disks, with the transaction log on a solid state disk. :)

Why do you need the query to be 30ms? ~250ms is still pretty fast. If
you are needing updates faster than that, you might look more into *why*
and then handle it from a higher level.

And naturally, the most important thing is to test it under load. 250ms
is pretty good, but if under load it goes back to 6s, then we probably
should look for different alternatives. Also, what is the load that is
causing the problem? Is it that you have some other big seqscans which
are causing all of your tables to go out of cache?

Also, I believe I remember you saying that your production server is a
P4; is that a single P4? Because I know postgres prefers Opterons to
Pentium Xeons when in a multiprocessor machine. Look through the
archives about spinlocks and the context switch bug (context storm,
etc). Plus, since opterons are 64-bit, you can throw a lot more RAM at
them. I believe opterons outperform xeons for the same cost, *and* you
can scale them up with extra ram.

But remember, the biggest bottleneck is almost *always* the I/O. So put
more & faster disks into the system first.

John
=:->

Re: [PERFORM] name search query speed

2005-03-03 Thread William Yu
Jeremiah Jahn wrote:
I have about 5M names stored on my DB. Currently the searches are very
quick unless, they are on a very common last name ie. SMITH. The Index
is always used, but I still hit 10-20 seconds on a SMITH or Jones
search, and I average about 6 searches a second and max out at about
30/s. Any suggestions on how I could arrange things to make this search
quicker? I have 4gb of mem on a raid 5 w/ 3 drives. I'm hoping that I
can increase this speed w/o a HW upgrade.
If it's just "SMITH", the only fix is to throw more hardware at the 
problem. I've got my own database of medical providers & facilities in 
the millions and anytime somebody tries to search for MEDICAL FACILITY, 
it takes forever. I've tried every optimization possible but when you 
have 500K records with the word "MEDICAL" in it, what can you do? You've 
got to check all 500K records to see if it matches your criteria.

For multi-word searches, what I've found does work is to periodically 
generate stats on word frequencies and use those stats to search the 
least common words first. For example, if somebody enters "ALTABATES 
MEDICAL HOSPITAL", I can get the ~50 providers with ALTABATES in the 
name and then do a 2nd and 3rd pass to filter against MEDICAL and HOSPITAL.
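The multi-pass idea can be sketched in SQL along these lines (the stats table 
and all names here are hypothetical):

```sql
-- Periodically rebuilt table of word frequencies across provider names.
-- The application looks up each search term here and queries on the
-- rarest one first.
CREATE TABLE word_stats (word text PRIMARY KEY, freq integer);

-- Pass 1: the rarest term pulls back only the ~50 ALTABATES rows;
-- passes 2 and 3 then filter that small set against MEDICAL and
-- HOSPITAL in the application (or in a follow-up query).
SELECT * FROM providers WHERE name LIKE '%ALTABATES%';
```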



Re: [PERFORM] Help with tuning this query (with explain analyze finally)

2005-03-03 Thread Ken Egervari
Josh,
Thanks so much for your comments.  They are incredibly insightful and you 
clearly know your stuff.  It's so great that I'm able to learn so much from 
you.  I really appreciate it.

Do you need the interior sort? It's taking ~93ms to get 7k rows from
shipment_status, and then another 30ms to sort them. This isn't a lot,
so it might be fine.
Running the query without the sort doesn't actually improve performance, 
unfortunately, which I find strange.  I think the analyze is giving bad 
feedback, because taking all sorts out completely makes no difference in 
performance.  Dan Tow's book actually said the same thing... how sorting 
rarely takes up the bulk of the work.  I know you didn't like his book 
much, but I had observed that in my experience too.

Also, did you ever try CLUSTER current_status_date_idx ON shipment_status.
This groups the rows in shipment_status by their status date, which
helps put items with the same date next to eachother. This may effect
other portions of the query, or other queries. Also, if you are
inserting sequentially, it would seem that the items would already be
naturally near eachother based on date.
Yes, this was one of the first things I tried actually and it is currently 
clustered.  Since shipment status comes into our system at real time, the 
dates are more or less in order as well.

The next big cost is having to merge the 28k rows with the fast hash
plan, which takes about 80ms.
I guess the biggest issue is that you are doing a lot of work, and it
takes time to do it. Also, I've noticed that this query is being run
with exactly the same data. Which is good to compare two methods. But
remember to test on multiple potential values. You might be better off
one way with this query, but much worse for a different dataset. I
noticed that this seems to have fewer rows than what postgres thinks the
*average* number would be. (It predicts 60k and you only get 28k rows).
Well, the example where p.id = 355 is an above normal case where performance 
is typically bad.  If a user's company has very few shipments and 
shipment_status rows, performance isn't going to matter much and those 
queries usually perform much faster.  I really needed to tune this for the 
larger customers who do have thousands of rows for their entire company and 
will probably reach 6 digits by the end of next year.  For the person 355, 
they've only been on the system for 3 months and they already have 27700 
rows.  Even if this makes the smaller customers a bit slower, I think it's 
worth it if I can speed up cases like this, who all have very similar data 
distribution.

If this query is performed a lot, and you can be okay with a slight
delay in updating, you could always switch to some sort of lazy
materialized view.
I thought about this, but shipment and shipment_status are both updated in 
real time 24/7/365.  I think I might be able to cache it within the 
application for 60 seconds at most, but it would make little difference 
since people tend to refresh within that time anyway.  It's very important 
that real-time information exists though.

You could also always throw more hardware at it. :) If the
shipment_status is one of the bottlenecks, create a 4-disk raid10 and
move the table over.
I don't remember what your hardware is, but I don't remember it being a
quad opteron with 16GB ram, and 20 15k SCSI disks, with the transaction
log on a solid state disk. :)
That sounds like an awesome system.  I'd love to have something like that. 
Unfortunately, the production server is just a single processor machine with 
1 GB ram.  I think throwing more disks at it is probably the best bet, 
moving the shipment and shipment_status tables over as you suggested. 
That's great advice.

Why do you need the query to be 30ms? ~250ms is still pretty fast. If
you are needing updates faster than that, you might look more into *why*
and then handle it from a higher level.
30ms is a good target, although I guess I was naive for setting that goal 
perhaps.  I've just taken queries that ran at 600ms and with 1 or 2 indexes, 
they went down to 15ms.

Let's say we have 200 users signed into the application at the same time. 
The application refreshes their shipment information automatically to make 
sure it's up to date on the user's screen.  The application will execute the 
query we are trying to tune every 60 seconds for most of these users.  Users 
can set the refresh time to be higher, but 60 is the lowest amount so I'm 
just assuming everyone has it at 60.

Anyway, if you have 200 users logged in, that's 200 queries in the 60 second 
period, which is about 3-4 queries every second.  As you can see, it's 
getting maxed out, and because of bad luck, the queries are bunched together 
and are being called at the same time, making 8-9 queries in the same second 
and that's where the performance is starting to degrade.  I just know that 
if I could get this down to 30 ms, or even 100