Re: [PERFORM] Table partitioning problem

2011-03-10 Thread Samba GUEYE

Hi Jim, thanks for your answer.

The database model is roughly this:
Measure(Id, numbering, Date, crcCorrect, sensorId), SimpleMeasure(Id,
doubleValue) and GenericMeasure(Id, BlobValue, numberOfElements).
In the UML model, SimpleMeasure and GenericMeasure inherit from the
Measure class, so in the database the foreign keys of SimpleMeasure and
GenericMeasure point to the Measure table, which is partitioned by sensor.


Inserting a measure succeeds, but problems arise when inserting into the
SimpleMeasure table: the foreign key check cannot find the key that was
just inserted, because it only looks at the parent Measure table and does
not look at the partitioned child tables.


ERROR:  insert or update on table "simpleMeasure" violates foreign key constraint 
"fk_measure_id"
DETAIL:  Key(measure_id)=(1) is not present in table Measure


The inheritance is only used to set up PostgreSQL's partitioning, and the
limitation of the partitioning comes from there.
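
To make it concrete, here is a minimal sketch of the layout (names, types and
the routing of inserts into the per-sensor children are only illustrative, not
the real schema):

CREATE TABLE measure (
    id         bigint PRIMARY KEY,
    numbering  integer,
    date       timestamp,
    crccorrect boolean,
    sensor_id  integer
);

-- one child per sensor; a trigger or rule redirects inserts here
CREATE TABLE measure_sensor_1 (CHECK (sensor_id = 1)) INHERITS (measure);

CREATE TABLE simple_measure (
    id          bigint PRIMARY KEY,
    doublevalue double precision,
    measure_id  bigint REFERENCES measure (id)   -- the "fk_measure_id" constraint
);

-- with the routing in place, this row is physically stored in measure_sensor_1:
INSERT INTO measure VALUES (1, 1, now(), true, 1);

-- so this fails: the foreign key only checks the parent table "measure"
INSERT INTO simple_measure VALUES (1, 3.14, 1);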


The same problem is also reported in the following posts:

http://archives.postgresql.org/pgsql-performance/2008-07/msg00224.php 
and this

http://archives.postgresql.org/pgsql-admin/2007-09/msg00031.php

Best Regards


On 09/03/2011 23:01, Jim Nasby wrote:

On Mar 8, 2011, at 9:45 AM, Samba GUEYE wrote:

I have a problem with table partitioning because I have a foreign key applied
to the partitioned table, and it throws a constraint violation error during
inserts.
I saw in the manual (http://www.postgresql.org/docs/8.4/interactive/ddl-inherit.html,
caveats section) that this is a limitation of PostgreSQL table inheritance.
SELECT query performance is really bad without partitioning, so I'm looking for
a workaround to this foreign key problem, or another solution to improve
performance for large tables.

Actually, this sounds more like having a foreign key pointed at a parent table
in an inheritance tree, which flat-out doesn't do what you'd want.

Can you tell us what the foreign key constraint actually is, and what the 
inheritance setup for the tables in the FK is?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net







Re: [PERFORM] Performance trouble finding records through related records

2011-03-10 Thread sverhagen
Hi, all. I've done some further analysis and found a form that works if I split
things over two separate queries (B1 and B2, below), but I still have trouble
when combining them (B, below).

This is the full pseudo-query: SELECT FROM A UNION SELECT FROM B ORDER BY
dateTime DESC LIMIT 50
In that pseudo-query:
- A is fast (few ms). A is all events for the given customer
- B is slow (1 minute). B is all events for the same transactions as
all events for the given customer

Zooming in on B it looks originally as follows:

SELECT events2.id, events2.transactionId, events2.dateTime FROM
events_events events2
JOIN events_eventdetails details2_transKey
ON events2.id = details2_transKey.event_id
AND details2_transKey.keyname='transactionId'
JOIN events_eventdetails details2_transValue
ON substring(details2_transKey.value,0,32) =
substring(details2_transValue.value,0,32)
AND details2_transValue.keyname='transactionId'
JOIN events_eventdetails customerDetails
ON details2_transValue.event_id = customerDetails.event_id
AND customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='598124'
WHERE events2.eventtype_id IN (100,103,105,... et cetera ...) 


The above version of B is tremendously slow.

The only fast version I've yet come to find is as follows:
- Do a sub-query B1
- Do a sub-query B2 with the results of B1

B1 looks as follows:
Works very fast (few ms)
http://explain.depesz.com/s/7JS

SELECT substring(details2_transValue.value,0,32)
FROM events_eventdetails_customer_id customerDetails
JOIN only events_eventdetails details2_transValue
USING (event_id)
WHERE customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='49'
AND details2_transValue.keyname='transactionId'


B2 looks as follows:
Works very fast (few ms)
http://explain.depesz.com/s/jGO

SELECT events2.id, events2.dateTime
FROM events_events events2
JOIN events_eventdetails details2_transKey
ON events2.id = details2_transKey.event_id
AND details2_transKey.keyname='transactionId'
AND substring(details2_transKey.value,0,32) IN (... results of B1
...)
AND events2.eventtype_id IN
(100,103,105,106,45,34,14,87,58,78,7,76,11,25,57,98,30,35,33,49,52,28,85,59,23,22,51,48,36,65,66,18,13,86,75,44,38,43,94,56,95,96,71,50,81,90,89,16,17,88,79,77,68,97,92,67,72,53,2,10,31,32,80,24,93,26,9,8,61,5,73,70,63,20,60,40,41,39,101,104,107,99,64,62,55,69,19,46,47,15,21,27,54,12,102,108)

The combined version of B is slow again (3-10 seconds):
http://explain.depesz.com/s/9oM

SELECT events2.id, events2.dateTime
FROM events_events events2
JOIN events_eventdetails details2_transKey
ON events2.id = details2_transKey.event_id
AND details2_transKey.keyname='transactionId'
AND substring(details2_transKey.value,0,32) IN (
SELECT substring(details2_transValue.value,0,32)
FROM events_eventdetails_customer_id customerDetails
JOIN only events_eventdetails details2_transValue
USING (event_id)
WHERE customerDetails.keyname='customer_id'
AND substring(customerDetails.value,0,32)='49'
AND details2_transValue.keyname='transactionId')
AND events2.eventtype_id IN
(100,103,105,106,45,34,14,87,58,78,7,76,11,25,57,98,30,35,33,49,52,28,85,59,23,22,51,48,36,65,66,18,13,86,75,44,38,43,94,56,95,96,71,50,81,90,89,16,17,88,79,77,68,97,92,67,72,53,2,10,31,32,80,24,93,26,9,8,61,5,73,70,63,20,60,40,41,39,101,104,107,99,64,62,55,69,19,46,47,15,21,27,54,12,102,108)

At the moment I see no other conclusion than to offer B1 and B2 to the
database separately, but it feels like defeat :-|
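
One more thing I still want to try (just a sketch, not yet measured): materialize
B1 in a CTE, since WITH acts as an optimization fence in 8.4/9.0, so the planner
should not be able to flatten it back into B2:

WITH b1 AS (
    SELECT substring(details2_transValue.value,0,32) AS trans_key
    FROM events_eventdetails_customer_id customerDetails
    JOIN only events_eventdetails details2_transValue USING (event_id)
    WHERE customerDetails.keyname='customer_id'
      AND substring(customerDetails.value,0,32)='49'
      AND details2_transValue.keyname='transactionId'
)
SELECT events2.id, events2.dateTime
FROM events_events events2
JOIN events_eventdetails details2_transKey
  ON events2.id = details2_transKey.event_id
 AND details2_transKey.keyname='transactionId'
 AND substring(details2_transKey.value,0,32) IN (SELECT trans_key FROM b1)
 AND events2.eventtype_id IN (100,103,105,... same list as above ...);

A temporary table that is filled with B1 and ANALYZEd before running B2 would be
the next thing to try if the CTE doesn't help.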





Re: [PERFORM] NULLS LAST performance

2011-03-10 Thread Robert Haas
On Wed, Mar 9, 2011 at 6:01 PM, Jim Nasby  wrote:
> Unfortunately, I don't think the planner actually has that level of knowledge.

Actually, I don't think it would be that hard to teach the planner
about that special case...

> A more reasonable fix might be to teach the executor that it can do 2 scans 
> of the index: one to get non-null data and a second to get null data. I don't 
> know if the use case is prevalent enough to warrant the extra code though.

That would probably be harder, but useful.  I thought about working on
it before but got sidetracked onto other things.
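
In the meantime the two scans can be emulated by hand at the SQL level; a rough
sketch (table and column names invented, and whether each branch actually gets a
cheap index scan depends on the index definition and server version):

-- emulate ORDER BY x DESC NULLS LAST LIMIT 50 with a plain index on (x)
SELECT *
FROM (
    (SELECT * FROM t WHERE x IS NOT NULL ORDER BY x DESC LIMIT 50)
    UNION ALL
    (SELECT * FROM t WHERE x IS NULL LIMIT 50)
) s
ORDER BY x DESC NULLS LAST
LIMIT 50;

The outer sort only ever sees at most 100 rows, so it is cheap.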

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it currently is only using a little bit of memory
and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled
back, that wouldn't be the end of the world.  So I don't know whether I can get
away with "dangerous" settings for WAL, checkpoints, etc.  There also aren't a
lot of concurrent hits on the DB, though there are a few.

I am loathe to create a new table from a select, since the indexes themselves
take a really long time to build.

As the title alludes, I will also be doing GROUP BY's on the data, and would
love to speed these up, mostly just for my own impatience...  






Re: [PERFORM] NULLS LAST performance

2011-03-10 Thread Merlin Moncure
On Thu, Mar 10, 2011 at 9:55 AM, Robert Haas  wrote:
> On Wed, Mar 9, 2011 at 6:01 PM, Jim Nasby  wrote:
>> Unfortunately, I don't think the planner actually has that level of 
>> knowledge.
>
> Actually, I don't think it would be that hard to teach the planner
> about that special case...
>
>> A more reasonable fix might be to teach the executor that it can do 2 scans 
>> of the index: one to get non-null data and a second to get null data. I 
>> don't know if the use case is prevalent enough to warrant the extra code 
>> though.
>
> That would probably be harder, but useful.  I thought about working on
> it before but got sidetracked onto other things.

ISTM this isn't all that different from the case of composite indexes
where you are missing the leftmost term, or where you have an index on
a,b,c (which the server already handles) but the user asks for a, b desc,
c. If cardinality on b is low it might pay to loop and break up the
scan.

merlin



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread Merlin Moncure
On Thu, Mar 10, 2011 at 9:40 AM, fork  wrote:
> Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
> row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
> is never going to be that fast, what should one do to make it faster?
>
> I set work_mem to 2048MB, but it currently is only using a little bit of 
> memory
> and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
> it was using 70% of the memory).
>
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldnt be the end of the world.  So I don't know if I can use
> "dangerous" setting for WAL checkpoints etc.   There are also aren't a lot of
> concurrent hits on the DB, though a few.
>
> I am loathe to create a new table from a select, since the indexes themselves
> take a really long time to build.

You are aware that updating the field for the entire table, especially
if there is an index on it (or on any field being updated), will cause
entries in all your indexes to be rewritten anyway?  When you update a
record, it gets a new position in the table, and a new index entry with
that position.  Insert/select to a temp table + truncate + insert/select
back is usually going to be faster and will save you the reindex/cluster.
OTOH, if you have foreign keys it can be a headache.
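
Roughly like this (an untested sketch with made-up names; it assumes the
computed column already exists and nothing references the table):

BEGIN;
CREATE TEMP TABLE pums_tmp AS
    SELECT id, bar, baz, bar || ' ' || baz AS foo   -- plus the other ~100 columns
    FROM pums;
TRUNCATE pums;
INSERT INTO pums (id, bar, baz, foo)
    SELECT id, bar, baz, foo FROM pums_tmp;
COMMIT;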

> As the title alludes, I will also be doing GROUP BY's on the data, and would
> love to speed these up, mostly just for my own impatience...

need to see the query here to see if you can make them go faster.

merlin



[PERFORM] Basic performance tuning on dedicated server

2011-03-10 Thread runner

 

I'm setting up my first PostgreSQL server to replace an existing MySQL server. 
I've been reading Gregory Smith's book PostgreSQL 9.0 High Performance and also 
Riggs/Krosing's PostgreSQL 9 Administration Cookbook.  While both of these 
books are excellent, I am completely new to PostgreSQL and I cannot possibly 
read and understand every aspect of tuning in the short amount of time before I 
have to have this server running.

I started out by using the 11 step process for tuning a new dedicated server 
(page 145 in Gregory Smith's book) but I found I had more questions than I 
could get answered in the short amount of time I have.  So, plan B is to use 
pgtune to get a ballpark configuration and then fine tune later as I learn more.

I ran some performance tests where I imported my 11Gb database from our old 
MySQL server into PostgreSQL 9.0.3.  In my testing I left the postgresql.conf 
at default values.  The PostgreSQL test database completely blew away the old 
MySQL server in performance.  Again, the postgresql.conf was never optimized so 
I feel I will be OK if I just get in the ballpark with tuning the 
postgresql.conf file.

I'd like to run my plan by you guys to see if it seems sane and make sure I'm 
not leaving out anything major. 

I'll be running PostgreSQL 9.0.3 on a Solaris 10 64-bit (SPARC) box with 16G of 
RAM.  The local file system is ZFS.  The database file systems are UFS and 
are SAN mounted from VERY fast disks with battery-backed write cache.  I don't 
know anybody else who is running a mix of ZFS and UFS file systems, and I cannot 
change this.  ZFS has its own file system cache, so I'm concerned about the 
ramifications of having caches for both ZFS and UFS.  The only database-related 
files that are stored on the local ZFS file system are the PostgreSQL binaries 
and the system logs.

From the extensive reading I've done, it seems generally accepted to set the 
UFS file system cache to use 50% of the system RAM.  That leaves 8G for 
PostgreSQL.  Well, not really 8G; I've reserved 1G for system use, which 
leaves me with 7G for PostgreSQL to use.  I ran pgtune and specified 7G as the 
memory ( 7 * 1024 * 1024 = 7340032 ) and 300 connections.  The resulting 
postgresql.conf is what I plan to use.

After reading Gregory Smith's book, I've decided to put the database on one UFS 
file system, the WAL on a separate UFS file system (mounted with forcedirectio) 
and the archive logs on yet another UFS file system.  I'll be on Solaris 10 so 
I've set wal_sync_method = fdatasync based on recommendations from other 
Solaris users.  Did a lot of google searches on wal_sync_method and Solaris.
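
For reference, the only line I am changing by hand on top of the pgtune output
comes down to this (illustrative excerpt):

# postgresql.conf
wal_sync_method = fdatasync   # Solaris 10, per recommendations from other Solaris users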

That's what I plan to go live with in a few days.  Since my test server with 
default configs already blows away the old database server, I think I can get 
away with this strategy.  Time is not on my side.

I originally installed the 32-bit PostgreSQL binaries but later switched to the 
64-bit binaries.  I've read the 32-bit version is faster and uses less memory than 
the 64-bit version.  At this point I'm assuming I need the 64-bit binaries in 
order to take full advantage of the 7G of RAM I have allocated to PostgreSQL.  
If I am wrong here please let me know.

This has been a lot of information to cram down in the short amount of time 
I've had to deal with this project.  I'm going to have to go back and read the 
PostgreSQL 9.0 High Performance book two or three more times and really dig in 
to the details but for now I'm going to cheat and use pgtune as described 
above.  Thank you in advance for any advice or additional tips you may be able 
to provide.  

Rick






[PERFORM] unexpected stable function behavior

2011-03-10 Thread Julius Tuskenis

Hello, list

Our company is creating a ticketing system. Of course the performance 
issues are very important to us (as to all of you, I guess). To increase the 
speed of some queries, stable functions are used, but somehow they don't 
act exactly as I expect, so would you please explain what I am doing (or 
expecting) wrong...


First of all, I have the stable function, which runs fast; I have no 
problems with it at all:
CREATE OR REPLACE FUNCTION web_select_extra_price(prm_price_id integer, 
prm_event_id integer, prm_cashier_id integer)

  RETURNS numeric AS
'
 some code here
'
  LANGUAGE plpgsql STABLE
  COST 100;

Now the test:

1) query without using the function
explain analyze
  SELECT thtp_tick_id, price_id,
price_price,
price_color
  FROM ticket_price
JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND 
current_timestamp <= price_date_till)

  ORDER BY price_id;

Result:
"Sort  (cost=132.47..133.77 rows=518 width=25) (actual time=5.125..5.842 
rows=4335 loops=1)"

"  Sort Key: ticket_price.price_id"
"  Sort Method:  quicksort  Memory: 433kB"
"  ->  Nested Loop  (cost=0.00..109.12 rows=518 width=25) (actual 
time=0.037..3.148 rows=4335 loops=1)"
"->  Index Scan using index_price_event_id on ticket_price  
(cost=0.00..8.52 rows=2 width=21) (actual time=0.014..0.026 rows=7 loops=1)"

"  Index Cond: (price_event_id = 7820)"
"  Filter: ((now() >= price_date) AND (now() <= 
price_date_till))"
"->  Index Scan using idx_thtp_price_id on 
ticket_has_ticket_price  (cost=0.00..47.06 rows=259 width=8) (actual 
time=0.013..0.211 rows=619 loops=7)"
"  Index Cond: (ticket_has_ticket_price.thtp_price_id = 
ticket_price.price_id)"

"Total runtime: 6.425 ms"


2) Query using the function
explain analyze
  SELECT thtp_tick_id, price_id,
price_price, web_select_extra_price(price_id, price_event_id, 1),
price_color
  FROM ticket_price
JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND 
current_timestamp <= price_date_till)

  ORDER BY price_id;

Result:
"Sort  (cost=261.97..263.27 rows=518 width=29) (actual 
time=704.224..704.927 rows=4335 loops=1)"

"  Sort Key: ticket_price.price_id"
"  Sort Method:  quicksort  Memory: 433kB"
"  ->  Nested Loop  (cost=0.00..238.62 rows=518 width=29) (actual 
time=0.272..699.073 rows=4335 loops=1)"
"->  Index Scan using index_price_event_id on ticket_price  
(cost=0.00..8.52 rows=2 width=25) (actual time=0.011..0.052 rows=7 loops=1)"

"  Index Cond: (price_event_id = 7820)"
"  Filter: ((now() >= price_date) AND (now() <= 
price_date_till))"
"->  Index Scan using idx_thtp_price_id on 
ticket_has_ticket_price  (cost=0.00..47.06 rows=259 width=8) (actual 
time=0.017..0.582 rows=619 loops=7)"
"  Index Cond: (ticket_has_ticket_price.thtp_price_id = 
ticket_price.price_id)"

"Total runtime: 705.531 ms"


Now, what you might think is that executing web_select_extra_price accounts for 
the difference, but:
3) As a STABLE function should be executed once for every distinct set of 
parameters, I do:

SELECT web_select_extra_price(price_id, 7820, 1) FROM (

  SELECT distinct price_id
  FROM ticket_price
JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND 
current_timestamp <= price_date_till)

 ) as qq;

Result:
"Subquery Scan on qq  (cost=110.34..110.88 rows=2 width=4) (actual 
time=7.265..8.907 rows=7 loops=1)"
"  ->  HashAggregate  (cost=110.34..110.36 rows=2 width=4) (actual 
time=6.866..6.873 rows=7 loops=1)"
"->  Nested Loop  (cost=0.00..109.05 rows=517 width=4) (actual 
time=0.037..4.643 rows=4335 loops=1)"
"  ->  Index Scan using index_price_event_id on 
ticket_price  (cost=0.00..8.52 rows=2 width=4) (actual time=0.014..0.038 
rows=7 loops=1)"

"Index Cond: (price_event_id = 7820)"
"Filter: ((now() >= price_date) AND (now() <= 
price_date_till))"
"  ->  Index Scan using idx_thtp_price_id on 
ticket_has_ticket_price  (cost=0.00..47.04 rows=258 width=4) (actual 
time=0.019..0.336 rows=619 loops=7)"
"Index Cond: (ticket_has_ticket_price.thtp_price_id 
= ticket_price.price_id)"

"Total runtime: 8.966 ms"


You can see the query has only 7 distinct parameter sets to pass to the 
function but...

4)   Explain analyze
  SELECT web_select_extra_price(price_id, 7820, 1)
  FROM ticket_price
JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND 
current_timestamp <= price_date_till)


Result:
"Nested Loop  (cost=0.00..238.30 rows=517 width=4) (actual 
time=0.365..808.537 rows=4335 loops=1)"
"  ->  Index Scan using index_price_event_id on ticket_price  
(cost=0.00..8.52 rows=2 w

Re: [PERFORM] Basic performance tuning on dedicated server

2011-03-10 Thread Merlin Moncure
On Thu, Mar 10, 2011 at 3:12 AM, runner  wrote:
>
> I'm setting up my first PostgreSQL server to replace an existing MySQL
> server.  I've been reading Gregory Smith's book Postgres 9.0 High
> Performance and also Riggs/Krosing's PostgreSQL 9 Administration Cookbook.
> While both of these books are excellent, I am completely new to PostgreSQL
> and I cannot possibly read and understand every aspect of tuning in the
> short amount time before I have to have this server running.
>
> I started out by using the 11 step process for tuning a new dedicated server
> (page 145 in Gregory Smith's book) but I found I had more questions than I
> could get answered in the short amount of time I have.  So, plan B is to use
> pgtune to get a ballpark configuration and then fine tune later as I learn
> more.
>
> I ran some performance tests where I imported my 11Gb database from our old
> MySQL server into PostgreSQL 9.0.3.  In my testing I left the
> postgresql.conf at default values.  The PostgreSQL test database completely
> blew away the old MySQL server in performance.  Again, the postgresql.conf
> was never optimized so I feel I will be OK if I just get in the ballpark
> with tuning the postgresql.conf file.
>
> I'd like to run my plan by you guys to see if it seems sane and make sure
> I'm not leaving out anything major.
>
> I'll be running PostgreSQL 9.0.3 on a Solaris 10 64 bit (Sparc) box with 16G
> of RAM.    The local file system is ZFS.  The database file systems are UFS
> and are SAN mounted from VERY fast disks with battery backed write cache.  I
> don't know anybody else who is running a mix of ZFS and UFS file systems,  I
> cannot change this.  ZFS has it's own file system cache so I'm concerned
> about the ramifications of having caches for both ZFS and UFS. The only
> database related files that are stored on the local ZFS file system are the
> PostgreSQL binaries and the system logs.
>
> From the extensive reading I've done, it seems generally accepted to set the
> UFS file system cache to use 50% of the system RAM.  That leaves 8G left for
> PostgreSQL.  Well, not really 8G,  I've reserved 1G for system use which
> leaves me with 7G for PostgreSQL to use.  I ran pgtune and specified 7G as
> the memory ( 7 * 1024 * 1024 = 7340032 ) and 300 connections.  The resulting
> postgresql.conf is what I plan to use.
>
> After reading Gregory Smith's book, I've decided to put the database on one
> UFS file system, the WAL on a separate UFS file system (mounted with
> forcedirectio) and the archive logs on yet another UFS file system.  I'll be
> on Solaris 10 so I've set wal_sync_method = fdatasync based on
> recommendations from other Solaris users.  Did a lot of google searches on
> wal_sync_method and Solaris.
>
> That's what I plan to go live with in a few days.  Since my test server with
> default configs already blows away the old database server, I think I can
> get away with this strategy.  Time is not on my side.
>
> I originally installed the 32 bit PostgreSQL binaries but later switched to
> 64 bit binaries.  I've read the 32 bit version is faster and uses less
> memory than the 64 bit version.  At this point I'm assuming I need the 64
> bit binaries in order to take full advantage the the 7G of RAM I have
> allocated to PostgreSQL.  If I am wrong here please let me know.
>
> This has been a lot of information to cram down in the short amount of time
> I've had to deal with this project.  I'm going to have to go back and read
> the PostgreSQL 9.0 High Performance book two or three more times and really
> dig in to the details but for now I'm going to cheat and use pgtune as
> described above.  Thank you in advance for any advice or additional tips you
> may be able to provide.


congratulations!

Postgres memory tuning is a complicated topic, but most of it tends to be
very subtle in its effects or applies to specific situations, like
dealing with i/o storms during checkpoints.  The only settings that
often need to be cranked immediately out of the box are
maintenance_work_mem and (much more carefully) work_mem.

Regardless of how shared_buffers is set, ALL of your server's memory goes
to postgres, less what the o/s keeps for itself and other applications.
You do not allocate memory to postgres -- you only suggest how it
might be used.  I stopped obsessing over how it was set years ago.  In
fact, on linux, for example, dealing with background dirty page flushing
via the o/s (because stock settings can cause i/o storms) is a bigger
deal than shared_buffers by about an order of magnitude, imnsho.

The non-memory settings of postgresql.conf, such as the planner
settings (join/from collapse limit, random_page_cost, etc.) and the i/o
settings (fsync, wal_sync_method, etc.), are typically much more
important for performance than how memory is set up.
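
For example, something in this neighborhood (the numbers are only illustrative
starting points, not recommendations for your hardware):

-- session-level, before big maintenance jobs:
SET maintenance_work_mem = '1GB';   -- index builds, VACUUM
-- work_mem is per sort/hash node per backend, so be conservative with 300 connections:
SET work_mem = '16MB';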

The reason postgres is showing up mysql is almost certainly due to the
query planner and/or (if you were using myisam) reaping the benefits
of mvcc.  My knowledge of mysql stops at

Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Merlin Moncure  gmail.com> writes:

> > I am loathe to create a new table from a select, since the indexes 
> > themselves
> > take a really long time to build.
> 
> you are aware that updating the field for the entire table, especially
> if there is an index on it (or any field being updated), will cause
> all your indexes to be rebuilt anyways?  when you update a record, it
> gets a new position in the table, and a new index entry with that
> position.  
> insert/select to temp, + truncate + insert/select back is
> usually going to be faster and will save you the reindex/cluster.
> otoh, if you have foreign keys it can be a headache.

Hmph.  I guess I will have to find a way to automate it, since there will be a
lot of times I want to do this. 

> > As the title alludes, I will also be doing GROUP BY's on the data, and would
> > love to speed these up, mostly just for my own impatience...
> 
> need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is
really only for batch analysis versus transaction processing.  Like "put all
your temp tables on a different disk" or whatever.  I will post specifics later.




Re: [PERFORM] unexpected stable function behavior

2011-03-10 Thread Merlin Moncure
On Thu, Mar 10, 2011 at 10:26 AM, Julius Tuskenis  wrote:
> Hello, list
>
> Our company is creating a ticketing system. Of course the performance issues
> are very important to us (as to all of you I guess). To increase speed of
> some queries stable functions are used, but somehow they don't act exactly
> as I expect, so would you please explain what am I doing (or expecting)
> wrong...
>
> First of all I have the stable function witch runs fast and I have no
> problems with it at all.
> CREATE OR REPLACE FUNCTION web_select_extra_price(prm_price_id integer,
> prm_event_id integer, prm_cashier_id integer)
>  RETURNS numeric AS
> '
>  some code here
> '
>  LANGUAGE plpgsql STABLE
>  COST 100;
>
> Now the test:
>
> 1) query without using the function
> explain analyze
>  SELECT thtp_tick_id, price_id,
>    price_price,
>    price_color
>  FROM ticket_price
>    JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
>  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND
> current_timestamp <= price_date_till)
>  ORDER BY price_id;
>
> Result:
> "Sort  (cost=132.47..133.77 rows=518 width=25) (actual time=5.125..5.842
> rows=4335 loops=1)"
> "  Sort Key: ticket_price.price_id"
> "  Sort Method:  quicksort  Memory: 433kB"
> "  ->  Nested Loop  (cost=0.00..109.12 rows=518 width=25) (actual
> time=0.037..3.148 rows=4335 loops=1)"
> "        ->  Index Scan using index_price_event_id on ticket_price
>  (cost=0.00..8.52 rows=2 width=21) (actual time=0.014..0.026 rows=7
> loops=1)"
> "              Index Cond: (price_event_id = 7820)"
> "              Filter: ((now() >= price_date) AND (now() <=
> price_date_till))"
> "        ->  Index Scan using idx_thtp_price_id on ticket_has_ticket_price
>  (cost=0.00..47.06 rows=259 width=8) (actual time=0.013..0.211 rows=619
> loops=7)"
> "              Index Cond: (ticket_has_ticket_price.thtp_price_id =
> ticket_price.price_id)"
> "Total runtime: 6.425 ms"
>
>
> 2) Query using the function
> explain analyze
>  SELECT thtp_tick_id, price_id,
>    price_price, web_select_extra_price(price_id, price_event_id, 1),
>    price_color
>  FROM ticket_price
>    JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
>  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND
> current_timestamp <= price_date_till)
>  ORDER BY price_id;
>
> Result:
> "Sort  (cost=261.97..263.27 rows=518 width=29) (actual time=704.224..704.927
> rows=4335 loops=1)"
> "  Sort Key: ticket_price.price_id"
> "  Sort Method:  quicksort  Memory: 433kB"
> "  ->  Nested Loop  (cost=0.00..238.62 rows=518 width=29) (actual
> time=0.272..699.073 rows=4335 loops=1)"
> "        ->  Index Scan using index_price_event_id on ticket_price
>  (cost=0.00..8.52 rows=2 width=25) (actual time=0.011..0.052 rows=7
> loops=1)"
> "              Index Cond: (price_event_id = 7820)"
> "              Filter: ((now() >= price_date) AND (now() <=
> price_date_till))"
> "        ->  Index Scan using idx_thtp_price_id on ticket_has_ticket_price
>  (cost=0.00..47.06 rows=259 width=8) (actual time=0.017..0.582 rows=619
> loops=7)"
> "              Index Cond: (ticket_has_ticket_price.thtp_price_id =
> ticket_price.price_id)"
> "Total runtime: 705.531 ms"
>
>
> Now what you can think is that executing web_select_extra_price takes the
> difference, but
> 3) As STABLE function should be executed once for every different set of
> parameters I do
> SELECT web_select_extra_price(price_id, 7820, 1) FROM (
>
>  SELECT distinct price_id
>  FROM ticket_price
>    JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
>  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND
> current_timestamp <= price_date_till)
>  ) as qq;
>
> Result:
> "Subquery Scan on qq  (cost=110.34..110.88 rows=2 width=4) (actual
> time=7.265..8.907 rows=7 loops=1)"
> "  ->  HashAggregate  (cost=110.34..110.36 rows=2 width=4) (actual
> time=6.866..6.873 rows=7 loops=1)"
> "        ->  Nested Loop  (cost=0.00..109.05 rows=517 width=4) (actual
> time=0.037..4.643 rows=4335 loops=1)"
> "              ->  Index Scan using index_price_event_id on ticket_price
>  (cost=0.00..8.52 rows=2 width=4) (actual time=0.014..0.038 rows=7 loops=1)"
> "                    Index Cond: (price_event_id = 7820)"
> "                    Filter: ((now() >= price_date) AND (now() <=
> price_date_till))"
> "              ->  Index Scan using idx_thtp_price_id on
> ticket_has_ticket_price  (cost=0.00..47.04 rows=258 width=4) (actual
> time=0.019..0.336 rows=619 loops=7)"
> "                    Index Cond: (ticket_has_ticket_price.thtp_price_id =
> ticket_price.price_id)"
> "Total runtime: 8.966 ms"
>
>
> You can see the query has only 7 distinct parameter sets to pass to the
> function but...
> 4)   Explain analyze
>  SELECT web_select_extra_price(price_id, 7820, 1)
>  FROM ticket_price
>    JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
>  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND
> current_timestamp

[PERFORM] big joins not converging

2011-03-10 Thread Dan Ancona

Hi postgressers -

As part of my work with voter file data, I pretty regularly have to  
join one large-ish (over 500k rows) table to another. Sometimes this  
is via a text field (countyname) + integer (voter id). I've noticed  
sometimes this converges and sometimes it doesn't, seemingly  
regardless of how I index things. So I'm looking for general thoughts  
on the joining of large tables, but also running into a specific issue  
with the following slightly different query:


This one is between two tables that are a 754k row list of voters and  
a 445k row list of property owners. (I'm trying to find records where  
the owner name matches the voter name at the same address.) I have  
btree single column indices built on all the relevant fields, and  
multicolumn indices built across all the columns I'm matching. The  
full schemas of both tables are below. The machine is an older-ish (3  
years ago) dual-core pentium w/ 4GB RAM running FreeBSD, more details  
below.


This is the query I've come up with so far:

explain analyze
update vanalameda set ownerflag = 'exact'
  from aralameda where
  vanalameda.streetno ~~ aralameda.streetnum and
  vanalameda.streetname ~~ aralameda.streetname and
  vanalameda.lastname ~~ aralameda.ownername and
  vanalameda.firstname ~~ aralameda.ownername;

If I include the analyze, this didn't complete after running  
overnight. If I drop the analyze and just explain, I get this:


"Nested Loop  (cost=46690.74..15384448712.74 rows=204 width=204)"
"  Join Filter: (((vanalameda.streetno)::text ~~  
(aralameda.streetnum)::text) AND ((vanalameda.streetname)::text ~~  
(aralameda.streetname)::text) AND ((vanalameda.lastname)::text ~~  
(aralameda.ownername)::text) AND ((vanalameda.firstname)::text ~~  
(aralameda.ownername)::text))"
"  ->  Seq Scan on vanalameda  (cost=0.00..26597.80 rows=734780  
width=204)"

"  ->  Materialize  (cost=46690.74..58735.87 rows=444613 width=113)"
"->  Seq Scan on aralameda  (cost=0.00..38647.13 rows=444613  
width=113)"


One general question: does the width of the tables (i.e. the numbers  
of columns not being joined and the size of those fields) matter? The  
tables do have a lot of extra columns that I could slice out.


Thanks so much!

Dan

System:
client: pgadmin III, Mac OS

server:
select version();
PostgreSQL 8.3.7 on i386-portbld-freebsd7.2, compiled by GCC cc (GCC)  
4.2.1 20070719  [FreeBSD]

(installed from freebsd package system, default configuration)

%sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: i386
hw.model: Genuine Intel(R) CPU2160  @ 1.80GHz
hw.ncpu: 2
hw.machine_arch: i386

w/ 4GB RAM, 1 1GB disk, no RAID.

Here's the tables...

  Table "public.aralameda"
 Column  | Type  | Modifiers
-+---+---
 dt000o039001010 | character varying(13) |
 o3901010| character varying(15) |
 dt17| character varying(2)  |
 dt046   | character varying(3)  |
 streetnum   | character varying(10) |
 streetname  | character varying(50) |
 unitnum | character varying(10) |
 city| character varying(30) |
 zip | character varying(5)  |
 unk3| character varying(1)  |
 crap1   | character varying(12) |
 crap2   | character varying(12) |
 crap3   | character varying(12) |
 crap4   | character varying(12) |
 crap5   | character varying(12) |
 crap6   | character varying(12) |
 crap7   | character varying(12) |
 crap8   | character varying(12) |
 crap9   | character varying(12) |
 crap10  | character varying(12) |
 dt2009  | character varying(4)  |
 dt066114| character varying(6)  |
 crap11  | character varying(8)  |
 crap12  | character varying(8)  |
 ownername   | character varying(50) |
 careofname  | character varying(50) |
 unk4| character varying(1)  |
 maddr1  | character varying(60) |
 munitnum| character varying(10) |
 mcitystate  | character varying(30) |
 mzip| character varying(5)  |
 mplus4  | character varying(4)  |
 dt40| character varying(2)  |
 dt4 | character varying(1)  |
 crap13  | character varying(8)  |
 d   | character varying(1)  |
 dt0500  | character varying(4)  |
 unk6| character varying(1)  |
 crap14  | character varying(8)  |
 unk7| character varying(1)  |
Indexes:
"arall" btree (streetnum, streetname, ownername)
"aroname" btree (ownername)
"arstreetname" btree (streetname)
"arstreetnum" btree (streetnum)

 Table "public.vanalameda"
Column | Type  | Modifiers
---+---+---
 vanid | character varying(8)  |
 lastname  | character varying(25) |
 firstname | character varying(16) |

Re: [PERFORM] big joins not converging

2011-03-10 Thread Steve Atkins

On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:

> Hi postgressers -
> 
> As part of my work with voter file data, I pretty regularly have to join one 
> large-ish (over 500k rows) table to another. Sometimes this is via a text 
> field (countyname) + integer (voter id). I've noticed sometimes this 
> converges and sometimes it doesn't, seemingly regardless of how I index 
> things. So I'm looking for general thoughts on the joining of large tables, 
> but also running into a specific issue with the following slightly different 
> query:
> 
> This one is between two tables that are a 754k row list of voters and a 445k 
> row list of property owners. (I'm trying to find records where the owner name 
> matches the voter name at the same address.) I have btree single column 
> indices built on all the relevant fields, and multicolumn indices built 
> across all the columns I'm matching. The full schemas of both tables are 
> below. The machine is an older-ish (3 years ago) dual-core pentium w/ 4GB RAM 
> running FreeBSD, more details below.
> 
> This is the query I've come up with so far:
> 
> explain analyze
> update vanalameda set ownerflag = 'exact'
>   from aralameda where
>   vanalameda.streetno ~~ aralameda.streetnum and
>   vanalameda.streetname ~~ aralameda.streetname and
>   vanalameda.lastname ~~ aralameda.ownername and
>   vanalameda.firstname ~~ aralameda.ownername;
> 
> If I include the analyze, this didn't complete after running overnight. If I 
> drop the analyze and just explain, I get this:
> 
> "Nested Loop  (cost=46690.74..15384448712.74 rows=204 width=204)"
> "  Join Filter: (((vanalameda.streetno)::text ~~ (aralameda.streetnum)::text) 
> AND ((vanalameda.streetname)::text ~~ (aralameda.streetname)::text) AND 
> ((vanalameda.lastname)::text ~~ (aralameda.ownername)::text) AND 
> ((vanalameda.firstname)::text ~~ (aralameda.ownername)::text))"
> "  ->  Seq Scan on vanalameda  (cost=0.00..26597.80 rows=734780 width=204)"
> "  ->  Materialize  (cost=46690.74..58735.87 rows=444613 width=113)"
> "->  Seq Scan on aralameda  (cost=0.00..38647.13 rows=444613 
> width=113)"
> 
> One general question: does the width of the tables (i.e. the numbers of 
> columns not being joined and the size of those fields) matter? The tables do 
> have a lot of extra columns that I could slice out.
> 

Is there any reason you're using '~~' to compare values, rather than '='?

If you're intentionally using LIKE-style comparisons then there are some other 
things you can do, but I don't think you mean to do that, for streetno and 
streetname anyway.

Switching to an equality comparison should let your query use an index, most 
usefully one on (streetname, streetnum) probably.

I'm not sure what you're intending by comparing ownername to both firstname and 
lastname. I don't think that'll do anything useful, and doubt it'll ever match. 
Are you expecting firstname and lastname to be substrings of ownername? If so, 
you might need to use wildcards with the like.
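
Something along these lines, perhaps (untested, and the name-matching part is
only a guess at how your data is laid out; your existing arall index may already
cover the street join):

CREATE INDEX ar_street_idx ON aralameda (streetname, streetnum);

UPDATE vanalameda v
SET ownerflag = 'exact'
FROM aralameda a
WHERE v.streetname = a.streetname
  AND v.streetno   = a.streetnum
  AND a.ownername LIKE '%' || v.lastname || '%'
  AND a.ownername LIKE '%' || v.firstname || '%';

The equality join on the street columns is what lets an index do the work; the
LIKEs then only have to filter the handful of rows that share an address.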

(Also, performance and smart use of indexes tends to get better in newer 
versions of postgresql. You might want to upgrade to 9.0.3 too.)

Cheers,
  Steve





Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread Marti Raudsepp
On Thu, Mar 10, 2011 at 17:40, fork  wrote:
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldnt be the end of the world.  So I don't know if I can use
> "dangerous" setting for WAL checkpoints etc.   There are also aren't a lot of
> concurrent hits on the DB, though a few.

If you don't mind long recovery times in case of a crash, set
checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
will improve write throughput significantly.

Also, if you don't mind CORRUPTing your database after a crash,
setting fsync=off and full_page_writes=off gives another significant
boost.
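
In postgresql.conf that would look roughly like this:

# faster bulk writes, longer crash recovery
checkpoint_segments = 100
checkpoint_completion_target = 0.9

# only if you can afford to re-create the database after a crash
fsync = off
full_page_writes = off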

> I am loathe to create a new table from a select, since the indexes themselves
> take a really long time to build.

UPDATE on a table with many indexes will probably be slower. If you
want to speed up this part, use INSERT INTO x SELECT and take this
chance to partition your table, such that each individual partition
and most indexes will fit in your cache. Index builds from a warm
cache are very fast in PostgreSQL. You can create several indexes at
once in separate sessions, and the table will only be scanned once.

Don't forget to bump up maintenance_work_mem for index builds, 256MB
might be a reasonable arbitrary value.
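
A rough sketch of that approach, with made-up table and column names:

-- parent plus one child per year, for example
CREATE TABLE pums_new (id bigint, acs_year int, bar text, baz text, foo text);
CREATE TABLE pums_new_y2005 (CHECK (acs_year = 2005)) INHERITS (pums_new);

SET maintenance_work_mem = '256MB';

INSERT INTO pums_new_y2005 (id, acs_year, bar, baz, foo)
SELECT id, acs_year, bar, baz, bar || ' ' || baz
FROM pums
WHERE acs_year = 2005;

-- repeat per partition; the index builds can run in parallel sessions
CREATE INDEX pums_new_y2005_foo_idx ON pums_new_y2005 (foo);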

The downside is that partitioning can interfere with your read queries
if they expect the data in a sorted order. But then, HashAggregate
tends to be faster than GroupAggregate in many cases, so this might
not matter for your queries. Alternatively you can experiment with
PostgreSQL 9.1 alpha, which has mostly fixed this shortcoming with the
"merge append" plan node.

> As the title alludes, I will also be doing GROUP BY's on the data, and would
> love to speed these up, mostly just for my own impatience...

I think regular tuning is the best you can do here.

Regards,
Marti



Re: [PERFORM] big joins not converging

2011-03-10 Thread fork
Steve Atkins  blighty.com> writes:

> 
> 
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
> 
> > Hi postgressers -
> > 
> > As part of my work with voter file data, I pretty regularly have to join one
> > large-ish (over 500k rows) table to another. Sometimes this is via a text
> > field (countyname) + integer (voter id). I've noticed sometimes this
> > converges and sometimes it doesn't, seemingly regardless of how I index
> > things.

By "converge" you mean "finish running" -- "converge" has a lot of other
overtones for us amateur math types.

Note that I think you are doing "record linkage", which is an academic stepchild
discipline of its own these days.  It might bear some research.  There is also a
CDC matching program for text files, freely downloadable for Windows (ack), if you
hunt for it.

For now, my first thought is that you should try a few different matches, maybe
via PL/PGSQL functions, cascading the non-hits to the next step in the process
while shrinking your tables.  Upcase and delete all spaces, etc.  First use
equality on all columns, which should be able to use indices, and separate out those
records.  Then try equality on a few columns.  Then try some super fuzzy regexes
on a few columns.  Etc.

You will also have to give some thought to scoring a match, with perfection a
one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein, and other fuzzy string tools are your friend.  I
want to write a version of SAS's COMPGED for Postgres, but I haven't got round
to it yet.
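
The contrib module fuzzystrmatch already covers a lot of that ground; here is a
sketch of a fuzzy pass (it assumes the module is installed, guesses that the
owner's last name is the first word of ownername, and assumes unmatched rows
still have ownerflag IS NULL after the exact pass):

UPDATE vanalameda v
SET ownerflag = 'fuzzy'
FROM aralameda a
WHERE v.ownerflag IS NULL                 -- only rows the exact pass missed
  AND v.streetno = a.streetnum
  AND v.streetname = a.streetname
  AND soundex(v.lastname) = soundex(split_part(a.ownername, ' ', 1))
  AND levenshtein(upper(v.lastname), upper(split_part(a.ownername, ' ', 1))) <= 2;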




Re: [PERFORM] big joins not converging

2011-03-10 Thread Dan Ancona

On Mar 10, 2011, at 3:48 PM, fork wrote:
[much thoughtfulness]


Steve Atkins  blighty.com> writes:
[also much thoughtfulness]


Steve and fork -- thank you, this is super helpful. I meant to tweak  
that exact-match search before sending this around, sorry if that was  
confusing. It was meant to be a placeholder for [some set of  
matches that works]. And yes, "not converging" was incorrect, I did  
mean "not finishing." But taken together, your answers make it pretty  
clear that there's no particularly obvious easy solution I'm  
missing; this really is kind of tricky. It's a choice between  
developing some in-house capacity for this and sending people to  
various vendors, so we'll probably lean on the vendors for now, at  
least while we work on it. I've gotten my head partway around PL/PGSQL  
functions, so I may give that another try.


And you're right fork, Record Linkage is in fact an entire academic  
discipline! I had no idea, this is fascinating and helpful:


http://en.wikipedia.org/wiki/Record_linkage

Thanks so much!

Dan





Re: [PERFORM] NULLS LAST performance

2011-03-10 Thread Robert Haas
On Thu, Mar 10, 2011 at 11:32 AM, Merlin Moncure  wrote:
> On Thu, Mar 10, 2011 at 9:55 AM, Robert Haas  wrote:
>> On Wed, Mar 9, 2011 at 6:01 PM, Jim Nasby  wrote:
>>> Unfortunately, I don't think the planner actually has that level of 
>>> knowledge.
>>
>> Actually, I don't think it would be that hard to teach the planner
>> about that special case...
>>
>>> A more reasonable fix might be to teach the executor that it can do 2 scans 
>>> of the index: one to get non-null data and a second to get null data. I 
>>> don't know if the use case is prevalent enough to warrant the extra code 
>>> though.
>>
>> That would probably be harder, but useful.  I thought about working on
>> it before but got sidetracked onto other things.
>
> ISTM this isn't all that different from the case of composite indexes
> where you are missing the left most term, or you have an index on
> a,b,c (which the server already handles) but user asks for a,b desc,
> c. If cardinality on b is low it might pay to loop and break up the
> scan.

Yeah, there are a couple of refinements possible here.  One
possibility is that you might ask for ORDER BY a, b and the only
relevant index is on a.  In that case, it'd be a good idea to consider
scanning the index and sorting each equal group on b.  I've seen quite
a few queries that would benefit from this.  A second possibility is
that you might ask for ORDER BY a, b and the only relevant index is on
a, b DESC.  In that case, you could do three things:

- Scan the index and sort each group that's equal on a by b desc, just
as if the index were only on a.
- Scan the index and reverse each group.
- Scan the index in a funny order - for each value of a, find the
highest value of b and scan backwards until the a value changes; then
repeat for the next a-value.

And similarly with the case where you have ORDER BY a NULLS FIRST and
an index on a NULLS LAST, you could either:

- Detect when the column is NOT NULL and ignore the NULLS FIRST/LAST
property for purposes of matching the index in such cases, or
- Scan the index in a funny order - traverse the index to find the
first non-NULL entry at whichever end of the index has the nulls, go
from there to the end, and then "wrap around" to pick up the null
entries

The tricky part, at least IMO, is that you've got to not only teach
the planner to recognize these conditions when they occur, but also
find some way of passing it down to the index AM, which you also have
to modify to know how to do all this stuff.  The worst part about
making modifications of this type is that it's really hard to unit
test them - the planner, executor, and index AM all have to cooperate
before you can get off the ground.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
