Re: [PERFORM] Table partitioning problem
Hi Jim, thanks for your answer.

The database model is roughly like this: Measure(id, numbering, date, crcCorrect, sensorId), SimpleMeasure(id, doubleValue) and GenericMeasure(id, blobValue, numberOfElements). In the UML model, SimpleMeasure and GenericMeasure inherit from the Measure class, so in the database the foreign keys of SimpleMeasure and GenericMeasure point to the Measure table, which is partitioned by sensor.

Inserting a measure succeeds, but problems come up when inserting into the SimpleMeasure table: the foreign key check cannot find the row just inserted into the Measure table, because it does not look at the partitioned (child) tables.

ERROR: insert or update on table "simpleMeasure" violates foreign key constraint "fk_measure_id"
DETAIL: Key (measure_id)=(1) is not present in table Measure

The inheritance is only used to set up PostgreSQL's partitioning, and the limitation comes from there. The same problem is also reported in the following posts:
http://archives.postgresql.org/pgsql-performance/2008-07/msg00224.php
http://archives.postgresql.org/pgsql-admin/2007-09/msg00031.php

Best Regards

On 09/03/2011 23:01, Jim Nasby wrote:
> On Mar 8, 2011, at 9:45 AM, Samba GUEYE wrote:
>> I have a problem with table partitioning because I have a foreign key applied on the partitioned table and it throws a constraint violation error during inserts. I saw in the manual (http://www.postgresql.org/docs/8.4/interactive/ddl-inherit.html, caveats section) that it's a limitation due to PostgreSQL table inheritance. SELECT query performance is really bad without partitioning, and I'm looking for a workaround to this foreign key problem or another solution to improve performance for large tables.
>
> Actually, this sounds more like having a foreign key pointed at a parent table in an inheritance tree, which flat-out doesn't do what you'd want. Can you tell us what the foreign key constraint actually is, and what the inheritance setup for the tables in the FK is?
> --
> Jim C. Nasby, Database Architect   j...@nasby.net
> 512.569.9461 (cell)                http://jim.nasby.net
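A minimal sketch of the kind of setup being described, to make the failure mode concrete (all table and column names here are assumptions based on the model above, not the poster's actual DDL):

    -- Parent table, partitioned by sensor using inheritance (the pre-9.x scheme).
    CREATE TABLE measure (
        id           bigint PRIMARY KEY,
        numbering    integer,
        measure_date timestamptz,
        crc_correct  boolean,
        sensor_id    integer
    );

    -- One child table per sensor; rows are physically stored here.
    CREATE TABLE measure_sensor_1 (
        CHECK (sensor_id = 1)
    ) INHERITS (measure);

    -- Subclass table whose foreign key points at the parent.
    CREATE TABLE simple_measure (
        id           bigint PRIMARY KEY,
        double_value double precision,
        measure_id   bigint CONSTRAINT fk_measure_id REFERENCES measure (id)
    );

    -- The FK check (like any unique or primary key check) only sees rows stored
    -- in "measure" itself, never in its children, so this fails:
    INSERT INTO measure_sensor_1 VALUES (1, 1, now(), true, 1);
    INSERT INTO simple_measure VALUES (1, 0.5, 1);   -- FK violation

This is exactly the inheritance caveat the documentation link above describes: indexes, unique constraints and foreign keys apply to a single physical table and are not aware of child tables.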
Re: [PERFORM] Performance trouble finding records through related records
Hi, all.

I've done some further analysis and found a form that works if I split things over two separate queries (B1 and B2, below), but I still have trouble when combining them (B, below).

This is the full pseudo-query:

    SELECT ... FROM A
    UNION
    SELECT ... FROM B
    ORDER BY dateTime DESC
    LIMIT 50

In that pseudo-query:
- A is fast (a few ms). A is all events for the given customer.
- B is slow (1 minute). B is all events for the same transactions as all events for the given customer.

Zooming in on B, it originally looks as follows:

    SELECT events2.id, events2.transactionId, events2.dateTime
    FROM events_events events2
    JOIN events_eventdetails details2_transKey
         ON events2.id = details2_transKey.event_id
        AND details2_transKey.keyname = 'transactionId'
    JOIN events_eventdetails details2_transValue
         ON substring(details2_transKey.value,0,32) = substring(details2_transValue.value,0,32)
        AND details2_transValue.keyname = 'transactionId'
    JOIN events_eventdetails customerDetails
         ON details2_transValue.event_id = customerDetails.event_id
        AND customerDetails.keyname = 'customer_id'
        AND substring(customerDetails.value,0,32) = '598124'
    WHERE events2.eventtype_id IN (100,103,105,... et cetera ...)

The above version of B is tremendously slow. The only fast version I've found so far is to:
- run a sub-query B1, and then
- run a sub-query B2 with the results of B1.

B1 looks as follows and runs very fast (a few ms): http://explain.depesz.com/s/7JS

    SELECT substring(details2_transValue.value,0,32)
    FROM events_eventdetails_customer_id customerDetails
    JOIN ONLY events_eventdetails details2_transValue USING (event_id)
    WHERE customerDetails.keyname = 'customer_id'
      AND substring(customerDetails.value,0,32) = '49'
      AND details2_transValue.keyname = 'transactionId'

B2 looks as follows and also runs very fast (a few ms): http://explain.depesz.com/s/jGO

    SELECT events2.id, events2.dateTime
    FROM events_events events2
    JOIN events_eventdetails details2_transKey
         ON events2.id = details2_transKey.event_id
        AND details2_transKey.keyname = 'transactionId'
        AND substring(details2_transKey.value,0,32) IN (... results of B1 ...)
        AND events2.eventtype_id IN (100,103,105,106,45,34,14,87,58,78,7,76,11,25,57,98,30,35,33,49,52,28,85,59,23,22,51,48,36,65,66,18,13,86,75,44,38,43,94,56,95,96,71,50,81,90,89,16,17,88,79,77,68,97,92,67,72,53,2,10,31,32,80,24,93,26,9,8,61,5,73,70,63,20,60,40,41,39,101,104,107,99,64,62,55,69,19,46,47,15,21,27,54,12,102,108)

The combined version of B is slow again (3-10 seconds): http://explain.depesz.com/s/9oM

    SELECT events2.id, events2.dateTime
    FROM events_events events2
    JOIN events_eventdetails details2_transKey
         ON events2.id = details2_transKey.event_id
        AND details2_transKey.keyname = 'transactionId'
        AND substring(details2_transKey.value,0,32) IN (
            SELECT substring(details2_transValue.value,0,32)
            FROM events_eventdetails_customer_id customerDetails
            JOIN ONLY events_eventdetails details2_transValue USING (event_id)
            WHERE customerDetails.keyname = 'customer_id'
              AND substring(customerDetails.value,0,32) = '49'
              AND details2_transValue.keyname = 'transactionId')
        AND events2.eventtype_id IN (100,103,105,106,45,34,14,87,58,78,7,76,11,25,57,98,30,35,33,49,52,28,85,59,23,22,51,48,36,65,66,18,13,86,75,44,38,43,94,56,95,96,71,50,81,90,89,16,17,88,79,77,68,97,92,67,72,53,2,10,31,32,80,24,93,26,9,8,61,5,73,70,63,20,60,40,41,39,101,104,107,99,64,62,55,69,19,46,47,15,21,27,54,12,102,108)

At the moment I see no other conclusion than to offer B1 and B2 to the database separately, but it feels like defeat :-|
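One way to keep this in a single statement while still forcing the B1 part to be evaluated first is to move it into a WITH clause; in the PostgreSQL versions current at the time (8.4 and later), a CTE is materialized once and acts as an optimization fence, which mimics the manual B1-then-B2 approach. This is only a sketch against the column names shown above, not a tested query:

    WITH trans_keys AS (
        SELECT substring(details2_transValue.value,0,32) AS trans_key
        FROM events_eventdetails_customer_id customerDetails
        JOIN ONLY events_eventdetails details2_transValue USING (event_id)
        WHERE customerDetails.keyname = 'customer_id'
          AND substring(customerDetails.value,0,32) = '49'
          AND details2_transValue.keyname = 'transactionId'
    )
    SELECT events2.id, events2.dateTime
    FROM events_events events2
    JOIN events_eventdetails details2_transKey
         ON events2.id = details2_transKey.event_id
        AND details2_transKey.keyname = 'transactionId'
    WHERE substring(details2_transKey.value,0,32) IN (SELECT trans_key FROM trans_keys)
      AND events2.eventtype_id IN (100,103,105 /* ... rest of the list ... */);

Whether the planner then chooses a hash semi-join against the small CTE result (as it effectively does in the fast two-step variant) is something to verify with EXPLAIN ANALYZE.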
Re: [PERFORM] NULLS LAST performance
On Wed, Mar 9, 2011 at 6:01 PM, Jim Nasby wrote:
> Unfortunately, I don't think the planner actually has that level of knowledge.

Actually, I don't think it would be that hard to teach the planner about that special case...

> A more reasonable fix might be to teach the executor that it can do 2 scans
> of the index: one to get non-null data and a second to get null data. I don't
> know if the use case is prevalent enough to warrant the extra code though.

That would probably be harder, but useful. I thought about working on it before but got sidetracked onto other things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
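For anyone hitting this today rather than waiting on planner or executor changes, the usual workaround is to build an index whose null ordering matches the query, so the ORDER BY ... NULLS LAST can be satisfied by an ordinary index scan. A sketch with made-up table and column names (the original query in this thread is not shown here):

    -- Query shape that motivates the thread:
    --   SELECT * FROM t ORDER BY col DESC NULLS LAST LIMIT 50;
    -- A plain index on (col DESC) stores NULLs first, so it cannot produce
    -- that ordering directly. An index declared with the matching null
    -- ordering can:
    CREATE INDEX t_col_desc_nulls_last_idx ON t (col DESC NULLS LAST);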
[PERFORM] Tuning massive UPDATES and GROUP BY's?
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS) is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it is currently only using a little bit of memory and CPU (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier, it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled back, that wouldn't be the end of the world. So I wonder whether I can use "dangerous" settings for WAL, checkpoints, etc. There also aren't a lot of concurrent hits on the DB, though there are a few.

I am loath to create a new table from a SELECT, since the indexes themselves take a really long time to build.

As the title alludes, I will also be doing GROUP BYs on the data, and would love to speed these up, mostly just for my own impatience...
Re: [PERFORM] NULLS LAST performance
On Thu, Mar 10, 2011 at 9:55 AM, Robert Haas wrote:
> On Wed, Mar 9, 2011 at 6:01 PM, Jim Nasby wrote:
>> Unfortunately, I don't think the planner actually has that level of
>> knowledge.
>
> Actually, I don't think it would be that hard to teach the planner
> about that special case...
>
>> A more reasonable fix might be to teach the executor that it can do 2 scans
>> of the index: one to get non-null data and a second to get null data. I
>> don't know if the use case is prevalent enough to warrant the extra code
>> though.
>
> That would probably be harder, but useful. I thought about working on
> it before but got sidetracked onto other things.

ISTM this isn't all that different from the case of composite indexes where you are missing the leftmost term, or where you have an index on (a, b, c) (which the server already handles) but the user asks for a, b DESC, c. If the cardinality of b is low, it might pay to loop and break up the scan.

merlin
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Thu, Mar 10, 2011 at 9:40 AM, fork wrote:
> Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
> row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
> is never going to be that fast, what should one do to make it faster?
>
> I set work_mem to 2048MB, but it currently is only using a little bit of memory
> and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
> it was using 70% of the memory).
>
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldnt be the end of the world. So I don't know if I can use
> "dangerous" setting for WAL checkpoints etc. There are also aren't a lot of
> concurrent hits on the DB, though a few.
>
> I am loathe to create a new table from a select, since the indexes themselves
> take a really long time to build.

You are aware that updating the field for the entire table, especially if there is an index on it (or on any field being updated), will cause all your indexes to be rebuilt anyway? When you update a record, it gets a new position in the table, and a new index entry pointing at that position. INSERT/SELECT to a temp table, then TRUNCATE and INSERT/SELECT back, is usually going to be faster and will save you the reindex/cluster. OTOH, if you have foreign keys it can be a headache. (A sketch of this shuffle follows below.)

> As the title alludes, I will also be doing GROUP BY's on the data, and would
> love to speed these up, mostly just for my own impatience...

Need to see the query here to see if you can make them go faster.

merlin
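A sketch of the temp-table shuffle described above, assuming a hypothetical table pums(id, bar, baz, foo) and that foreign keys and other dependencies have been dealt with separately (this is not the poster's actual schema):

    BEGIN;
    -- 1. Copy the transformed rows out in one pass.
    CREATE TEMP TABLE pums_tmp AS
        SELECT id, bar, baz, bar || ' ' || baz AS foo
        FROM pums;

    -- 2. Empty the original table cheaply (no per-row dead tuples to vacuum).
    TRUNCATE pums;

    -- 3. Copy the rows back in one pass.
    INSERT INTO pums (id, bar, baz, foo)
        SELECT id, bar, baz, foo FROM pums_tmp;
    COMMIT;

Dropping the indexes before step 3 and recreating them afterwards is typically faster still, since bulk-loading into an unindexed table and indexing once at the end beats maintaining the indexes row by row.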
[PERFORM] Basic performance tuning on dedicated server
I'm setting up my first PostgreSQL server to replace an existing MySQL server. I've been reading Gregory Smith's book PostgreSQL 9.0 High Performance and also Riggs/Krosing's PostgreSQL 9 Administration Cookbook. While both of these books are excellent, I am completely new to PostgreSQL and I cannot possibly read and understand every aspect of tuning in the short amount of time before I have to have this server running.

I started out by using the 11-step process for tuning a new dedicated server (page 145 in Gregory Smith's book) but I found I had more questions than I could get answered in the short amount of time I have. So, plan B is to use pgtune to get a ballpark configuration and then fine-tune later as I learn more.

I ran some performance tests where I imported my 11 GB database from our old MySQL server into PostgreSQL 9.0.3. In my testing I left postgresql.conf at default values. The PostgreSQL test database completely blew away the old MySQL server in performance. Again, postgresql.conf was never optimized, so I feel I will be OK if I just get in the ballpark with tuning the postgresql.conf file.

I'd like to run my plan by you guys to see if it seems sane and make sure I'm not leaving out anything major.

I'll be running PostgreSQL 9.0.3 on a Solaris 10 64-bit (SPARC) box with 16 GB of RAM. The local file system is ZFS. The database file systems are UFS and are SAN-mounted from VERY fast disks with battery-backed write cache. I don't know anybody else who is running a mix of ZFS and UFS file systems, and I cannot change this. ZFS has its own file system cache, so I'm concerned about the ramifications of having caches for both ZFS and UFS. The only database-related files that are stored on the local ZFS file system are the PostgreSQL binaries and the system logs.

From the extensive reading I've done, it seems generally accepted to set the UFS file system cache to use 50% of the system RAM. That leaves 8 GB for PostgreSQL. Well, not really 8 GB: I've reserved 1 GB for system use, which leaves me with 7 GB for PostgreSQL to use. I ran pgtune and specified 7 GB as the memory (7 * 1024 * 1024 = 7340032) and 300 connections. The resulting postgresql.conf is what I plan to use.

After reading Gregory Smith's book, I've decided to put the database on one UFS file system, the WAL on a separate UFS file system (mounted with forcedirectio) and the archive logs on yet another UFS file system. I'll be on Solaris 10, so I've set wal_sync_method = fdatasync based on recommendations from other Solaris users. I did a lot of Google searches on wal_sync_method and Solaris.

That's what I plan to go live with in a few days. Since my test server with default configs already blows away the old database server, I think I can get away with this strategy. Time is not on my side.

I originally installed the 32-bit PostgreSQL binaries but later switched to 64-bit binaries. I've read that the 32-bit version is faster and uses less memory than the 64-bit version. At this point I'm assuming I need the 64-bit binaries in order to take full advantage of the 7 GB of RAM I have allocated to PostgreSQL. If I am wrong here please let me know.

This has been a lot of information to cram down in the short amount of time I've had to deal with this project. I'm going to have to go back and read the PostgreSQL 9.0 High Performance book two or three more times and really dig into the details, but for now I'm going to cheat and use pgtune as described above. Thank you in advance for any advice or additional tips you may be able to provide.
Rick
[PERFORM] unexpected stable function behavior
Hello, list

Our company is creating a ticketing system. Of course the performance issues are very important to us (as to all of you, I guess). To increase the speed of some queries, stable functions are used, but somehow they don't act exactly as I expect, so would you please explain what I am doing (or expecting) wrong...

First of all I have the stable function, which runs fast and I have no problems with it at all:

CREATE OR REPLACE FUNCTION web_select_extra_price(prm_price_id integer, prm_event_id integer, prm_cashier_id integer)
  RETURNS numeric AS
'
 some code here
'
  LANGUAGE plpgsql STABLE
  COST 100;

Now the test:

1) Query without using the function:

explain analyze
SELECT thtp_tick_id, price_id,
       price_price,
       price_color
  FROM ticket_price
  JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
 WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)
 ORDER BY price_id;

Result:
"Sort (cost=132.47..133.77 rows=518 width=25) (actual time=5.125..5.842 rows=4335 loops=1)"
"  Sort Key: ticket_price.price_id"
"  Sort Method: quicksort Memory: 433kB"
"  -> Nested Loop (cost=0.00..109.12 rows=518 width=25) (actual time=0.037..3.148 rows=4335 loops=1)"
"     -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 width=21) (actual time=0.014..0.026 rows=7 loops=1)"
"        Index Cond: (price_event_id = 7820)"
"        Filter: ((now() >= price_date) AND (now() <= price_date_till))"
"     -> Index Scan using idx_thtp_price_id on ticket_has_ticket_price (cost=0.00..47.06 rows=259 width=8) (actual time=0.013..0.211 rows=619 loops=7)"
"        Index Cond: (ticket_has_ticket_price.thtp_price_id = ticket_price.price_id)"
"Total runtime: 6.425 ms"

2) Query using the function:

explain analyze
SELECT thtp_tick_id, price_id,
       price_price, web_select_extra_price(price_id, price_event_id, 1),
       price_color
  FROM ticket_price
  JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
 WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)
 ORDER BY price_id;

Result:
"Sort (cost=261.97..263.27 rows=518 width=29) (actual time=704.224..704.927 rows=4335 loops=1)"
"  Sort Key: ticket_price.price_id"
"  Sort Method: quicksort Memory: 433kB"
"  -> Nested Loop (cost=0.00..238.62 rows=518 width=29) (actual time=0.272..699.073 rows=4335 loops=1)"
"     -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 width=25) (actual time=0.011..0.052 rows=7 loops=1)"
"        Index Cond: (price_event_id = 7820)"
"        Filter: ((now() >= price_date) AND (now() <= price_date_till))"
"     -> Index Scan using idx_thtp_price_id on ticket_has_ticket_price (cost=0.00..47.06 rows=259 width=8) (actual time=0.017..0.582 rows=619 loops=7)"
"        Index Cond: (ticket_has_ticket_price.thtp_price_id = ticket_price.price_id)"
"Total runtime: 705.531 ms"

Now what you might think is that executing web_select_extra_price accounts for the difference, but...

3) As a STABLE function should be executed once for every different set of parameters, I do:

SELECT web_select_extra_price(price_id, 7820, 1) FROM (
    SELECT distinct price_id
      FROM ticket_price
      JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
     WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)
) as qq;

Result:
"Subquery Scan on qq (cost=110.34..110.88 rows=2 width=4) (actual time=7.265..8.907 rows=7 loops=1)"
"  -> HashAggregate (cost=110.34..110.36 rows=2 width=4) (actual time=6.866..6.873 rows=7 loops=1)"
"     -> Nested Loop (cost=0.00..109.05 rows=517 width=4) (actual time=0.037..4.643 rows=4335 loops=1)"
"        -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 width=4) (actual time=0.014..0.038 rows=7 loops=1)"
"           Index Cond: (price_event_id = 7820)"
"           Filter: ((now() >= price_date) AND (now() <= price_date_till))"
"        -> Index Scan using idx_thtp_price_id on ticket_has_ticket_price (cost=0.00..47.04 rows=258 width=4) (actual time=0.019..0.336 rows=619 loops=7)"
"           Index Cond: (ticket_has_ticket_price.thtp_price_id = ticket_price.price_id)"
"Total runtime: 8.966 ms"

You can see the query has only 7 distinct parameter sets to pass to the function, but...

4) explain analyze
SELECT web_select_extra_price(price_id, 7820, 1)
  FROM ticket_price
  JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
 WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)

Result:
"Nested Loop (cost=0.00..238.30 rows=517 width=4) (actual time=0.365..808.537 rows=4335 loops=1)"
"  -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 w
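For what it's worth, a STABLE marking mainly tells the planner the function's result will not change within one statement (so it can, for example, be used in index conditions); it does not cause results to be cached, so in queries 2 and 4 above it is still executed once per output row. A sketch of one way to evaluate it only once per distinct price_id and join the result back, using a WITH clause (8.4 or later) so the subquery is not flattened back into the outer query; untested, based only on the column names shown above:

    WITH extra AS (
        SELECT qq.price_id, web_select_extra_price(qq.price_id, 7820, 1) AS extra_price
          FROM (SELECT DISTINCT price_id
                  FROM ticket_price
                  JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
                 WHERE price_event_id = 7820
                   AND current_timestamp >= price_date
                   AND current_timestamp <= price_date_till) qq
    )
    SELECT thtp_tick_id, price_id, price_price, extra.extra_price, price_color
      FROM ticket_price
      JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
      JOIN extra USING (price_id)
     WHERE price_event_id = 7820
       AND current_timestamp >= price_date AND current_timestamp <= price_date_till
     ORDER BY price_id;

Here the function should be called only once per distinct price_id (7 times in the example), because the CTE is materialized before the outer join runs.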
Re: [PERFORM] Basic performance tuning on dedicated server
On Thu, Mar 10, 2011 at 3:12 AM, runner wrote:
>
> I'm setting up my first PostgreSQL server to replace an existing MySQL server. I've been reading Gregory Smith's book Postgres 9.0 High Performance and also Riggs/Krosing's PostgreSQL 9 Administration Cookbook. While both of these books are excellent, I am completely new to PostgreSQL and I cannot possibly read and understand every aspect of tuning in the short amount time before I have to have this server running.
>
> I started out by using the 11 step process for tuning a new dedicated server (page 145 in Gregory Smith's book) but I found I had more questions than I could get answered in the short amount of time I have. So, plan B is to use pgtune to get a ballpark configuration and then fine tune later as I learn more.
>
> I ran some performance tests where I imported my 11Gb database from our old MySQL server into PostgreSQL 9.0.3. In my testing I left the postgresql.conf at default values. The PostgreSQL test database completely blew away the old MySQL server in performance. Again, the postgresql.conf was never optimized so I feel I will be OK if I just get in the ballpark with tuning the postgresql.conf file.
>
> I'd like to run my plan by you guys to see if it seems sane and make sure I'm not leaving out anything major.
>
> I'll be running PostgreSQL 9.0.3 on a Solaris 10 64 bit (Sparc) box with 16G of RAM. The local file system is ZFS. The database file systems are UFS and are SAN mounted from VERY fast disks with battery backed write cache. I don't know anybody else who is running a mix of ZFS and UFS file systems, I cannot change this. ZFS has it's own file system cache so I'm concerned about the ramifications of having caches for both ZFS and UFS. The only database related files that are stored on the local ZFS file system are the PostgreSQL binaries and the system logs.
>
> From the extensive reading I've done, it seems generally accepted to set the UFS file system cache to use 50% of the system RAM. That leaves 8G left for PostgreSQL. Well, not really 8G, I've reserved 1G for system use which leaves me with 7G for PostgreSQL to use. I ran pgtune and specified 7G as the memory ( 7 * 1024 * 1024 = 7340032 ) and 300 connections. The resulting postgresql.conf is what I plan to use.
>
> After reading Gregory Smith's book, I've decided to put the database on one UFS file system, the WAL on a separate UFS file system (mounted with forcedirectio) and the archive logs on yet another UFS file system. I'll be on Solaris 10 so I've set wal_sync_method = fdatasync based on recommendations from other Solaris users. Did a lot of google searches on wal_sync_method and Solaris.
>
> That's what I plan to go live with in a few days. Since my test server with default configs already blows away the old database server, I think I can get away with this strategy. Time is not on my side.
>
> I originally installed the 32 bit PostgreSQL binaries but later switched to 64 bit binaries. I've read the 32 bit version is faster and uses less memory than the 64 bit version. At this point I'm assuming I need the 64 bit binaries in order to take full advantage the the 7G of RAM I have allocated to PostgreSQL. If I am wrong here please let me know.
>
> This has been a lot of information to cram down in the short amount of time I've had to deal with this project. I'm going to have to go back and read the PostgreSQL 9.0 High Performance book two or three more times and really dig in to the details but for now I'm going to cheat and use pgtune as described above. Thank you in advance for any advice or additional tips you may be able to provide.

Congratulations! Postgres memory tuning is a complicated topic, but most of it tends to be very subtle in its effects or applies to specific situations, like dealing with I/O storms during checkpoints. The only settings that often need to be immediately cranked out of the box are maintenance_work_mem and (much more carefully) work_mem.

Regardless of how shared_buffers is set, ALL of your server's memory goes to Postgres, less what the OS keeps for itself and other applications. You do not allocate memory to Postgres -- you only suggest how it might be used. I stopped obsessing over how it was set years ago. In fact, on Linux for example, dealing with background dirty page flushing via the OS (because stock settings can cause I/O storms) is a bigger deal than shared_buffers by about an order of magnitude, IMNSHO.

The non-memory-related settings of postgresql.conf -- the planner settings (join/from collapse limit, random_page_cost, etc.) and the I/O settings (fsync, wal_sync_method, etc.) -- are typically much more important for performance than how memory is set up. The reason Postgres is showing up MySQL is almost certainly due to the query planner and/or (if you were using MyISAM) reaping the benefits of MVCC. My knowledge of MySQL stops at
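With 300 connections configured, a global work_mem high enough for the big queries can be risky; one way to act on the advice above is to scope the higher values to the sessions or roles that need them. The role name below is a placeholder:

    -- Keep a modest global default in postgresql.conf (e.g. work_mem = 8MB), then:
    ALTER ROLE reporting SET work_mem = '64MB';   -- applies only to that role's sessions

    -- Or raise it just around one big sort/hash in the current session:
    SET work_mem = '256MB';
    -- ... run the heavy query ...
    RESET work_mem;

Each sort or hash node in a running query can use up to work_mem, so the global setting has to assume the worst case across all connections; per-role and per-session overrides avoid that.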
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
Merlin Moncure gmail.com> writes:

>> I am loathe to create a new table from a select, since the indexes themselves
>> take a really long time to build.
>
> you are aware that updating the field for the entire table, especially
> if there is an index on it (or any field being updated), will cause
> all your indexes to be rebuilt anyways? when you update a record, it
> gets a new position in the table, and a new index entry with that
> position.
> insert/select to temp, + truncate + insert/select back is
> usually going to be faster and will save you the reindex/cluster.
> otoh, if you have foreign keys it can be a headache.

Hmph. I guess I will have to find a way to automate it, since there will be a lot of times I want to do this.

>> As the title alludes, I will also be doing GROUP BY's on the data, and would
>> love to speed these up, mostly just for my own impatience...
>
> need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is really only for batch analysis versus transaction processing. Like "put all your temp tables on a different disk" or whatever. I will post specifics later.
Re: [PERFORM] unexpected stable function behavior
On Thu, Mar 10, 2011 at 10:26 AM, Julius Tuskenis wrote:
> Hello, list
>
> Our company is creating a ticketing system. Of course the performance issues are very important to us (as to all of you I guess). To increase speed of some queries stable functions are used, but somehow they don't act exactly as I expect, so would you please explain what am I doing (or expecting) wrong...
>
> First of all I have the stable function witch runs fast and I have no problems with it at all.
> CREATE OR REPLACE FUNCTION web_select_extra_price(prm_price_id integer, prm_event_id integer, prm_cashier_id integer)
>  RETURNS numeric AS
> '
>  some code here
> '
>  LANGUAGE plpgsql STABLE
>  COST 100;
>
> Now the test:
>
> 1) query without using the function
> explain analyze
> SELECT thtp_tick_id, price_id,
>    price_price,
>    price_color
>  FROM ticket_price
>  JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
>  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)
>  ORDER BY price_id;
>
> Result:
> "Sort (cost=132.47..133.77 rows=518 width=25) (actual time=5.125..5.842 rows=4335 loops=1)"
> " Sort Key: ticket_price.price_id"
> " Sort Method: quicksort Memory: 433kB"
> " -> Nested Loop (cost=0.00..109.12 rows=518 width=25) (actual time=0.037..3.148 rows=4335 loops=1)"
> "   -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 width=21) (actual time=0.014..0.026 rows=7 loops=1)"
> "      Index Cond: (price_event_id = 7820)"
> "      Filter: ((now() >= price_date) AND (now() <= price_date_till))"
> "   -> Index Scan using idx_thtp_price_id on ticket_has_ticket_price (cost=0.00..47.06 rows=259 width=8) (actual time=0.013..0.211 rows=619 loops=7)"
> "      Index Cond: (ticket_has_ticket_price.thtp_price_id = ticket_price.price_id)"
> "Total runtime: 6.425 ms"
>
> 2) Query using the function
> explain analyze
> SELECT thtp_tick_id, price_id,
>    price_price, web_select_extra_price(price_id, price_event_id, 1),
>    price_color
>  FROM ticket_price
>  JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
>  WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)
>  ORDER BY price_id;
>
> Result:
> "Sort (cost=261.97..263.27 rows=518 width=29) (actual time=704.224..704.927 rows=4335 loops=1)"
> " Sort Key: ticket_price.price_id"
> " Sort Method: quicksort Memory: 433kB"
> " -> Nested Loop (cost=0.00..238.62 rows=518 width=29) (actual time=0.272..699.073 rows=4335 loops=1)"
> "   -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 width=25) (actual time=0.011..0.052 rows=7 loops=1)"
> "      Index Cond: (price_event_id = 7820)"
> "      Filter: ((now() >= price_date) AND (now() <= price_date_till))"
> "   -> Index Scan using idx_thtp_price_id on ticket_has_ticket_price (cost=0.00..47.06 rows=259 width=8) (actual time=0.017..0.582 rows=619 loops=7)"
> "      Index Cond: (ticket_has_ticket_price.thtp_price_id = ticket_price.price_id)"
> "Total runtime: 705.531 ms"
>
> Now what you can think is that executing web_select_extra_price takes the difference, but
> 3) As STABLE function should be executed once for every different set of parameters I do
> SELECT web_select_extra_price(price_id, 7820, 1) FROM (
>
> SELECT distinct price_id
> FROM ticket_price
> JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
> WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp <= price_date_till)
> ) as qq;
>
> Result:
> "Subquery Scan on qq (cost=110.34..110.88 rows=2 width=4) (actual time=7.265..8.907 rows=7 loops=1)"
> " -> HashAggregate (cost=110.34..110.36 rows=2 width=4) (actual time=6.866..6.873 rows=7 loops=1)"
> "   -> Nested Loop (cost=0.00..109.05 rows=517 width=4) (actual time=0.037..4.643 rows=4335 loops=1)"
> "      -> Index Scan using index_price_event_id on ticket_price (cost=0.00..8.52 rows=2 width=4) (actual time=0.014..0.038 rows=7 loops=1)"
> "         Index Cond: (price_event_id = 7820)"
> "         Filter: ((now() >= price_date) AND (now() <= price_date_till))"
> "      -> Index Scan using idx_thtp_price_id on ticket_has_ticket_price (cost=0.00..47.04 rows=258 width=4) (actual time=0.019..0.336 rows=619 loops=7)"
> "         Index Cond: (ticket_has_ticket_price.thtp_price_id = ticket_price.price_id)"
> "Total runtime: 8.966 ms"
>
> You can see the query has only 7 distinct parameter sets to pass to the function but...
> 4) Explain analyze
> SELECT web_select_extra_price(price_id, 7820, 1)
> FROM ticket_price
> JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
> WHERE price_event_id = 7820 and (current_timestamp >= price_date AND current_timestamp
[PERFORM] big joins not converging
Hi postgressers -

As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes this is via a text field (countyname) + integer (voter id). I've noticed that sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things. So I'm looking for general thoughts on joining large tables, but I'm also running into a specific issue with the following slightly different query.

This one is between two tables: a 754k-row list of voters and a 445k-row list of property owners. (I'm trying to find records where the owner name matches the voter name at the same address.) I have btree single-column indexes built on all the relevant fields, and multicolumn indexes built across all the columns I'm matching. The full schemas of both tables are below. The machine is an older-ish (about 3 years old) dual-core Pentium with 4GB RAM running FreeBSD; more details below.

This is the query I've come up with so far:

explain analyze
update vanalameda set ownerflag = 'exact'
from aralameda where
vanalameda.streetno ~~ aralameda.streetnum and
vanalameda.streetname ~~ aralameda.streetname and
vanalameda.lastname ~~ aralameda.ownername and
vanalameda.firstname ~~ aralameda.ownername;

If I include the analyze, this didn't complete after running overnight. If I drop the analyze and just explain, I get this:

"Nested Loop (cost=46690.74..15384448712.74 rows=204 width=204)"
"  Join Filter: (((vanalameda.streetno)::text ~~ (aralameda.streetnum)::text) AND ((vanalameda.streetname)::text ~~ (aralameda.streetname)::text) AND ((vanalameda.lastname)::text ~~ (aralameda.ownername)::text) AND ((vanalameda.firstname)::text ~~ (aralameda.ownername)::text))"
"  -> Seq Scan on vanalameda (cost=0.00..26597.80 rows=734780 width=204)"
"  -> Materialize (cost=46690.74..58735.87 rows=444613 width=113)"
"     -> Seq Scan on aralameda (cost=0.00..38647.13 rows=444613 width=113)"

One general question: does the width of the tables (i.e. the number of columns not being joined and the size of those fields) matter? The tables do have a lot of extra columns that I could slice out.

Thanks so much!

Dan

System:
client: pgAdmin III, Mac OS
server:

select version();
PostgreSQL 8.3.7 on i386-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD]
(installed from the FreeBSD package system, default configuration)

%sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: i386
hw.model: Genuine Intel(R) CPU 2160 @ 1.80GHz
hw.ncpu: 2
hw.machine_arch: i386

w/ 4GB RAM, 1 1GB disk, no RAID.

Here's the tables...
          Table "public.aralameda"
     Column      |         Type          | Modifiers
-----------------+-----------------------+-----------
 dt000o039001010 | character varying(13) |
 o3901010        | character varying(15) |
 dt17            | character varying(2)  |
 dt046           | character varying(3)  |
 streetnum       | character varying(10) |
 streetname      | character varying(50) |
 unitnum         | character varying(10) |
 city            | character varying(30) |
 zip             | character varying(5)  |
 unk3            | character varying(1)  |
 crap1           | character varying(12) |
 crap2           | character varying(12) |
 crap3           | character varying(12) |
 crap4           | character varying(12) |
 crap5           | character varying(12) |
 crap6           | character varying(12) |
 crap7           | character varying(12) |
 crap8           | character varying(12) |
 crap9           | character varying(12) |
 crap10          | character varying(12) |
 dt2009          | character varying(4)  |
 dt066114        | character varying(6)  |
 crap11          | character varying(8)  |
 crap12          | character varying(8)  |
 ownername       | character varying(50) |
 careofname      | character varying(50) |
 unk4            | character varying(1)  |
 maddr1          | character varying(60) |
 munitnum        | character varying(10) |
 mcitystate      | character varying(30) |
 mzip            | character varying(5)  |
 mplus4          | character varying(4)  |
 dt40            | character varying(2)  |
 dt4             | character varying(1)  |
 crap13          | character varying(8)  |
 d               | character varying(1)  |
 dt0500          | character varying(4)  |
 unk6            | character varying(1)  |
 crap14          | character varying(8)  |
 unk7            | character varying(1)  |
Indexes:
    "arall" btree (streetnum, streetname, ownername)
    "aroname" btree (ownername)
    "arstreetname" btree (streetname)
    "arstreetnum" btree (streetnum)

          Table "public.vanalameda"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 vanid     | character varying(8)  |
 lastname  | character varying(25) |
 firstname | character varying(16) |
Re: [PERFORM] big joins not converging
On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:

> Hi postgressers -
>
> As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes this is via a text field (countyname) + integer (voter id). I've noticed sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things. So I'm looking for general thoughts on the joining of large tables, but also running into a specific issue with the following slightly different query:
>
> This one is between two tables that are a 754k row list of voters and a 445k row list of property owners. (I'm trying to find records where the owner name matches the voter name at the same address.) I have btree single column indices built on all the relevant fields, and multicolumn indices built across all the columns I'm matching.
>
> This is the query I've come up with so far:
>
> explain analyze
> update vanalameda set ownerflag = 'exact'
> from aralameda where
> vanalameda.streetno ~~ aralameda.streetnum and
> vanalameda.streetname ~~ aralameda.streetname and
> vanalameda.lastname ~~ aralameda.ownername and
> vanalameda.firstname ~~ aralameda.ownername;
>
> If I include the analyze, this didn't complete after running overnight. If I drop the analyze and just explain, I get this:
>
> "Nested Loop (cost=46690.74..15384448712.74 rows=204 width=204)"
> "  Join Filter: (((vanalameda.streetno)::text ~~ (aralameda.streetnum)::text) AND ((vanalameda.streetname)::text ~~ (aralameda.streetname)::text) AND ((vanalameda.lastname)::text ~~ (aralameda.ownername)::text) AND ((vanalameda.firstname)::text ~~ (aralameda.ownername)::text))"
> "  -> Seq Scan on vanalameda (cost=0.00..26597.80 rows=734780 width=204)"
> "  -> Materialize (cost=46690.74..58735.87 rows=444613 width=113)"
> "     -> Seq Scan on aralameda (cost=0.00..38647.13 rows=444613 width=113)"
>
> One general question: does the width of the tables (i.e. the numbers of columns not being joined and the size of those fields) matter? The tables do have a lot of extra columns that I could slice out.

Is there any reason you're using '~~' to compare values, rather than '='? If you're intentionally using LIKE-style comparisons then there are some other things you can do, but I don't think you mean to do that, for streetno and streetname anyway. Switching to an equality comparison should let your query use an index, most usefully one on (streetname, streetnum) probably.

I'm not sure what you're intending by comparing ownername to both firstname and lastname. I don't think that'll do anything useful, and I doubt it'll ever match. Are you expecting firstname and lastname to be substrings of ownername? If so, you might need to use wildcards with the LIKE.

(Also, performance and smart use of indexes tends to get better in newer versions of PostgreSQL. You might want to upgrade to 9.0.3 too.)

Cheers,
  Steve
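To make that suggestion concrete, here is a sketch of the proposed rewrite: equality on the address columns, and the name match expressed as substring/LIKE tests against ownername. It is based only on the column names shown earlier in the thread and is untested:

    -- Supporting index for the equality join:
    CREATE INDEX ar_street_idx ON aralameda (streetname, streetnum);

    UPDATE vanalameda v
       SET ownerflag = 'exact'
      FROM aralameda a
     WHERE v.streetname = a.streetname
       AND v.streetno   = a.streetnum
       AND a.ownername LIKE '%' || v.lastname || '%'
       AND a.ownername LIKE '%' || v.firstname || '%';

With equality on the street columns, the planner can use a hash or merge join (or the index above) instead of the nested-loop-with-filter plan shown in the original post, and the LIKE tests then only run on rows at the same address.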
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Thu, Mar 10, 2011 at 17:40, fork wrote:
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldnt be the end of the world. So I don't know if I can use
> "dangerous" setting for WAL checkpoints etc. There are also aren't a lot of
> concurrent hits on the DB, though a few.

If you don't mind long recovery times in case of a crash, set checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this will improve write throughput significantly.

Also, if you don't mind CORRUPTing your database after a crash, setting fsync=off and full_page_writes=off gives another significant boost.

> I am loathe to create a new table from a select, since the indexes themselves
> take a really long time to build.

UPDATE on a table with many indexes will probably be slower. If you want to speed up this part, use INSERT INTO x SELECT and take this chance to partition your table, such that each individual partition and most indexes will fit in your cache. Index builds from a warm cache are very fast in PostgreSQL. You can create several indexes at once in separate sessions, and the table will only be scanned once. Don't forget to bump up maintenance_work_mem for index builds; 256MB might be a reasonable arbitrary value.

The downside is that partitioning can interfere with your read queries if they expect the data in sorted order. But then, HashAggregate tends to be faster than GroupAggregate in many cases, so this might not matter for your queries. Alternatively you can experiment with PostgreSQL 9.1 alpha, which has mostly fixed this shortcoming with the "merge append" plan node.

> As the title alludes, I will also be doing GROUP BY's on the data, and would
> love to speed these up, mostly just for my own impatience...

I think regular tuning is the best you can do here.

Regards,
Marti
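A sketch of the bulk-load-then-index part of that advice, with made-up table and column names:

    -- After INSERT INTO x SELECT ... has loaded the (partitioned) table:
    SET maintenance_work_mem = '256MB';
    CREATE INDEX x_foo_idx ON x (foo);      -- session 1
    -- In parallel sessions started at about the same time:
    -- CREATE INDEX x_bar_idx ON x (bar);   -- session 2
    -- CREATE INDEX x_baz_idx ON x (baz);   -- session 3

The "table is only scanned once" effect relies on synchronized sequential scans, so the builds should be kicked off together, and each session needs to set its own maintenance_work_mem.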
Re: [PERFORM] big joins not converging
Steve Atkins blighty.com> writes:
>
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>
>> Hi postgressers -
>>
>> As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table
>> to another. Sometimes this is via a text field (countyname) + integer (voter id). I've noticed sometimes
>> this converges and sometimes it doesn't, seemingly regardless of how I index things.

By "converge" you mean "finish running" -- "converge" has a lot of other overtones for us amateur math types.

Note that I think you are doing "record linkage", which is a stepchild academic field of its own these days. It might bear some research. There is also a CDC matching program for text files, freely downloadable for Windows (ack), if you hunt for it.

For now, my first thought is that you should try a few different matches, maybe via PL/pgSQL functions, cascading the non-hits to the next step in the process while shrinking your tables: upcase and delete all spaces, etc. First use equality on all columns, which should be able to use indexes, and separate those records. Then try equality on a few columns. Then try some super-fuzzy regexes on a few columns. Etc.

You will also have to give some thought to scoring a match, with perfection a one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein, and other fuzzy string tools are your friend. I want to write a version of SAS's COMPGED for Postgres, but I haven't got round to it yet.
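A sketch of what one of the fuzzier cascade steps could look like, assuming the fuzzystrmatch contrib module (which provides soundex() and levenshtein()) is installed. The column names come from the schemas earlier in the thread; the threshold, the 'fuzzy' flag value, and the assumption that ownername starts with the last name are all made up and would need checking against the real data:

    UPDATE vanalameda v
       SET ownerflag = 'fuzzy'
      FROM aralameda a
     WHERE v.ownerflag IS NULL                         -- skip rows already matched by a stricter pass
       AND v.streetname = a.streetname
       AND v.streetno   = a.streetnum
       AND levenshtein(upper(v.lastname),
                       split_part(upper(a.ownername), ' ', 1)) <= 2;

Restricting the fuzzy comparison to rows that already match on address keeps the number of expensive string comparisons manageable.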
Re: [PERFORM] big joins not converging
On Mar 10, 2011, at 3:48 PM, fork wrote:
[much thoughtfulness]

Steve Atkins blighty.com> writes:
[also much thoughtfulness]

Steve and fork -- thank you, this is super helpful. I meant to tweak that exact search before sending this around; sorry if that was confusing. It was meant to be a placeholder for [some set of matches that works]. And yes, "not converging" was incorrect, I did mean "not finishing."

Together, your answers make it pretty clear that there's no particularly obvious easy solution that I'm missing; this really is kind of tricky. This is a choice between developing some in-house capacity for this and sending people to various vendors, so we'll probably lean on the vendors for now, at least while we work on it. I've gotten my head partway around PL/pgSQL functions, so I may give that another try.

And you're right, fork, record linkage is in fact an entire academic discipline! I had no idea; this is fascinating and helpful: http://en.wikipedia.org/wiki/Record_linkage

Thanks so much!

Dan
Re: [PERFORM] NULLS LAST performance
On Thu, Mar 10, 2011 at 11:32 AM, Merlin Moncure wrote:
> On Thu, Mar 10, 2011 at 9:55 AM, Robert Haas wrote:
>> On Wed, Mar 9, 2011 at 6:01 PM, Jim Nasby wrote:
>>> Unfortunately, I don't think the planner actually has that level of
>>> knowledge.
>>
>> Actually, I don't think it would be that hard to teach the planner
>> about that special case...
>>
>>> A more reasonable fix might be to teach the executor that it can do 2 scans
>>> of the index: one to get non-null data and a second to get null data. I
>>> don't know if the use case is prevalent enough to warrant the extra code
>>> though.
>>
>> That would probably be harder, but useful. I thought about working on
>> it before but got sidetracked onto other things.
>
> ISTM this isn't all that different from the case of composite indexes
> where you are missing the left most term, or you have an index on
> a,b,c (which the server already handles) but user asks for a,b desc,
> c. If cardinality on b is low it might pay to loop and break up the
> scan.

Yeah, there are a couple of refinements possible here.

One possibility is that you might ask for ORDER BY a, b and the only relevant index is on a. In that case, it'd be a good idea to consider scanning the index and sorting each equal group on b. I've seen quite a few queries that would benefit from this.

A second possibility is that you might ask for ORDER BY a, b and the only relevant index is on a, b DESC. In that case, you could do three things:

- Scan the index and sort each group that's equal on a by b desc, just as if the index were only on a.
- Scan the index and reverse each group.
- Scan the index in a funny order - for each value of a, find the highest value of b and scan backwards until the a value changes; then repeat for the next a-value.

And similarly with the case where you have ORDER BY a NULLS FIRST and an index on a NULLS LAST, you could either:

- Detect when the column is NOT NULL and ignore the NULLS FIRST/LAST property for purposes of matching the index in such cases, or
- Scan the index in a funny order - traverse the index to find the first non-NULL entry at whichever end of the index has the nulls, go from there to the end, and then "wrap around" to pick up the null entries.

The tricky part, at least IMO, is that you've got to not only teach the planner to recognize these conditions when they occur, but also find some way of passing it down to the index AM, which you also have to modify to know how to do all this stuff. The worst part about making modifications of this type is that it's really hard to unit test them - the planner, executor, and index AM all have to cooperate before you can get off the ground.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
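As a hand-written illustration of that last "funny order" idea, this is roughly what a query has to do today to emulate ORDER BY a NULLS FIRST when only a default index on (a), which sorts NULLS LAST, exists. The table t is hypothetical (not from this thread), and the trick leans on the fact that PostgreSQL's Append node returns its inputs in order, which is not something the SQL standard guarantees:

    (SELECT * FROM t WHERE a IS NULL)
    UNION ALL
    (SELECT * FROM t WHERE a IS NOT NULL ORDER BY a)
    LIMIT 50;

The proposed executor change would effectively do this wrap-around inside a single index scan, without the user having to restructure the query.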