Hi Julian, Using this way to break up the queries, I am able to update about 1500 rows per minute which will take over 100 days to complete, so I need to figure out why this is slow, and if there is any faster way.
UPDATE table1 SET new_column = table1.new_column FROM table2 WHERE table1.row_id = table2.row_id AND table2.row_id >= $1 AND table2.row_id < $2. Here is the explain from that: QUERY PLAN ------------------------------------------------------------------------------------------ Nested Loop (cost=0.00..39.78 rows=1 width=121) -> Index Scan using table2_pkey on table2 (cost=0.00..19.88 rows=1 width=12) Index Cond: ((row_id >= $1) AND (row_id < $2)) -> Index Scan using table1_pkey on table1 (cost=0.00..19.90 rows=1 width=113) Index Cond: (table1.row_id = table2.row_id) (5 rows) Thanks, -- Joshua Rubin On Fri, Jul 16, 2010 at 1:05 PM, Joshua Rubin <jru...@esoft.com> wrote: > Hi Julian, > > Sorry for the slow response. I think I will need to chop up the query some > how, but have not yet found an efficient way to do that. row_id is the > primary key in both tables, so that might work. > > Here is the explain: > urls_jrubin_merged=# EXPLAIN UPDATE table1 SET row_id = table2.row_id FROM > table2 WHERE table1.row_id = table2.row_id; > QUERY PLAN > > > -------------------------------------------------------------------------------------- > Merge Join (cost=57257969.62..12983795937.97 rows=4308749788074 > width=121) > Merge Cond: (table2.row_id = table1.row_id) > -> Sort (cost=15885110.79..16029412.85 rows=288604128 width=8) > Sort Key: table2.row_id > -> Seq Scan on table2 (cost=0.00..2137231.26 rows=288604128 > width=8) > -> Materialize (cost=41372858.83..42105903.14 rows=293217725 > width=121) > -> Sort (cost=41372858.83..41519467.69 rows=293217725 width=121) > Sort Key: table1.row_id > -> Seq Scan on todo (cost=0.00..5922587.45 rows=293217725 > width=121) > (9 rows) > > > Thanks, > -- > Joshua Rubin > > > > On Tue, Jul 13, 2010 at 5:08 PM, Julian Mehnle <jul...@mehnle.net> wrote: > >> Joshua Rubin wrote: >> >> > I have two tables each with nearly 300M rows. There is a 1:1 >> > relationship between the two tables and they are almost always joined >> > together in queries. The first table has many columns, the second has >> > a foreign key to the primary key of the first table and one more >> > column. It is expected that for every row in table1, there is a >> > corresponding row in table2. We would like to just add the one column >> > to the first table and drop the second table to allow us to index this >> > extra column. >> > >> > This query would work after adding the column to the first table: >> > UPDATE table1 SET new_column = table2.new_column FROM table2 WHERE >> > table1.row_id = table2.row_id; >> > >> > However, this will take much too long, I have not successfully >> > completed this on our staging server after running it for 3+ days. >> >> Can you get the query plan (EXPLAIN) of the update query? My guess is the >> join cost scales superlinearly. >> >> You might be able to chop this up into smaller UPDATEs by limiting the >> rows to be updated in each round by the primary key. >> >> E.g.: >> >> UPDATE table1 SET new_column = table2.new_column FROM table2 >> WHERE >> table1.row_id = table2.row_id and >> table1.row_id >= 0e6 and table1.row_id < 1e6 and >> table2.row_id >= 0e6 and table2.row_id < 1e6; >> >> for a moving row_id window. >> >> This has helped me in the past with a similar scenario (where both tables >> were partitioned by the PK, but it would presumably still work in the >> unpartitioned case). >> >> -Julian >> > >