Good information. I did forget to mention that I am using PostgreSQL 11.5. I 
also was not aware of the distinction between PROCEDURE and FUNCTION, so I 
guess I used the wrong terminology there when stating that new_value is the 
result of a stored procedure. It’s actually a function.

So would your suggestion then be to create a procedure that loops through the 
records, calculating and committing each one (or, as in your older Postgres 
example, batches of 10k to 20k)?
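
Just to make sure I'm picturing this correctly, here is a rough (untested) sketch 
of what such a procedure might look like, assuming an integer primary key column 
named id and batches of 10,000 (table_name, changed_field, and new_value() as in 
my original message):

    -- Sketch only: commits each batch so a failure doesn't roll back prior work.
    CREATE PROCEDURE update_in_batches()
    LANGUAGE plpgsql
    AS $$
    DECLARE
        start_id integer;
        max_id   integer;
    BEGIN
        SELECT min(id), max(id) INTO start_id, max_id FROM table_name;
        WHILE start_id <= max_id LOOP
            UPDATE table_name
               SET changed_field = new_value()
             WHERE id >= start_id AND id < start_id + 10000;
            COMMIT;  -- allowed inside a procedure as of PostgreSQL 11
            start_id := start_id + 10000;
        END LOOP;
    END;
    $$;

    -- Invoked with: CALL update_in_batches();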

Good point on the update being disk I/O bound rather than processor bound, but 
wouldn’t that depend on how complicated the actual update is? Still, there is a 
good chance you are correct, so that aspect is probably not worth spending too 
much time on.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> On Jan 6, 2020, at 10:05 AM, Justin <zzzzz.g...@gmail.com> wrote:
> 
> There are several ways to actually do this.
> 
> If you have PostgreSQL 11 or higher, there is now CREATE PROCEDURE, which 
> allows committing transactions inside the procedure; one drawback is that it 
> cannot be parallelized from inside the procedure.
> https://www.postgresql.org/docs/11/sql-createprocedure.html
> https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
> 
> If it's an older version, then a Python script (or other scripting language) 
> that iterates over the data in batches of, say, 10 to 20K rows will do what 
> you want:
> 
> for i in (list of batch start IDs):
>     BEGIN;
>     UPDATE table_name SET changed_field = new_value()
>         WHERE id <@ int4range(i, i + 10000);
>     COMMIT;
> 
> 
> To process in parallel, a simple Python script (or other scripting language) 
> can be used to open many connections, each working on a slice of the data in 
> parallel. But given the simple update, it will NOT help performance: this 
> will be hard-disk I/O bound, not processor bound, which is where 
> parallelization helps.
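> 
> If you do want to try it anyway, something along these lines could generate 
> the slice boundaries for separate sessions to work on (this assumes an 
> integer primary key column named id):
> 
>     -- Each resulting row is one slice a separate psql session could claim.
>     SELECT lo AS start_id, lo + 10000 AS end_id
>     FROM generate_series((SELECT min(id) FROM table_name),
>                          (SELECT max(id) FROM table_name),
>                          10000) AS lo;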
> 
> 
> 
> On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrews...@alaska.edu> wrote:
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straightforward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
> 
> Given that each row update is completely independent of any other row, I have 
> the following questions:
> 
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 
