Good information. I did forget to mention that I am using PostgreSQL 11.5. I also was not aware of the distinction between PROCEDURE and FUNCTION, so I guess I used the wrong terminology there when stating that new_value is the result of a stored procedure. It’s actually a function.
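
Just to make sure I'm following the PROCEDURE approach correctly, here is roughly what I'm picturing for 11.5. This is only a sketch: my_table, id, and changed_field are placeholder names, new_value() stands in for my actual function, and the batch size is arbitrary.

CREATE PROCEDURE update_in_batches(batch_size integer DEFAULT 10000)
LANGUAGE plpgsql
AS $$
DECLARE
    start_id bigint;
    max_id   bigint;
BEGIN
    SELECT min(id), max(id) INTO start_id, max_id FROM my_table;
    WHILE start_id <= max_id LOOP
        -- recalculate and store the changed value for one batch of rows
        UPDATE my_table
           SET changed_field = new_value()
         WHERE id >= start_id
           AND id <  start_id + batch_size;
        COMMIT;  -- transaction control is allowed in a PROCEDURE (not a FUNCTION) on 11+
        start_id := start_id + batch_size;
    END LOOP;
END;
$$;

-- invoked from psql, outside of any explicit transaction block:
CALL update_in_batches(20000);

If I am reading the documentation right, each COMMIT makes the preceding batch permanent, so a failure partway through would only lose the batch currently in progress rather than the whole 50 million rows.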
So would your suggestion then be to create a procedure that loops through the records, calculating and committing each one (or, as in your older Postgres example, batches of 10k to 20k), i.e. something along the lines of the sketch above? Good point on HD I/O bound vs. processor bound, but wouldn't that depend on how complicated the actual update is? Still, there is a good chance you are correct in that statement, so that aspect is probably not worth spending too much time on. (I have also put a rough sketch of what I meant by manually chunking the data at the very bottom of this message.)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

> On Jan 6, 2020, at 10:05 AM, Justin <zzzzz.g...@gmail.com> wrote:
>
> There are several ways to actually do this.
>
> If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows committing transactions; one drawback is that it cannot parallelize from inside the procedure.
> https://www.postgresql.org/docs/11/sql-createprocedure.html
> https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
>
> If it's an older version, then a Python script or other scripting language that iterates over the data, say 10K to 20K rows at a time, will do what you want:
>
> for i in list of IDs:
>     BEGIN;
>     UPDATE table_name SET changed_field = new_value()
>         WHERE int4range(i, i + 10000) @> ID;
>     COMMIT;
>
> To create parallel processes, a simple Python script or other scripting language can be used to create many connections working the data in parallel, but given the simple update it will NOT help performance: this will be hard-disk I/O bound, not processor bound, which is where parallelization helps.
>
> On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrews...@alaska.edu> wrote:
> Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straightforward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc.), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I have the following questions:
>
> 1) Is there any way to set the command such that each row change is committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn't be too bad (there are a number of logical segregations I can apply); I'm just wondering if there is a more automatic option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell: 907-328-9145
>
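
For reference, this is roughly what I meant in my original message (quoted above) by manually breaking the data into chunks: each statement would be run in its own psql session against its own ID range, so the chunks proceed in parallel on separate connections. Again just a sketch with placeholder names and arbitrary ranges, assuming an int4 id column (a plain "id >= x AND id < y" would work just as well):

-- session 1
UPDATE my_table SET changed_field = new_value()
 WHERE int4range(1, 10000001) @> id;

-- session 2
UPDATE my_table SET changed_field = new_value()
 WHERE int4range(10000001, 20000001) @> id;

-- ...and so on for the remaining ranges, one psql session each

Each of those is still a single large transaction for its range, though, so in practice I would probably combine the ranges with the batched-commit procedure above rather than run them exactly as written.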