> On Jan 6, 2020, at 11:38 AM, Christopher Browne <cbbro...@gmail.com> wrote:
>
> On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrews...@alaska.edu> wrote:
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbro...@gmail.com> wrote:
>>
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrews...@alaska.edu> wrote:
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is straightforward,
>> as I can just issue an "UPDATE table_name SET changed_field=new_value();"
>> (yes, new_value is the result of a stored procedure, if that makes a
>> difference) command via psql, and it should work. However, due to the large
>> number of records this command will obviously take a while, and if anything
>> goes wrong during the update (one bad value in row 45 million, lost
>> connection, etc.), all the work that has been done already will be lost due
>> to the transactional nature of such commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data into
>> chunks and running a separate psql/update process for each chunk? Honestly,
>> manual parallelizing wouldn’t be too bad (there are a number of logical
>> segregations I can apply), I’m just wondering if there is a more automatic
>> option.
>>
>> Yeah, I'd be inclined to do this in batches.
>>
>> If, for instance, the table has a nice primary key, then I'd capture the
>> primary keys into a side table, and grab tuples from the side table to
>> process in more bite-sized batches, say, of a few thousand tuples per batch.
>>
>> create table just_keys as select pk_column from big_historical_table;
>> alter table just_keys add column processed boolean;
>> create index jkpk on just_keys(pk_column) where (processed is null);
>>
>> then loop repeatedly along the lines...
>>
>> create temp table iteration as select pk_column from just_keys where
>> processed is null limit 1000;
>> [do update on big_historical_table where pk_column in (select pk_column from
>> iteration)]
>> update iteration set processed='true' where pk_column in (select pk_column
>> from iteration);
>> drop table iteration;
>>
>> Parallelization is absolutely an interesting idea; if you want to use 8
>> processes, then use a cycling sequence on the side table to spread tuples
>> across the 8 processes, so that they can grab their own tuples and not block
>> one another.
>>
>> In that case, more like...
>>
>> create temp sequence seq_procs start with 1 maxvalue 8 cycle;
>> create temp table just_keys as select pk_column, false::boolean as
>> processed, nextval('seq_procs') as batch_id from big_historical_table;
>>
>> The individual iterations then look for values in just_keys corresponding to
>> their assigned batch number.
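If I'm following correctly, each of the 8 workers would then run its own
little loop, something like the sketch below. Untested on my end, and a
couple of assumptions baked in: the side table and sequence would presumably
need to be regular rather than temp objects so that all eight psql sessions
can see them, and :batch is just a psql variable I'd set per session (e.g.
"\set batch 3"):

begin;

-- grab the next slice of this worker's keys
create temp table iteration as
    select pk_column
      from just_keys
     where processed is not true
       and batch_id = :batch
     limit 1000;

-- recalculate just those rows in the big table
update big_historical_table
   set changed_field = new_value()
 where pk_column in (select pk_column from iteration);

-- mark them done in the side table (just_keys, not iteration; see below)
update just_keys
   set processed = true
 where pk_column in (select pk_column from iteration);

drop table iteration;
commit;

Each pass commits its own thousand rows, so a failure partway through only
loses the batch in flight; the worker repeats until its iteration table comes
back empty.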
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be IO bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching procedure
> makes sense. I assume you meant update just_keys in your sample rather than
> update iteration on that line just before drop table iteration. Thanks for
> the info.
>
> As for parallelism, if you have a really powerful disk, or lots of disks in
> a disk array, it may help. Or not, as commented.
>
> I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
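Good to know, thanks. Just so I have it written down correctly for when I
actually run this, the marking step in each pass then becomes:

update just_keys set processed='true' where pk_column in (select pk_column from iteration);

with everything else in the loop unchanged.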
>
> You won't find something terribly much more automatic.
>
> Oh, yah, there's a possible further complication; does the application need
> to get stopped to do this update? Is the newest version of the app still
> generating data that needs the rewriting? Sure hope not…

Yeah, a valid concern, but I should be ok on that front. Once I fix the
calculation function, any new records will have the correct value. Plus, the
actual update calculation is idempotent, so if a handful of new records end
up getting re-calculated, that’s not an issue. Granted, the data will look
weird while the re-calculation is in process (part new, part old), but we can
live with that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145