> On Jan 6, 2020, at 11:38 AM, Christopher Browne <cbbro...@gmail.com> wrote:
> 
> On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrews...@alaska.edu> wrote:
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbro...@gmail.com> wrote:
>> 
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrews...@alaska.edu> wrote:
>> Thanks to a change in historical data, I need to update a large number of 
>> records (around 50 million). The update itself is straightforward, as I can 
>> just issue an "UPDATE table_name SET changed_field=new_value();" command via 
>> psql (yes, new_value is the result of a stored procedure, if that makes a 
>> difference), and it should work. However, due to the large number of records 
>> this command will obviously take a while, and if anything goes wrong during 
>> the update (one bad value in row 45 million, a lost connection, etc.), all 
>> the work that has already been done will be lost due to the transactional 
>> nature of such commands (unless I am missing something).
>> 
>> Given that each row update is completely independent of any other row, I 
>> have the following questions:
>> 
>> 1) Is there any way to set the command such that each row change is 
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better 
>> utilize multiple processor cores, other than manually breaking the data into 
>> chunks and running a separate psql/update process for each chunk? Honestly, 
>> manual parallelizing wouldn’t be too bad (there are a number of logical 
>> segregations I can apply); I’m just wondering if there is a more automatic 
>> option.
>> 
>> Yeah, I'd be inclined to do this in batches.
>> 
>> If, for instance, the table has a nice primary key, then I'd capture the 
>> primary keys into a side table, and grab tuples from the side table to 
>> process in more bite-sized batches, say, of a few thousand tuples per batch.
>> 
>> create table just_keys as select pk_column from big_historical_table;
>> alter table just_keys add column processed boolean;
>> create index jkpk on just_keys(pk_column) where (processed is null);
>> then loop repeatedly along the lines...
>> 
>> create temp table iteration as select pk_column from just_keys where 
>> processed is null limit 1000;
>> [do update on big_historical_table where pk_column in (select pk_column from 
>> iteration)]
>> update iteration set processed='true' where pk_column in (select pk_column 
>> from iteration);
>> drop table iteration;
>> 
>> Parallelization is absolutely an interesting idea; if you want to use 8 
>> processes, then use a cycling sequence on the side table to spread tuples 
>> across the 8 processes, so that they can grab their own tuples and not block 
>> one another.
>> 
>> In that case, more like...
>> create temp sequence seq_procs start with 1 maxvalue 8 cycle;
>> create temp table just_keys as select pk_column, false::boolean as 
>> processed, nextval('seq_procs') as batch_id from big_historical_table;
>> 
>> The individual iterations then look for values in just_keys corresponding to 
>> their assigned batch number.
> 
> Sounds like a reasonable approach. As Justin pointed out, it is actually 
> likely that the process will be I/O bound rather than CPU bound, so my 
> parallel idea may not have much merit after all, but the batching procedure 
> makes sense. I assume you meant "update just_keys" in your sample rather than 
> "update iteration" on the line just before "drop table iteration". Thanks for 
> the info.
> 
> As for parallelism, if you have a really powerful disk subsystem (lots of 
> disks in a disk array), it may help. Or not, as commented.
> 
> I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
> 
> You won't find something terribly much more automatic.
> 
> Oh, yeah, there's a possible further complication: does the application need 
> to be stopped to do this update? Is the newest version of the app still 
> generating data that needs rewriting? Sure hope not…

Yeah, a valid concern, but I should be OK on that front. Once I fix the 
calculation function, any new records will have the correct value. Plus, the 
actual update calculation is idempotent, so if a handful of new records end up 
getting recalculated, that’s not an issue. Granted, the data will look weird 
while the recalculation is in progress (part new, part old), but we can live 
with that :-)
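
For the record, here's a rough, untested sketch of how I plan to drive the 
batched update, following Christopher's outline (with the bookkeeping update 
going against just_keys, as he clarified). The table and column names 
(big_historical_table, pk_column, changed_field, new_value()) are placeholders 
for my actual schema, the just_keys_todo index is my own addition, and the 
batch_id filter only matters if multiple workers are run off the cycling 
sequence:

-- One-time setup: capture the keys, tagging each row with a worker id (1..8).
create temp sequence seq_procs start with 1 maxvalue 8 cycle;
create table just_keys as
  select pk_column,
         false::boolean as processed,
         nextval('seq_procs') as batch_id
    from big_historical_table;
create index just_keys_todo on just_keys (batch_id) where not processed;

-- Repeat this block (e.g. one psql invocation per pass from a shell loop)
-- until the big-table update reports UPDATE 0. Each pass is its own
-- transaction, so finished batches stay committed even if a later one fails.
begin;
create temp table iteration as
  select pk_column
    from just_keys
   where not processed
     and batch_id = 1   -- this worker's slice; drop the clause if running serially
   limit 1000;

update big_historical_table t
   set changed_field = new_value()
 where t.pk_column in (select pk_column from iteration);

update just_keys k   -- mark the batch as done in the side table
   set processed = true
 where k.pk_column in (select pk_column from iteration);

drop table iteration;
commit;

For the parallel case I'd just start eight of these loops, each with its own 
batch_id value; since each worker only touches its own slice of just_keys (and 
the corresponding rows of the big table), they shouldn't block one another.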

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
