Re: How to shorten a chain of logically replicated servers

2020-01-06 Thread Mike Lissner
Hi, I don't usually like to bump messages on this list, but since I
sent mine on New Year's Eve, I figured I'd better. Anybody have any
ideas about how to accomplish this? I'm pretty stumped (as you can
probably see).

On Tue, Dec 31, 2019 at 3:51 PM Mike Lissner
 wrote:
>
> Hi, I'm trying to figure out how to shorten a chain of logically
> replicating servers. Right now we have three servers replicating like
> so:
>
> A --> B --> C
>
> And I'd like to remove B from the chain of replication so that I only have:
>
> A --> C
>
> Of course, doing this without losing data is the goal. If the
> replication to C breaks temporarily, that's fine, so long as all the
> changes on A make it to C eventually.
>
> I'm not sure how to proceed with this. My best theory is:
>
> 1. In a transaction, DISABLE the replication from A to B and start a
> new PUBLICATION on A that C will subscribe to in step ③ below. The
> hope is that this will simultaneously stop sending changes to B while
> starting a log of new changes that can later be sent to C.
>
> 2. Let any changes queued on B flush to C. (How to know when they're
> all flushed?)
>
> 3. Subscribe C to the new PUBLICATION created in step ①. Create the
> subscription with copy_data=False. This should send all changes to C
> that hadn't been sent to B, without sending the complete tables.
>
> 4. DROP all replication to/from B (this is just cleanup; the incoming
> changes to B were disabled in step ①, and outgoing changes from B were
> flushed in step ②).
>
> Does this sound even close to the right approach? Logical replication
> can be a bit finicky, so I'd love to have some validation of the
> general approach before I go down this road.
>
> Thanks everybody and happy new year,
>
> Mike
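
For reference, a rough, untested sketch of the commands the four steps above map
to (publication, subscription, and connection names here are hypothetical):

-- Step 1: on B, stop pulling from A; on A, create the new publication for C
ALTER SUBSCRIPTION sub_b_from_a DISABLE;        -- run on B
CREATE PUBLICATION pub_a_for_c FOR ALL TABLES;  -- run on A

-- Step 2: wait until the changes B has already received have flushed to C

-- Step 3: on C, subscribe to the new publication without re-copying the tables
CREATE SUBSCRIPTION sub_c_from_a
  CONNECTION 'host=A dbname=mydb'
  PUBLICATION pub_a_for_c
  WITH (copy_data = false);                     -- run on C

-- Step 4: clean up B's replication in both directions
DROP SUBSCRIPTION sub_c_from_b;                 -- run on C (the old B -> C feed)
DROP SUBSCRIPTION sub_b_from_a;                 -- run on B
DROP PUBLICATION pub_b_for_c;                   -- run on B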




UPDATE many records

2020-01-06 Thread Israel Brewster
Thanks to a change in historical data, I have a need to update a large number 
of records (around 50 million). The update itself is straight forward, as I can 
just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
new_value is the result of a stored procedure, if that makes a difference) 
command via psql, and it should work. However, due to the large number of 
records this command will obviously take a while, and if anything goes wrong 
during the update (one bad value in row 45 million, lost connection, etc), all 
the work that has been done already will be lost due to the transactional 
nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have 
the following questions:

1) Is there any way to set the command such that each row change is committed 
as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize 
multiple processor cores, other than manually breaking the data into chunks and 
running a separate psql/update process for each chunk? Honestly, manual 
parallelizing wouldn’t be too bad (there are a number of logical segregations I 
can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

2020-01-06 Thread Adrian Klaver

On 1/6/20 10:36 AM, Israel Brewster wrote:
Thanks to a change in historical data, I have a need to update a large 
number of records (around 50 million). The update itself is straight 
forward, as I can just issue an "UPDATE table_name SET 
changed_field=new_value();" (yes, new_value is the result of a stored 
procedure, if that makes a difference) command via psql, and it should 
work. However, due to the large number of records this command will 
obviously take a while, and if anything goes wrong during the update 
(one bad value in row 45 million, lost connection, etc), all the work 
that has been done already will be lost due to the transactional nature 
of such commands (unless I am missing something).


Given that each row update is completely independent of any other row, I 
have the following questions:


1) Is there any way to set the command such that each row change is 
committed as it is calculated?


Pretty sure:
UPDATE table_name SET changed_field=new_value();
is seen as a single statement and is all or none.

If you want to go row by row, you will need to run the statement on a
row-by-row basis, or perhaps in batches.
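
For instance, a minimal sketch assuming an integer primary key id (under psql's
default autocommit, each statement commits on its own):

UPDATE table_name SET changed_field = new_value() WHERE id >= 0 AND id < 1000000;
UPDATE table_name SET changed_field = new_value() WHERE id >= 1000000 AND id < 2000000;
-- ... and so on, one key range per statement/transaction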


2) Is there some way to run this command in parallel in order to better 
utilize multiple processor cores, other than manually breaking the data 
into chunks and running a separate psql/update process for each chunk? 
Honestly, manual parallelizing wouldn’t be too bad (there are a number 
of logical segregations I can apply), I’m just wondering if there is a 
more automatic option.


This is a good time to ask: what Postgres version?

I am still working out the recent parallel query additions. Not sure 
whether they apply to UPDATE or not.



---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, 6 Jan 2020 at 13:36, Israel Brewster  wrote:

> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update (one
> bad value in row 45 million, lost connection, etc), all the work that has
> been done already will be lost due to the transactional nature of such
> commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
> logical segregations I can apply), I’m just wondering if there is a more
> automatic option.
>

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the
primary keys into a side table, and grab tuples from the side table to
process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where
processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column
from iteration)]
update iteration set processed='true' where pk_column in (select pk_column
from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8
processes, then use a cycling sequence on the side table to spread tuples
across the 8 processes, so that they can grab their own tuples and not
block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as
processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding
to their assigned batch number.
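
For instance, worker number 3 (a hypothetical batch number) might grab its next
chunk along these lines:

create temp table iteration as
  select pk_column from just_keys
  where batch_id = 3 and not processed
  limit 1000;
-- update big_historical_table for these keys, mark them processed, drop
-- iteration, and repeat until nothing is returned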

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: UPDATE many records

2020-01-06 Thread Justin
There are several ways to actually do this

If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which
allows committing transactions; one drawback is that it cannot be
parallelized from inside the procedure:
https://www.postgresql.org/docs/11/sql-createprocedure.html
https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11

If it's an older version, then a Python script (or other scripting language)
that iterates over the data in batches of, say, 10 to 20K rows will do what
you want:

for i in list of IDs:
    begin;
    UPDATE table_name SET changed_field=new_value() WHERE int4range(i, i+1) @> id;
    commit;


To create parallel processes, a simple Python script (or other scripting
language) can be used to open many connections working on the data in
parallel, but given the simple update it will NOT help performance: this
will be hard-disk I/O bound, not process bound, which is where
parallelization helps.



On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster 
wrote:

> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update (one
> bad value in row 45 million, lost connection, etc), all the work that has
> been done already will be lost due to the transactional nature of such
> commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
> logical segregations I can apply), I’m just wondering if there is a more
> automatic option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>


Re: UPDATE many records

2020-01-06 Thread Israel Brewster
Good information. I did forget to mention that I am using PostgreSQL 11.5. I 
also was not aware of the distinction between PROCEDURE and FUNCTION, so I 
guess I used the wrong terminology there when stating that new_value is the 
result of a stored procedure. It’s actually a function.

So would your suggestion then be to create a procedure that loops through the 
records, calculating and committing each one (or, as in your older Postgres 
example, batches of 10k to 20k)?

Good point on the HD I/O bound vs processor bound, but wouldn’t that depend on 
how complicated the actual update is? Still, there is a good chance you are 
correct in that statement, so that aspect is probably not worth spending too 
much time on.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> On Jan 6, 2020, at 10:05 AM, Justin  wrote:
> 
> There are several ways to actually do this
> 
> If you have Postgresql 11 or higher we now have Create Procedure  that allows 
> committing transactions,  one draw back is it can not parallel from inside 
> the procedure
> https://www.postgresql.org/docs/11/sql-createprocedure.html 
> 
> https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
>  
> 
> 
> If its an older version then Python Script or other scripting language to 
> iterates over the data say 10 to 20K will do what you want
> 
> for i in list of IDs 
> begin ;
> "UPDATE table_name SET changed_field=new_value()  where ID @> int4range(i, 
> i+1);
> commit;
> 
> 
> To create parallel process simple Python script or other scripting language 
> can be used to create many connections working the data in parallel  but 
> given the simple update it will NOT help in performance,  this  will be Hard 
> disk IO bound,  not process bound where parallelization  helps 
> 
> 
> 
> On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster  > wrote:
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straight forward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
> 
> Given that each row update is completely independent of any other row, I have 
> the following questions:
> 
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 



Re: UPDATE many records

2020-01-06 Thread Israel Brewster
> On Jan 6, 2020, at 10:08 AM, Christopher Browne  wrote:
> 
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster  > wrote:
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straight forward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
> 
> Given that each row update is completely independent of any other row, I have 
> the following questions:
> 
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> 
> Yeah, I'd be inclined to do this in batches.
> 
> If, for instance, the table has a nice primary key, then I'd capture the 
> primary keys into a side table, and grab tuples from the side table to 
> process in more bite-sized batches, say, of a few thousand tuples per batch.
> 
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> then loop repeatedly along the lines...
> 
> create temp table iteration as select pk_column from just_keys where 
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column from 
> iteration)]
> update iteration set processed='true' where pk_column in (select pk_column 
> from iteration);
> drop table iteration;
> 
> Parallelization is absolutely an interesting idea; if you want to use 8 
> processes, then use a cycling sequence on the side table to spread tuples 
> across the 8 processes, so that they can grab their own tuples and not block 
> one another.
> 
> In that case, more like...
> create temp sequence seq_procs start with 1 maxval 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as processed, 
> nextval('seq_procs') as batch_id from big_historical_table;
> 
> The individual iterations then look for values in just_keys corresponding to 
> their assigned batch number.

Sounds like a reasonable approach. As Justin pointed out, it is actually likely 
that the process will be IO bound rather than CPU bound, so my parallel idea 
may not have much merit after all, but the batching procedure makes sense. I 
assume you meant update just_keys in your sample rather than update iteration 
on that line just before drop table iteration. Thanks for the info!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> -- 
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"



Re: UPDATE many records

2020-01-06 Thread Alban Hertroys


> On 6 Jan 2020, at 21:15, Israel Brewster  wrote:
> 
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne  wrote:
>> 
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster  wrote:
>> Thanks to a change in historical data, I have a need to update a large 
>> number of records (around 50 million). The update itself is straight 
>> forward, as I can just issue an "UPDATE table_name SET 
>> changed_field=new_value();" (yes, new_value is the result of a stored 
>> procedure, if that makes a difference) command via psql, and it should work. 
>> However, due to the large number of records this command will obviously take 
>> a while, and if anything goes wrong during the update (one bad value in row 
>> 45 million, lost connection, etc), all the work that has been done already 
>> will be lost due to the transactional nature of such commands (unless I am 
>> missing something).
>> 
>> Given that each row update is completely independent of any other row, I 
>> have the following questions:
>> 
>> 1) Is there any way to set the command such that each row change is 
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better 
>> utilize multiple processor cores, other than manually breaking the data into 
>> chunks and running a separate psql/update process for each chunk? Honestly, 
>> manual parallelizing wouldn’t be too bad (there are a number of logical 
>> segregations I can apply), I’m just wondering if there is a more automatic 
>> option.
>> 
>> Yeah, I'd be inclined to do this in batches.

I think you’re overcomplicating the matter.

I’d just do it as a single update in one transaction. It’s only 50M rows. It 
may take half an hour or so on decent hardware, depending on how 
resource-intensive your function is.

If that fails[1], only then would I start looking into batching things. But 
then you still need to figure out why it fails and what to do about that; if it 
fails it will probably fail fast, and if not, then you’re looking at a one-off 
situation that won’t require more than a few workarounds - after which you can 
just run the update again.

Ad 1). No harm has been done, it’s a single transaction that rolled back.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: UPDATE many records

2020-01-06 Thread Justin
As you have access to procedures, you can create a loop and then issue an
update + commit per batch, so something like this should work in plpgsql
(as the body of a procedure or DO block; COMMIT is not allowed in a plain
function):

declare
  icount int := 0;
  new_count int := 0;
begin
  select count(*) into icount from mytable;

  loop
    update mytable set myvalue = newvalue()
      where id between new_count and new_count + 10000;
    commit;
    new_count := new_count + 10000;
    exit when new_count > icount;
  end loop;
end;


I am going to put a caveat on this: if the newvalue() function is complex and
takes a lot of CPU cycles to do its thing, then parallelism would help.
However, if this function looks at the table being updated, it can really
complicate things, as the parallel workers would be looking at stale records,
which could be bad...






On Mon, Jan 6, 2020 at 3:07 PM Israel Brewster 
wrote:

> Good information. I did forget to mention that I am using PostgreSQL 11.5.
> I also was not aware of the distinction between PROCEDURE and FUNCTION, so
> I guess I used the wrong terminology there when stating that new_value is
> the result of a stored procedure. It’s actually a function.
>
> So would your suggestion then be to create a procedure that loops through
> the records, calculating and committing each one (or, as in your older
> Postgres example, batches of 10k to 20k)?
>
> Good point on the HD I/O bound vs processor bound, but wouldn’t that
> depend on how complicated the actual update is? Still, there is a good
> chance you are correct in that statement, so that aspect is probably not
> worth spending too much time on.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
> On Jan 6, 2020, at 10:05 AM, Justin  wrote:
>
> There are several ways to actually do this
>
> If you have Postgresql 11 or higher we now have Create Procedure  that
> allows committing transactions,  one draw back is it can not parallel from
> inside the procedure
> https://www.postgresql.org/docs/11/sql-createprocedure.html
>
> https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
>
> If its an older version then Python Script or other scripting language to
> iterates over the data say 10 to 20K will do what you want
>
> for i in list of IDs
> begin ;
> "UPDATE table_name SET changed_field=new_value()  where ID @> int4range(i,
> i+1);
> commit;
>
>
> To create parallel process simple Python script or other scripting
> language can be used to create many connections working the data in
> parallel  but given the simple update it will NOT help in performance,
> this  will be Hard disk IO bound,  not process bound where parallelization
> helps
>
>
>
> On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster 
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is straight
>> forward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, lost connection, etc), all the work that has
>> been done already will be lost due to the transactional nature of such
>> commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>>
>>
>


Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, Jan 6, 2020, 3:15 PM Israel Brewster  wrote:

> On Jan 6, 2020, at 10:08 AM, Christopher Browne 
> wrote:
>
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster 
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is straight
>> forward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, lost connection, etc), all the work that has
>> been done already will be lost due to the transactional nature of such
>> commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>>
>
> Yeah, I'd be inclined to do this in batches.
>
> If, for instance, the table has a nice primary key, then I'd capture the
> primary keys into a side table, and grab tuples from the side table to
> process in more bite-sized batches, say, of a few thousand tuples per batch.
>
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> then loop repeatedly along the lines...
>
> create temp table iteration as select pk_column from just_keys where
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column
> from iteration)]
> update iteration set processed='true' where pk_column in (select pk_column
> from iteration);
> drop table iteration;
>
>
> Parallelization is absolutely an interesting idea; if you want to use 8
> processes, then use a cycling sequence on the side table to spread tuples
> across the 8 processes, so that they can grab their own tuples and not
> block one another.
>
> In that case, more like...
> create temp sequence seq_procs start with 1 maxval 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as
> processed, nextval('seq_procs') as batch_id from big_historical_table;
>
> The individual iterations then look for values in just_keys corresponding
> to their assigned batch number.
>
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be IO bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching procedure
> makes sense. I assume you meant update just_keys in your sample rather than
> update iteration on that line just before drop table iteration. Thanks for
> the info
>

As for parallelism, if you have a really powerful disk setup (lots of disks in
a disk array), it may help.  Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).

You won't find something terribly much more automatic.

Oh, yah, there's a possible further complication; does the application need
to get stopped to do this update?  Is the newest version of the app still
generating data that needs the rewriting?  Sure hope not...


Re: UPDATE many records

2020-01-06 Thread Rob Sargent


> On Jan 6, 2020, at 1:29 PM, Alban Hertroys  wrote:
> 
> I think you’re overcomplicating the matter.
> 
> I’d just do it as a single update in one transaction. It’s only 50M rows. It 
> may take half an hour or so on decent hardware, depending on how 
> resource-intensive your function is.
> 
I must emphasize: This estimate is HIGHLY dependent on hardware and the 
complexity of the table (number of indices, etc).  (I suspect there’s a 
correlation between table size (business value) and number of indices)

> If that fails[1], only then would I start looking into batching things. But 
> then you still need to figure out why it fails and what to do about that; if 
> it fails it will probably fail fast, and if not, then you’re looking at a 
> one-off situation that won’t require more than a few workarounds - after 
> which you can just run the update again.
> 
> Ad 1). No harm has been done, it’s a single transaction that rolled back.
> 
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.



Re: UPDATE many records

2020-01-06 Thread Israel Brewster
> On Jan 6, 2020, at 11:38 AM, Christopher Browne  wrote:
> 
> 
> 
> On Mon, Jan 6, 2020, 3:15 PM Israel Brewster  > wrote:
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne > > wrote:
>> 
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster > > wrote:
>> Thanks to a change in historical data, I have a need to update a large 
>> number of records (around 50 million). The update itself is straight 
>> forward, as I can just issue an "UPDATE table_name SET 
>> changed_field=new_value();" (yes, new_value is the result of a stored 
>> procedure, if that makes a difference) command via psql, and it should work. 
>> However, due to the large number of records this command will obviously take 
>> a while, and if anything goes wrong during the update (one bad value in row 
>> 45 million, lost connection, etc), all the work that has been done already 
>> will be lost due to the transactional nature of such commands (unless I am 
>> missing something).
>> 
>> Given that each row update is completely independent of any other row, I 
>> have the following questions:
>> 
>> 1) Is there any way to set the command such that each row change is 
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better 
>> utilize multiple processor cores, other than manually breaking the data into 
>> chunks and running a separate psql/update process for each chunk? Honestly, 
>> manual parallelizing wouldn’t be too bad (there are a number of logical 
>> segregations I can apply), I’m just wondering if there is a more automatic 
>> option.
>> 
>> Yeah, I'd be inclined to do this in batches.
>> 
>> If, for instance, the table has a nice primary key, then I'd capture the 
>> primary keys into a side table, and grab tuples from the side table to 
>> process in more bite-sized batches, say, of a few thousand tuples per batch.
>> 
>> create table just_keys as select pk_column from big_historical_table;
>> alter table just_keys add column processed boolean;
>> create index jkpk on just_keys(pk_column) where (processed is null);
>> then loop repeatedly along the lines...
>> 
>> create temp table iteration as select pk_column from just_keys where 
>> processed is null limit 1000;
>> [do update on big_historical_table where pk_column in (select pk_column from 
>> iteration)]
>> update iteration set processed='true' where pk_column in (select pk_column 
>> from iteration);
>> drop table iteration;
>> 
>> Parallelization is absolutely an interesting idea; if you want to use 8 
>> processes, then use a cycling sequence on the side table to spread tuples 
>> across the 8 processes, so that they can grab their own tuples and not block 
>> one another.
>> 
>> In that case, more like...
>> create temp sequence seq_procs start with 1 maxval 8 cycle;
>> create temp table just_keys as select pk_column, false::boolean as 
>> processed, nextval('seq_procs') as batch_id from big_historical_table;
>> 
>> The individual iterations then look for values in just_keys corresponding to 
>> their assigned batch number.
> 
> Sounds like a reasonable approach. As Justin pointed out, it is actually 
> likely that the process will be IO bound rather than CPU bound, so my 
> parallel idea may not have much merit after all, but the batching procedure 
> makes sense. I assume you meant update just_keys in your sample rather than 
> update iteration on that line just before drop table iteration. Thanks for 
> the info
> 
> As for parallelism, if you have really powerful disk, lots of disks on disk 
> array, it may help.  Or not, as commented.
> 
> I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
> 
> You won't find something terribly much more automatic.
> 
> Oh, yah, there's a possible further complication; does the application need 
> to get stopped to do this update?  Is the newest version of the app still 
> generating data that needs the rewriting?  Sure hope not…

Yeah, a valid concern, but I should be ok on that front. Once I fix the 
calculation function, any new records will have the correct value. Plus, the 
actual update calculation is idempotent, so if a handful of new records end up 
getting re-calculated, that’s not an issue. Granted, the data will look weird 
while the re-calculation is in process (part new, part old), but we can live 
with that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

2020-01-06 Thread Israel Brewster


> On Jan 6, 2020, at 11:40 AM, Rob Sargent  wrote:
> 
> 
> 
>> On Jan 6, 2020, at 1:29 PM, Alban Hertroys > > wrote:
>> 
>> I think you’re overcomplicating the matter.
>> 
>> I’d just do it as a single update in one transaction. It’s only 50M rows. It 
>> may take half an hour or so on decent hardware, depending on how 
>> resource-intensive your function is.
>> 
> I must emphasize: This estimate is HIGHLY dependent on hardware and the 
> complexity of the table (number of indices, etc).  (I suspect there’s a 
> correlation between table size (business value) and number of indices)

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records 
(or whatever number can run in a minute or so), watching the processor 
utilization as it runs. That should give me a better feel for where the 
bottlenecks may be, and how long the entire update process would take. I’m 
assuming, of course, that the total time would scale more or less linearly with 
the number of records.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
>> If that fails[1], only then would I start looking into batching things. But 
>> then you still need to figure out why it fails and what to do about that; if 
>> it fails it will probably fail fast, and if not, then you’re looking at a 
>> one-off situation that won’t require more than a few workarounds - after 
>> which you can just run the update again.
>> 
>> Ad 1). No harm has been done, it’s a single transaction that rolled back.
>> 
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll find there is no forest.
> 



Re: UPDATE many records

2020-01-06 Thread Michael Lewis
>
> I’m thinking it might be worth it to do a “quick” test on 1,000 or so
> records (or whatever number can run in a minute or so), watching the
> processor utilization as it runs. That should give me a better feel for
> where the bottlenecks may be, and how long the entire update process would
> take. I’m assuming, of course, that the total time would scale more or less
> linearly with the number of records.
>

I think that depends on how you identify and limit the update to those
1000 records. If it is using a primary key with specific keys in an array,
probably close to linear increase because the where clause isn't impactful
to the overall execution time. If you write a sub-query that is slow, then
you would need to exclude that from the time. You can always run explain
analyze on the update and rollback rather than commit.
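
For example (assuming an integer primary key id), timing a sample without
keeping its effects:

BEGIN;
EXPLAIN ANALYZE
  UPDATE table_name SET changed_field = new_value() WHERE id BETWEEN 1 AND 10000;
ROLLBACK;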


Re: UPDATE many records

2020-01-06 Thread Israel Brewster

> On Jan 6, 2020, at 11:54 AM, Michael Lewis  wrote:
> 
> I’m thinking it might be worth it to do a “quick” test on 1,000 or so records 
> (or whatever number can run in a minute or so), watching the processor 
> utilization as it runs. That should give me a better feel for where the 
> bottlenecks may be, and how long the entire update process would take. I’m 
> assuming, of course, that the total time would scale more or less linearly 
> with the number of records.
> 
> I think that depends on how your identify and limit the update to those 1000 
> records. If it is using a primary key with specific keys in an array, 
> probably close to linear increase because the where clause isn't impactful to 
> the overall execution time. If you write a sub-query that is slow, then you 
> would need to exclude that from the time. You can always run explain analyze 
> on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN 
ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during 
which time I did see 
a postmaster process consuming 100% CPU. Upping the test to 20,819 records took 
about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So 
by my calculations, doing all 50M records would take around 10 hours.
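
(Worked out: 16.5 s / 20,819 rows ≈ 0.8 ms per row, and 0.8 ms × 50,000,000 rows 
≈ 40,000 s, i.e. roughly 11 hours, so the ~10 hour figure is in the right 
ballpark.)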

One potentially significant note: most of the execution time is spent in a 
trigger. This trigger is actually what’s doing the REAL update that I need to 
happen. If it would make a difference, I could easily pull the trigger code out 
to a separate function that I just call directly (with triggers temporarily 
disabled). My thinking is that calling a function is calling a function, and 
the fact that it is currently called via a trigger rather than direct is of 
little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



RE: UPDATE many records

2020-01-06 Thread Mark Zellers
Just out of curiosity, what kind of trigger are you using, a row level trigger 
or a statement level trigger?  If you are using a row level trigger, see if you 
can achieve your requirements using a statement level trigger instead.  I’m 
relatively new to Postgres, so there could be some limit that I’m not aware of, 
but my understanding is that you have access to the old and new values of the 
updated rows in the after statement trigger.  It would likely be much more 
performant to do your operation once after the statement is done rather than 
firing a trigger on every changed row.
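
A minimal sketch of that approach using a transition table (available since
PostgreSQL 10; the table, column, and function names here are hypothetical):

CREATE FUNCTION recalc_after_update() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  -- changed_rows holds every row modified by the triggering statement
  UPDATE summary_table s
     SET derived_value = c.changed_field
    FROM changed_rows c
   WHERE s.id = c.id;
  RETURN NULL;  -- return value is ignored for AFTER ... FOR EACH STATEMENT
END;
$$;

CREATE TRIGGER recalc_stmt
  AFTER UPDATE ON table_name
  REFERENCING NEW TABLE AS changed_rows
  FOR EACH STATEMENT
  EXECUTE FUNCTION recalc_after_update();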

Regards,

Mark Z.


From: Israel Brewster 
Sent: Monday, January 6, 2020 1:24 PM
To: Michael Lewis 
Cc: Rob Sargent ; Alban Hertroys ; 
Christopher Browne ; pgsql-generallists.postgresql.org 

Subject: Re: UPDATE many records


On Jan 6, 2020, at 11:54 AM, Michael Lewis 
mailto:mle...@entrata.com>> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records 
(or whatever number can run in a minute or so), watching the processor 
utilization as it runs. That should give me a better feel for where the 
bottlenecks may be, and how long the entire update process would take. I’m 
assuming, of course, that the total time would scale more or less linearly with 
the number of records.

I think that depends on how your identify and limit the update to those 1000 
records. If it is using a primary key with specific keys in an array, probably 
close to linear increase because the where clause isn't impactful to the 
overall execution time. If you write a sub-query that is slow, then you would 
need to exclude that from the time. You can always run explain analyze on the 
update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN 
ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during 
which time I did see a postmaster process consuming 100% CPU. Upping the test 
to 20,819 records took about 16.5 seconds, so that looks relatively linear to 
me. Also, CPU bound. So by my calculations, doing all 50M records would take 
around 10 hours.

One potentially significant note: most of the execution time is spent in a 
trigger. This trigger is actually what’s doing the REAL update that I need to 
happen. If it would make a difference, I could easily pull the trigger code out 
to a separate function that I just call directly (with triggers temporarily 
disabled). My thinking is that calling a function is calling a function, and 
the fact that it is currently called via a trigger rather than direct is of 
little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: jsonb_set() strictness considered harmful to data

2020-01-06 Thread Andrew Dunstan
On Thu, Nov 28, 2019 at 2:15 PM Andrew Dunstan
 wrote:
>
>
> On 11/27/19 9:35 PM, Michael Paquier wrote:
> > On Fri, Nov 15, 2019 at 09:45:59PM +0100, Pavel Stehule wrote:
> >> Maybe ERRCODE_NULL_VALUE_NOT_ALLOWED, and "NULL is not allowed",
> >> errdetail - a exception due setting "null_value_treatment" =>
> >> raise_exception
> >> and maybe some errhint - "Maybe you would to use Jsonb NULL - 
> >> "null"::jsonb"
> >>
> >> I don't know, but in this case, the exception should be verbose. This is
> >> "rich" function with lot of functionality
> > @Andrew: This patch is waiting on input from you for a couple of days
> > now.
> >
>
>


Updated version including docco and better error message.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
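
For reference, the documented behaviour as psql-ready calls (examples and
results are taken from the doc changes in the patch below):

select jsonb_set_lax('[{"f1":1,"f2":null},2,null,3]', '{0,f1}', null);
-- [{"f1": null, "f2": null}, 2, null, 3]          (default: use_json_null)

select jsonb_set_lax('[{"f1":99,"f2":null},2]', '{0,f3}', null, true, 'return_target');
-- [{"f1": 99, "f2": null}, 2]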
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 57a1539506..7adb6a2d04 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12203,6 +12203,9 @@ table2-mapping
   
jsonb_set
   
+  
+   jsonb_set_lax
+  
   
jsonb_insert
   
@@ -12517,6 +12520,26 @@ table2-mapping
  [{"f1": 1, "f2": null, "f3": [2, 3, 4]}, 2]
 

+  
+   jsonb_set_lax(target jsonb, path text[], new_value jsonb , create_missing boolean , null_value_treatment text)
+ 
+   jsonb
+   
+If new_value is not null,
+behaves identically to jsonb_set. Otherwise behaves
+according to the value of null_value_treatment
+which must be one of 'raise_exception',
+'use_json_null', 'delete_key', or
+'return_target'. The default is
+'use_json_null'.
+   
+   jsonb_set_lax('[{"f1":1,"f2":null},2,null,3]', '{0,f1}',null)
+ jsonb_set_lax('[{"f1":99,"f2":null},2]', '{0,f3}',null, true, 'return_target')
+ 
+   [{"f1":null,"f2":null},2,null,3]
+ [{"f1": 99, "f2": null}, 2]
+
+   
   


diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2fc3e3ff90..1cb2af1bcd 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1237,6 +1237,15 @@ LANGUAGE INTERNAL
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'jsonb_set';
 
+CREATE OR REPLACE FUNCTION
+  jsonb_set_lax(jsonb_in jsonb, path text[] , replacement jsonb,
+create_if_missing boolean DEFAULT true,
+null_value_treatment text DEFAULT 'use_json_null')
+RETURNS jsonb
+LANGUAGE INTERNAL
+CALLED ON NULL INPUT IMMUTABLE PARALLEL SAFE
+AS 'jsonb_set_lax';
+
 CREATE OR REPLACE FUNCTION
   parse_ident(str text, strict boolean DEFAULT true)
 RETURNS text[]
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index ab5a24a858..b5fa4e7d18 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -4395,6 +4395,70 @@ jsonb_set(PG_FUNCTION_ARGS)
 }
 
 
+/*
+ * SQL function jsonb_set_lax(jsonb, text[], jsonb, boolean, text)
+ */
+Datum
+jsonb_set_lax(PG_FUNCTION_ARGS)
+{
+	/* Jsonb	   *in = PG_GETARG_JSONB_P(0); */
+	/* ArrayType  *path = PG_GETARG_ARRAYTYPE_P(1); */
+	/* Jsonb	  *newval = PG_GETARG_JSONB_P(2); */
+	/* bool		create = PG_GETARG_BOOL(3); */
+	text   *handle_null;
+	char   *handle_val;
+
+	if (PG_ARGISNULL(0) || PG_ARGISNULL(1) || PG_ARGISNULL(3))
+		PG_RETURN_NULL();
+
+	/* could happen if they pass in an explicit NULL */
+	if (PG_ARGISNULL(4))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("need delete_key, return_target, use_json_null, or raise_exception")));
+
+	/* if the new value isn't an SQL NULL just call jsonb_set */
+	if (! PG_ARGISNULL(2))
+		return jsonb_set(fcinfo);
+
+	handle_null = PG_GETARG_TEXT_P(4);
+	handle_val = text_to_cstring(handle_null);
+
+	if (strcmp(handle_val,"raise_exception") == 0)
+	{
+		ereport(ERROR,
+(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("NULL is not allowed"),
+ errdetail("exception raised due to \"null_value_treatment := 'raise_exception'\""),
+ errhint("to avoid, either change the null_value_treatment argument or ensure that an SQL NULL is not used")));
+	}
+	else if (strcmp(handle_val, "use_json_null") == 0)
+	{
+		Datum	  newval;
+
+		newval = DirectFunctionCall1(jsonb_in, CStringGetDatum("null"));
+
+		fcinfo->args[2].value = newval;
+		fcinfo->args[2].isnull = false;
+		return jsonb_set(fcinfo);
+	}
+	else if (strcmp(handle_val, "delete_key") == 0)
+	{
+		return jsonb_delete_path(fcinfo);
+	}
+	else if (strcmp(handle_val, "return_target") == 0)
+	{
+		Jsonb	   *in = PG_GETARG_JSONB_P(0);
+		PG_RETURN_JSONB_P(in);
+	}
+	else
+	{
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("need delete_key, return_target, use_json_null, or raise_exception")));
+	}
+}
+
 /*
  * SQL function jsonb_delete_path(jsonb, text[])
  */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_

Re: UPDATE many records

2020-01-06 Thread Michael Lewis
On Mon, Jan 6, 2020 at 2:34 PM Mark Zellers 
wrote:

> Just out of curiosity, what kind of trigger are you using, a row level
> trigger or a statement level trigger?  If you are using a row level
> trigger, see if you can achieve your requirements using a statement level
> trigger instead.  I’m relatively new to Postgres, so there could be some
> limit that I’m not aware of, but my understanding is that you have access
> to the old and new values of the updated rows in the after statement
> trigger.  It would likely be much more performant to do your operation once
> after the statement is done rather than firing a trigger on every changed
> row.
>

My experience/understanding is that statement level triggers can be a big
performance boost, but only for changing *other* tables and not the table
that the trigger is on since it is *AFTER* only and can't modify NEW record
directly.


Re: UPDATE many records

2020-01-06 Thread Justin
What was the HD wait time? What tool is being used to monitor the server
resources?

It appears, based on this information, that there is a lot more going on than
a simple UPDATE command.

Moving code out of the trigger is probably not going to improve performance,
unless there is a lot of code that does not need to be processed for this
update, or code touching other tables.

Study the trigger, identify what has to run, pull that code out, then disable
the trigger. Move the necessary code to a new function for updating.

On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster 
wrote:

>
> On Jan 6, 2020, at 11:54 AM, Michael Lewis  wrote:
>
> I’m thinking it might be worth it to do a “quick” test on 1,000 or so
>> records (or whatever number can run in a minute or so), watching the
>> processor utilization as it runs. That should give me a better feel for
>> where the bottlenecks may be, and how long the entire update process would
>> take. I’m assuming, of course, that the total time would scale more or less
>> linearly with the number of records.
>>
>
> I think that depends on how your identify and limit the update to those
> 1000 records. If it is using a primary key with specific keys in an array,
> probably close to linear increase because the where clause isn't impactful
> to the overall execution time. If you write a sub-query that is slow, then
> you would need to exclude that from the time. You can always run explain
> analyze on the update and rollback rather than commit.
>
> So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN
> ANALYZE output at https://explain.depesz.com/s/lIYn if it matters),
> during which time I did see a postmaster process consuming 100% CPU. Upping
> the test to 20,819 records took about 16.5 seconds, so that looks
> relatively linear to me. Also, CPU bound. So by my calculations, doing all
> 50M records would take around 10 hours.
>
> One potentially significant note: most of the execution time is spent in a
> trigger. This trigger is actually what’s doing the REAL update that I need
> to happen. If it would make a difference, I could easily pull the trigger
> code out to a separate function that I just call directly (with triggers
> temporarily disabled). My thinking is that calling a function is calling a
> function, and the fact that it is currently called via a trigger rather
> than direct is of little consequence, but I’m willing to be corrected on
> that :-)
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>


Re: UPDATE many records

2020-01-06 Thread Israel Brewster

> On Jan 6, 2020, at 12:49 PM, Justin  wrote:
> 
> What was the HD wait time ?  What tool is being use to monitor the server 
> resources??

No idea on the HD wait time - how would I best monitor that? That said, this 
machine does have NVMe drives, so the speed should be fairly high/wait time 
fairly low. It’s also running as a VM, which could affect things, but is a bit 
of a moot point as far as this update goes. As far as monitoring server 
resources, I was just using top.

> It appears based on this information there is allot more going on than a 
> simple Update command

Depending on your definition of “simple update” of course, very true. As I 
stated in the original message, the actual update value is the result of a 
function. The psql command is a simple update, but the function does a bit of 
stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms 
of stuff per record, which of course is most of the runtime. It is entirely 
possible that the function could be optimized to run more quickly.

> 
> Moving code out of the trigger  probably not  going to improve performance, 
> unless there is allot of code  that does not need to be processed for this 
> update or code touching other tables

One SELECT query on another table to get some values I need to use for the 
calculation. No code that is not needed for the update. Given the nature of 
this bulk update, I *could* make a separate function that simply takes those 
values as parameters, since the same value will be applied to a lot of records. 
I’d just have to be careful about how I applied the update, so rows get 
processed with the correct values. I’m not convinced it would be worth it 
though - might shave a few hours off the total execution time (assuming that 
SELECT is expensive - EXPLAIN ANALYZE shows an index scan, on a table with only 
12,761 rows, which seems to be about as simple as it gets), but I doubt it 
would be enough for me to feel comfortable simply running the update as one 
monolithic unit.

> Study the trigger identify what has to run, pull that code out, then disable 
> the trigger.Move the necessary code to a new function for Updating.. 

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the 
update is great, but this is a one-off (hopefully) event. I want to accomplish 
it as quickly as possible, of course, but at the same time it doesn’t make 
sense to spend a lot of time optimizing every component of the query. The main 
purpose of the question was honestly for my sanity, to reduce the likelihood of 
having it run for several hours only to error out due to bad data or whatever 
and have to start over from the top. Running in parallel simply seemed to be a 
no-brainer option to make it go quicker, assuming CPU bound updating. 
Optimizations that are going to take work are probably not worth it. We can 
wait for the data to be updated.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


> 
> On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster  > wrote:
> 
>> On Jan 6, 2020, at 11:54 AM, Michael Lewis > > wrote:
>> 
>> I’m thinking it might be worth it to do a “quick” test on 1,000 or so 
>> records (or whatever number can run in a minute or so), watching the 
>> processor utilization as it runs. That should give me a better feel for 
>> where the bottlenecks may be, and how long the entire update process would 
>> take. I’m assuming, of course, that the total time would scale more or less 
>> linearly with the number of records.
>> 
>> I think that depends on how your identify and limit the update to those 1000 
>> records. If it is using a primary key with specific keys in an array, 
>> probably close to linear increase because the where clause isn't impactful 
>> to the overall execution time. If you write a sub-query that is slow, then 
>> you would need to exclude that from the time. You can always run explain 
>> analyze on the update and rollback rather than commit.
> So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN 
> ANALYZE output at https://explain.depesz.com/s/lIYn 
>  if it matters), during which time I did 
> see a postmaster process consuming 100% CPU. Upping the test to 20,819 
> records took about 16.5 seconds, so that looks relatively linear to me. Also, 
> CPU bound. So by my calculations, doing all 50M records would take around 10 
> hours.
> 
> One potentially significant note: most of the execution time is spent in a 
> trigger. This trigger is actually what’s doing the REAL update that I need to 
> happen. If it would make a difference, I could easily pull the trigger code 
> out to a separate function that I just call directly (with triggers 
> temporarily disabled). My thinking is that calling a fu

Table Interfaces (Inheritance again. I know. I'm sorry.)

2020-01-06 Thread Miles Elam
Searching through the lists for "inheritance" yields a War and Peace level
of content back. "inheritance harmful" and "inheritance interface" didn't
answer my question.

What about cross-cutting concerns like what might find with interfaces in
many OO languages? For example:

  CREATE TABLE auditable (
id uuid NOT NULL,
actor varchar NOT NULL, -- username
created timestamptz NOT NULL
  );

  CREATE TABLE authable (
id uuid NOT NULL,
actor varchar NOT NULL, -- username
access_tags varchar[]
  );

  CREATE TABLE a (  ...  ) INHERITS (auditable);
  CREATE TABLE b (  ...  ) INHERITS (auditable, authable);
  CREATE TABLE c (  ...  ) INHERITS (authable);

Most of the discussion I've seen regarding inheritance has centered on
cases where a base table with foreign key relationships for specialization
would suffice. In the case above, it serves ONLY to enforce consistent
style and allow for utility functions to share logic due to the
aforementioned consistent naming. Uniqueness is not necessary (but UUIDs
handle that anyway), the tableoid is present for reverse lookups when
needed, defaults, check constraints, et al. are not a concern. But as an
"interface", the hierarchy isn't like one would find in traditional OOP
extends, more like Java's interfaces.

There are also avenues for using event triggers to detect the creation of
tables that conform to certain "interfaces" to automatically add certain
logic. For example:

  CREATE TABLE track_modified (
last_modified timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
  );

  CREATE TABLE x ( ... ) INHERITS (track_modified);
  CREATE TABLE y ( ... ) INHERITS (track_modified);
  CREATE TABLE z ( ... ) INHERITS (track_modified);

Here an event trigger would fire on table creation, see that the table has a
last_modified column, and wire up the expected maintenance of that column a la
MySQL's ON UPDATE CURRENT_TIMESTAMP.
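
A minimal sketch of how that could be wired up (all names are illustrative,
not an existing extension): an event trigger runs at ddl_command_end, checks
whether the new table has a last_modified column, and attaches an ordinary
BEFORE UPDATE trigger to maintain it:

  CREATE FUNCTION set_last_modified() RETURNS trigger AS $$
  BEGIN
    NEW.last_modified := CURRENT_TIMESTAMP;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE FUNCTION attach_last_modified() RETURNS event_trigger AS $$
  DECLARE
    obj record;
  BEGIN
    FOR obj IN SELECT objid FROM pg_event_trigger_ddl_commands()
               WHERE command_tag = 'CREATE TABLE'
    LOOP
      -- only wire up tables that actually expose the column
      IF EXISTS (SELECT 1 FROM pg_attribute
                 WHERE attrelid = obj.objid
                   AND attname = 'last_modified'
                   AND NOT attisdropped) THEN
        EXECUTE format(
          'CREATE TRIGGER trg_set_last_modified BEFORE UPDATE ON %s
             FOR EACH ROW EXECUTE PROCEDURE set_last_modified()',
          obj.objid::regclass);
      END IF;
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;

  CREATE EVENT TRIGGER last_modified_autowire
    ON ddl_command_end
    WHEN TAG IN ('CREATE TABLE')
    EXECUTE PROCEDURE attach_last_modified();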

An empty parent table could even be useful for an event trigger to add a
regular set of policies without repeating yourself (and keeping them all in
sync after the fact).
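
The policy case could reuse the same event-trigger skeleton; for each new
child of the empty parent, the trigger body would just EXECUTE statements
along these lines (new_child, tenant_id, and app.tenant are made up for the
example):

  ALTER TABLE new_child ENABLE ROW LEVEL SECURITY;
  CREATE POLICY tenant_isolation ON new_child
    USING (tenant_id = current_setting('app.tenant')::uuid);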

That said, I've also seen messages dating all the way back to 2005 claiming
that INHERITS is a red-headed stepchild that would be phased out shortly
after table partitioning landed and matured. Is this still the intention?
Is inheritance just considered such an old feature that no one dares remove it
for fear of breaking users? Or have folks come to find good use cases for it,
even though it gets used for far too many things?

I apologize in advance for stirring a pot, encouraging the beating of an
already dead horse, or exhausting with cliches. I have seen the notices on
the wiki about inheritance being useful for temporal logic but not much
else since proper table partitioning was introduced. By and large I agree
with the reasoning, especially with regard to unique keys and their lack of
propagation. It just didn't seem to address the interface model one way or
another.

- Miles Elam


Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, 6 Jan 2020 at 17:38, Israel Brewster  wrote:

>
> Sure. But I feel we are getting a bit off track. Optimizing the runtime of
> the update is great, but this is a one-off (hopefully) event. I want to
> accomplish it as quickly as possible, of course, but at the same time it
> doesn’t make sense to spend a lot of time optimizing every component of the
> query. The main purpose of the question was honestly for my sanity, to
> reduce the likelihood of having it run for several hours only to error out
> due to bad data or whatever and have to start over from the top. Running in
> parallel simply seemed to be a no-brainer option to make it go quicker,
> assuming CPU bound updating. Optimizations that are going to take work are
> probably not worth it. We can wait for the data to be updated.
>

It sounds like you're in a decent place on this, and that you have done a
pretty apropos amount of exploration of the matter.

I was pleased to hear that you have the idempotency of the updates well in
hand, and that the application can cope with the degree of out-of-sync that
things will temporarily be.

The estimate of 10h to update the data doesn't surprise me; that's long
enough that it sure seems tempting to do the work in pieces so that you
don't have your whole set of application data locked for 10h.
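
For example, a sketch only (assuming a single integer primary key named id and
made-up table/column/function names, not the real schema): issuing the update
in keyed ranges means each chunk commits on its own, so a failure partway
through only costs you the current chunk.

  -- each statement, run separately, is its own transaction; size the range
  -- so that one chunk finishes in a few minutes
  UPDATE data_table SET changed_col = recompute(changed_col)
   WHERE id >= 0       AND id < 1000000;
  UPDATE data_table SET changed_col = recompute(changed_col)
   WHERE id >= 1000000 AND id < 2000000;
  -- ... and so on up through the maximum id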

I'd be inclined to call this "enough attention" for a one-off event.

I'll poke at the trigger aspect a wee bit; if the trigger function does a
one-tuple-at-a-time handling of things, so that it fires 50M times, you
might get a substantial speedup by replacing that with an equivalent set
operation that processes a few thousand tuples at a time.  That said, if
you're happy with the process running 10h, it's not worth unpeeling the
extra testing needed to ensure identical end states.
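
As a sketch of what that might look like (data_table, station_params,
recompute, and the column names are all made up, not the actual schema):
instead of the trigger doing its own lookup SELECT once per row, join the
lookup table once and let a single UPDATE do the whole pass, optionally still
restricted to a keyed batch.

  -- one set operation replaces millions of per-row trigger invocations;
  -- the lookup table is joined once instead of being queried per row
  UPDATE data_table AS d
     SET changed_col = recompute(d.raw_col, p.param_a, p.param_b)
    FROM station_params AS p
   WHERE p.station_id = d.station_id
     AND d.id >= 0 AND d.id < 1000000;  -- keep the keyed batching if desired
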
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: UPDATE many records

2020-01-06 Thread Justin
A system monitor I like is glances (you have to have Python installed to run
it). It can run in a command console or present a web interface; it's a very
nice way to get a quick, detailed view of what is going on:
https://nicolargo.github.io/glances/

Just monitoring the system CPU utilization does not give enough information
to say whether the system is CPU bound, disk I/O bound, or network I/O bound.

For simple selects, assuming the data is in shared buffers, the cost is
trivial; if the select has to go to disk to get the data, then it's not
trivial, especially on a big table. One thing that can happen is that the
update and the selects in the trigger push data in and out of shared
buffers, causing the disk to thrash, which EXPLAIN does not report.
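
One server-side way to see whether reads are actually part of the picture is
something along these lines (a sketch; blk_read_time and blk_write_time are
only populated when track_io_timing is on):

  SELECT datname,
         blks_hit,                       -- blocks found in shared buffers
         blks_read,                      -- blocks that had to be read in
         blk_read_time  AS read_ms,      -- ms spent reading
         blk_write_time AS write_ms      -- ms spent writing
    FROM pg_stat_database
   WHERE datname = current_database();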

EXPLAIN does not peer into functions, triggers, etc. For a statement like
"select myfunction(), column_list from mytable" or "update mytable set
column = myfunction()", it will not report what happened inside, just the
total time it took to run. Looking at the explain, the UPDATE itself only
took 0.6 seconds while the trigger took 6.5:

rotate_tilt_data: time=6560.401 calls=9299

I'm a little confused: is this trigger function going to be around after
this update? If yes, it would be worth looking into.

If it's a one-time run, or something that happens every 12 months, who cares:
start the update on Friday night and go in on Saturday to check on it.

On Mon, Jan 6, 2020 at 5:38 PM Israel Brewster 
wrote:

>
> On Jan 6, 2020, at 12:49 PM, Justin  wrote:
>
> What was the HD wait time? What tool is being used to monitor the server
> resources?
>
>
> No idea on the HD wait time - how would I best monitor that? That said,
> this machine does have NVMe drives, so the speed should be fairly high/wait
> time fairly low. It’s also running as a VM, which could affect things, but
> is a bit of a moot point as far as this update goes. As far as monitoring
> server resources, I was just using top.
>
> It appears based on this information there is a lot more going on than a
> simple Update command
>
>
> Depending on your definition of “simple update” of course, very true. As I
> stated in the original message, the actual update value is the result of a
> function. The psql command is a simple update, but the function does a bit
> of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about
> .7 ms of stuff per record, which of course is most of the runtime. It is
> entirely possible that the function could be optimized to run more quickly.
>
>
> Moving code out of the trigger is probably not going to improve
> performance, unless there is a lot of code that does not need to be
> processed for this update or code touching other tables
>
>
> One SELECT query on another table to get some values I need to use for the
> calculation. No code that is not needed for the update. Given the nature of
> this bulk update, I *could* make a separate function that simply takes
> those values as parameters, since the same value will be applied to a lot
> of records. I’d just have to be careful about how I applied the update, so
> rows get processed with the correct values. I’m not convinced it would be
> worth it though - might shave a few hours off the total execution time
> (assuming that SELECT is expensive - EXPLAIN ANALYZE shows an index scan, on
> a table with only 12,761 rows, which seems to be about as simple as it
> gets), but I doubt it would be enough for me to feel comfortable simply
> running the update as one monolithic unit.
>
> Study the trigger, identify what has to run, pull that code out, then
> disable the trigger. Move the necessary code to a new function for
> updating.
>
>
> Sure. But I feel we are getting a bit off track. Optimizing the runtime of
> the update is great, but this is a one-off (hopefully) event. I want to
> accomplish it as quickly as possible, of course, but at the same time it
> doesn’t make sense to spend a lot of time optimizing every component of the
> query. The main purpose of the question was honestly for my sanity, to
> reduce the likelihood of having it run for several hours only to error out
> due to bad data or whatever and have to start over from the top. Running in
> parallel simply seemed to be a no-brainer option to make it go quicker,
> assuming CPU bound updating. Optimizations that are going to take work are
> probably not worth it. We can wait for the data to be updated.
>
> Thanks again!
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>
>
> On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster 
> wrote:
>
>>
>> On Jan 6, 2020, at 11:54 AM, Michael Lewis  wrote:
>>
>> I’m thinking it might be worth it to do a “quick” test on 1,000 or so
>>> records (or whatever number can run in a minute or so), watching the
>>> processor utilization as it runs. That should give me a better feel for
>>> where the bottlenecks may be, and how long the entire update process would
>>> take. I'm assuming, of course, that the total time would scale more or less
>>> linearly with the number of records.

pg_repack failure

2020-01-06 Thread Nagaraj Raj
Hello,

When I tried to repack my bloated table an error occurred:

FATAL: terminating connection due to idle-in-transaction timeout
ERROR: query failed: SSL connection has been closed unexpectedly
DETAIL: query was: SAVEPOINT repack_sp1

and this error occurs only on large tables; the table currently being
repacked is about 700 GB in size.

/pg_repack --version
pg_repack 1.4.3

DB version: PostgreSQL 9.6.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 
4.9.3, 64-bit



Appreciate the assistance.
Related GitHub issue: "FATAL: terminating connection due to idle-in-transaction
timeout" · Issue #222 · reorg/pg_repack



Regards, Nagaraj

Re: pg_repack failure

2020-01-06 Thread Michael Paquier
On Tue, Jan 07, 2020 at 06:15:09AM +, Nagaraj Raj wrote:
> and this error is occurring in large tables only, and current table
> size which is running about 700GB
> 
> /pg_repack --version
> pg_repack 1.4.3
> 
> DB version: PostgreSQL 9.6.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 
> 4.9.3, 64-bit

I think that you had better report that directly to the maintainers of
the tool here:
https://github.com/reorg/pg_repack/
--
Michael

