Use of max_slot_wal_keep_size parameter

2024-03-26 Thread Don Seiler
Good morning,

With our recent upgrade to PG 15, we've put the max_slot_wal_keep_size into
place, setting it about 3% lower than the size of the volume dedicated to
pg_wal. However last night we had an incident where the volume filled up
while we were performing a massive insert from one table into another.

I don't believe the replica was lagging, and we have pgbackrest configured
for async archiving, which I assume was fine. It seemed to just be a matter
of PG not cleaning up the WAL. Our understanding was that
max_slot_wal_keep_size would prevent this but perhaps it only deals with
situations where the replication slot lag is a factor?

Don.

-- 
Don Seiler
www.seiler.us


Re: Use of max_slot_wal_keep_size parameter

2024-03-26 Thread Tom Lane
Don Seiler  writes:
> With our recent upgrade to PG 15, we've put the max_slot_wal_keep_size into
> place, setting it about 3% lower than the size of the volume dedicated to
> pg_wal. However last night we had an incident where the volume filled up
> while we were performing a massive insert from one table into another.

> I don't believe the replica was lagging, and we have pgbackrest configured
> for async archiving, which I assume was fine. It seemed to just be a matter
> of PG not cleaning up the WAL. Our understanding was that
> max_slot_wal_keep_size would prevent this but perhaps it only deals with
> situations where the replication slot lag is a factor?

My immediate reaction is that 3% is a mighty small margin for error.
I don't know exactly how max_slot_wal_keep_size is enforced these
days, but in the past restrictions like that were implemented by
deciding during a checkpoint whether to unlink a no-longer-needed WAL
file (if we had too much WAL) or rename/recycle it to become a future
WAL segment (if not).  So you could overshoot the specified target by
more or less the amount of WAL that could be emitted between two
checkpoints.  Perhaps it's tighter nowadays, but I really doubt that
it's exact-to-the-kilobyte-at-all-times.

regards, tom lane




Re: Use of max_slot_wal_keep_size parameter

2024-03-26 Thread Don Seiler
On Tue, Mar 26, 2024 at 9:09 AM Tom Lane  wrote:

> My immediate reaction is that 3% is a mighty small margin for error.
> I don't know exactly how max_slot_wal_keep_size is enforced these
> days, but in the past restrictions like that were implemented by
> deciding during a checkpoint whether to unlink a no-longer-needed WAL
> file (if we had too much WAL) or rename/recycle it to become a future
> WAL segment (if not).  So you could overshoot the specified target by
> more or less the amount of WAL that could be emitted between two
> checkpoints.  Perhaps it's tighter nowadays, but I really doubt that
> it's exact-to-the-kilobyte-at-all-times.
>

In this case, the total volume size was 60GB and we had the parameter set
to 58GB but I imagine that can still be overwhelmed quickly. Maybe we
should target a 20% buffer zone? We have wal_keep_size defaulted at 0.

Thanks,
Don.

-- 
Don Seiler
www.seiler.us


Active sessions does not terminated due to statement_timeout

2024-03-26 Thread Ц

Greetings!
I’ve faced with strange behavior when I see a lot of active sessions started 
hours ago while statement_timeout = '30min'.
All of them are fetching from cursors.
 
Typical session looks like:
backend_start    | 2024-03-26 14:34:20.552594+03
xact_start       | 2024-03-26 14:34:54.974628+03
query_start     | 2024-03-26 14:35:02.024133+03
state_change | 2024-03-26 14:35:02.024134+03
wait_event_type | Client
wait_event      | ClientWrite
state           | active
backend_xid   | 23240392
backend_xmin    | 23226474
query           | fetch all from ""
backend_type | client backend
 
 
They are accumulating up to tens by the end of the day with all negative 
impacts on performance.
Initially I thought that clients already died but due to network issues 
database considers them to be alive. So I set tcp_keepalive GUCs to nonzero 
values. Without success.
Then I checked connections from the app server side and found them in 
ESTABLISHED state.
It's certainly an application fault and it should not hold cursor forever...but
 
Is the any GUC parameters to fight with such «clients»?
 
 

Re: Use of max_slot_wal_keep_size parameter

2024-03-26 Thread Tom Lane
Don Seiler  writes:
> On Tue, Mar 26, 2024 at 9:09 AM Tom Lane  wrote:
>> ... So you could overshoot the specified target by
>> more or less the amount of WAL that could be emitted between two
>> checkpoints.  Perhaps it's tighter nowadays, but I really doubt that
>> it's exact-to-the-kilobyte-at-all-times.

> In this case, the total volume size was 60GB and we had the parameter set
> to 58GB but I imagine that can still be overwhelmed quickly. Maybe we
> should target a 20% buffer zone? We have wal_keep_size defaulted at 0.

Can you measure your WAL emission rate?  What you want I think is some
small multiple of the max amount of WAL emitted between checkpoints.

regards, tom lane




Re: Active sessions does not terminated due to statement_timeout

2024-03-26 Thread Magnus Hagander
On Tue, Mar 26, 2024 at 3:19 PM Ц  wrote:

> Greetings!
> I’ve faced with strange behavior when I see a lot of active sessions
> started hours ago while statement_timeout = '30min'.
> All of them are fetching from cursors.
>
> Typical session looks like:
> backend_start| 2024-03-26 14:34:20.552594+03
> xact_start   | 2024-03-26 14:34:54.974628+03
> query_start | 2024-03-26 14:35:02.024133+03
> state_change | 2024-03-26 14:35:02.024134+03
> wait_event_type | Client
> wait_event  | ClientWrite
> state   | active
> backend_xid   | 23240392
> backend_xmin| 23226474
> query   | fetch all from ""
> backend_type | client backend
>
>
> They are accumulating up to tens by the end of the day with all negative
> impacts on performance.
> Initially I thought that clients already died but due to network issues
> database considers them to be alive. So I set tcp_keepalive GUCs to nonzero
> values. Without success.
> Then I checked connections from the app server side and found them in
> ESTABLISHED state.
> It's certainly an application fault and it should not hold cursor
> forever...but
>
> Is the any GUC parameters to fight with such «clients»?
>
>
I wonder if this might be the bug I saw in
https://www.postgresql.org/message-id/cabuevexbm_va9+iw0kgvuzbrlduz8vnl2wo2ig7jqqdgsy8...@mail.gmail.com
-- basically that there's some path when we're in ClientWrite that it
doesn't check for interrupts properly. I've unfortunately not had time to
dig into that one anymore.

What version of PostgreSQL and what platform are you on?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


recovery.signal not being removed when recovery complete

2024-03-26 Thread Isaac Morland
I use a script to restore a backup to create a testing copy of the
database. I set the following in postgresql.auto.conf:

recovery_target = 'immediate'
recovery_target_action = 'promote'

In the logs I get "recovery stopping after reaching consistency" then a
moment later "database system is ready to accept read-only connections",
then some entries about restoring log files, then "database system is ready
to accept connections".

I am able to make changes (e.g. CREATE TABLE), yet recovery.signal is still
present. My understanding is that recovery.signal should be removed when
recovery is finished (i.e., more or less when "database system is ready to
accept connections" is logged?), unless recovery_target_action is set to
'shutdown'.

Any ideas? Even just confirming/denying I understand the above correctly
would help.


User roles for gathering performance metrics data

2024-03-26 Thread Siraj G
Greetings!

I am from Oracle background. In Oracle, we grant select_catalog_role or
select any dictionary role to users who want to study performance data. I
am trying to get similar information on the roles or privileges in PgSQL
that we might want to request to investigate the performance problems?

Thanks
Siraj


Re: User roles for gathering performance metrics data

2024-03-26 Thread Christophe Pettus



> On Mar 26, 2024, at 22:30, Siraj G  wrote:
> I am from Oracle background. In Oracle, we grant select_catalog_role or 
> select any dictionary role to users who want to study performance data. I am 
> trying to get similar information on the roles or privileges in PgSQL that we 
> might want to request to investigate the performance problems?

You probably want pg_monitor:

https://www.postgresql.org/docs/current/predefined-roles.html



Re: User roles for gathering performance metrics data

2024-03-26 Thread David G. Johnston
On Tuesday, March 26, 2024, Siraj G  wrote:
>
>
> I am from Oracle background. In Oracle, we grant select_catalog_role or
> select any dictionary role to users who want to study performance data. I
> am trying to get similar information on the roles or privileges in PgSQL
> that we might want to request to investigate the performance problems?
>

PostgreSQL roles are defined here:

https://www.postgresql.org/docs/current/predefined-roles.html

David J.


Re: Active sessions does not terminated due to statement_timeout

2024-03-26 Thread Ц

Magnus,
 
PostgreSQL 14.6 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 
10.2.1 20210110, 64-bit
 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"  
It is docker container if it matters and yep, possibly slow and unreliable 
network also is an issue.
 
So you're assuming the statement_timeout parameter should have worked fo that 
and this is a bug?
 
  
>Вторник, 26 марта 2024, 17:43 +03:00 от Magnus Hagander :
> 
>   
>On Tue, Mar 26, 2024 at 3:19PM Ц < pf...@mail.ru > wrote:
>>Greetings!
>>I’ve faced with strange behavior when I see a lot of active sessions started 
>>hours ago while statement_timeout = '30min'.
>>All of them are fetching from cursors.
>> 
>>Typical session looks like:
>>backend_start    | 2024-03-26 14:34:20.552594+03
>>xact_start       | 2024-03-26 14:34:54.974628+03
>>query_start     | 2024-03-26 14:35:02.024133+03
>>state_change | 2024-03-26 14:35:02.024134+03
>>wait_event_type | Client
>>wait_event      | ClientWrite
>>state           | active
>>backend_xid   | 23240392
>>backend_xmin    | 23226474
>>query           | fetch all from ""
>>backend_type | client backend
>> 
>> 
>>They are accumulating up to tens by the end of the day with all negative 
>>impacts on performance.
>>Initially I thought that clients already died but due to network issues 
>>database considers them to be alive. So I set tcp_keepalive GUCs to nonzero 
>>values. Without success.
>>Then I checked connections from the app server side and found them in 
>>ESTABLISHED state.
>>It's certainly an application fault and it should not hold cursor 
>>forever...but
>> 
>>Is the any GUC parameters to fight with such «clients»?
>> 
> 
>I wonder if this might be the bug I saw in  
>https://www.postgresql.org/message-id/cabuevexbm_va9+iw0kgvuzbrlduz8vnl2wo2ig7jqqdgsy8...@mail.gmail.com
> -- basically that there's some path when we're in ClientWrite that it doesn't 
>check for interrupts properly. I've unfortunately not had time to dig into 
>that one anymore.
> 
>What version of PostgreSQL and what platform are you on? 
>  -- 
> Magnus Hagander
> Me:  https://www.hagander.net/
> Work:  https://www.redpill-linpro.com/