On Sat, Apr 30, 2022 at 6:19 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
> <satyanarlapu...@gmail.com> wrote:
> >
> > Hi Hackers,
> >
> > When the standby cannot connect to the primary, it switches the XLog
> > source from streaming to archive and stays in that state for as long as
> > it can get WAL from the archive location. On a server with high WAL
> > activity, fetching WAL from the archive is typically slower than
> > streaming it from the primary, so the standby may never exit that state.
> > This not only increases the lag on the standby but also adversely impacts
> > the primary, because WAL accumulates and vacuum is not able to collect
> > the dead tuples. As a mitigation, DBAs can remove/advance the slot or
> > remove the restore_command on the standby, but this is manual work that
> > I am trying to avoid. I would like to propose the following; please let
> > me know your thoughts.
> >
> > Automatically attempt to switch the source from archive to streaming,
> > when primary_conninfo is set, after replaying 'N' WAL segments, governed
> > by the GUC retry_primary_conn_after_wal_segments.
> > When retry_primary_conn_after_wal_segments is set to -1, the feature
> > is disabled.
> > When the retry attempt fails, switch back to the archive.
>
> I've gone through the state machine in WaitForWALToBecomeAvailable and
> I understand it this way: a failure to receive WAL records from the
> primary causes the current source to switch to archive, and the standby
> continues to get WAL records from the archive location; unless some
> failure occurs there, the current source is never going to switch back
> to stream. Given that getting WAL from the archive location causes
> delay in production environments, we fail to take advantage of
> reconnecting to the primary after a previous failed attempt.
>
> So basically, we attempt to switch to streaming from archive
> (even though fetching from archive can succeed) after a certain amount
> of time or number of WAL segments. I prefer a timing-based switch to
> streaming from archive over a switch after a number of WAL segments
> fetched from archive. Right now, wal_retrieve_retry_interval is used to
> wait before switching to archive after a failed attempt at streaming.
> IMO, a similar GUC (whose timer starts once the source switches from
> streaming to archive and, on timeout, switches it to streaming again)
> can be used to switch from archive to streaming after the specified
> amount of time.
>
> Thoughts?

Here's a v1 patch that I've come up with. Right now I'm using the
existing GUC wal_retrieve_retry_interval to switch from archive mode to
stream mode, as opposed to switching only after a failure to get WAL
from the archive. If the approach looks okay, I can add tests, update
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.
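
To make the idea easier to discuss, here is a rough, self-contained
sketch of the decision being described. It is not code from the patch
and not the real WaitForWALToBecomeAvailable logic: WalSource,
RecoveryState and next_wal_source are hypothetical names made up for
illustration, there are only two sources, and the interval is passed in
as a plain millisecond value standing in for whichever GUC ends up
controlling this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the WAL source; illustration only. */
typedef enum
{
    SRC_ARCHIVE,
    SRC_STREAM
} WalSource;

typedef struct
{
    WalSource   source;             /* source currently being read from */
    int64_t     archive_since_ms;   /* time of the last switch to archive */
} RecoveryState;

/*
 * Pick the source for the next fetch.
 *
 * fetch_ok says whether the last fetch from the current source succeeded.
 * retry_interval_ms is an assumed stand-in for the controlling GUC.
 */
WalSource
next_wal_source(RecoveryState *st, bool fetch_ok,
                int64_t now_ms, int64_t retry_interval_ms)
{
    if (st->source == SRC_STREAM)
    {
        /* Streaming failed: fall back to archive and start the timer. */
        if (!fetch_ok)
        {
            st->source = SRC_ARCHIVE;
            st->archive_since_ms = now_ms;
        }
        return st->source;
    }

    /*
     * Reading from archive. Without the proposed change only a failed
     * fetch would send us back to streaming; the sketch also switches
     * once the interval has elapsed, even if archive fetches succeed.
     */
    if (!fetch_ok || now_ms - st->archive_since_ms >= retry_interval_ms)
        st->source = SRC_STREAM;

    return st->source;
}
```

With a 5000 ms interval, a standby that fell back to archive at t=1000
would stay there at t=3000 and try streaming again at t=6000; if that
attempt fails, it returns to archive and the timer restarts.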

Regards,
Bharath Rupireddy.

Attachment: v1-0001-Switch-to-stream-mode-from-archive-occasionally.patch