On Sat, Apr 30, 2022 at 6:19 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
> <satyanarlapu...@gmail.com> wrote:
> >
> > Hi Hackers,
> >
> > When the standby couldn't connect to the primary it switches the XLog
> > source from streaming to archive and continues in that state until it can
> > get the WAL from the archive location. On a server with high WAL activity,
> > typically getting the WAL from the archive is slower than streaming it from
> > the primary and couldn't exit from that state. This not only increases the
> > lag on the standby but also adversely impacts the primary as the WAL gets
> > accumulated, and vacuum is not able to collect the dead tuples. DBAs as a
> > mitigation can however remove/advance the slot or remove the
> > restore_command on the standby but this is a manual work I am trying to
> > avoid. I would like to propose the following, please let me know your
> > thoughts.
> >
> > Automatically attempt to switch the source from Archive to streaming when
> > the primary_conninfo is set after replaying 'N' wal segment governed by the
> > GUC retry_primary_conn_after_wal_segments
> > when retry_primary_conn_after_wal_segments is set to -1 then the feature
> > is disabled
> > When the retry attempt fails, then switch back to the archive
>
> I've gone through the state machine in WaitForWALToBecomeAvailable and
> I understand it this way: failed to receive WAL records from the
> primary causes the current source to switch to archive and the standby
> continues to get WAL records from archive location unless some failure
> occurs there the current source is never going to switch back to
> stream. Given the fact that getting WAL from archive location causes
> delay in production environments, we miss to take the advantage of the
> reconnection to primary after previous failed attempt.
>
> So basically, we try to attempt to switch to streaming from archive
> (even though fetching from archive can succeed) after a certain amount
> of time or WAL segments. I prefer timing-based switch to streaming
> from archive instead of after a number of WAL segments fetched from
> archive. Right now, wal_retrieve_retry_interval is being used to wait
> before switching to archive after failed attempt from streaming, IMO,
> a similar GUC (that gets set once the source switched from streaming
> to archive and on timeout it switches to streaming again) can be used
> to switch from archive to streaming after the specified amount of
> time.
>
> Thoughts?
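To make the timing-based idea quoted above concrete, here is a minimal sketch of the source-selection decision. It is only an illustration, not the code in the attached patch and not the actual logic of WaitForWALToBecomeAvailable: the WalSource enum, the next_wal_source() function and all of its parameters are hypothetical names, and the caller is assumed to track when the source last switched to archive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the WAL sources; not PostgreSQL's own enum. */
typedef enum
{
    SRC_ARCHIVE,
    SRC_STREAM
} WalSource;

/*
 * Pick the source to try next (illustrative only).
 *
 * cur               source currently in use
 * fetch_ok          whether the last attempt to get WAL from cur succeeded
 * now_ms            current time, in milliseconds
 * switched_at_ms    time at which we last switched to the archive
 * retry_interval_ms how long to stay on the archive before retrying
 *                   streaming; a negative value disables the timed retry
 */
WalSource
next_wal_source(WalSource cur, bool fetch_ok,
                int64_t now_ms, int64_t switched_at_ms,
                int64_t retry_interval_ms)
{
    assert(cur == SRC_ARCHIVE || cur == SRC_STREAM);

    /* Streaming: stay while it works, fall back to the archive on failure. */
    if (cur == SRC_STREAM)
        return fetch_ok ? SRC_STREAM : SRC_ARCHIVE;

    /* Archive failed: as described above, this already leads back to streaming. */
    if (!fetch_ok)
        return SRC_STREAM;

    /* Proposed addition: archive still works, but the timeout has expired. */
    if (retry_interval_ms >= 0 &&
        now_ms - switched_at_ms >= retry_interval_ms)
        return SRC_STREAM;

    return SRC_ARCHIVE;
}
```

With a 5000 ms interval, a standby that switched to the archive at t = 0 would keep restoring from the archive at t = 4999 and attempt streaming again at t = 5000, even though the archive fetch succeeded. If that streaming attempt fails, the first branch sends it back to the archive.
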
Here's a v1 patch that I've come up with. Right now I'm using the existing GUC wal_retrieve_retry_interval to switch to stream mode from archive mode, rather than switching only after a failure to get WAL from the archive. If the approach looks okay, I can add tests, update the docs, and add a new GUC to control this behaviour. I'm open to thoughts and ideas here.

Regards,
Bharath Rupireddy.
v1-0001-Switch-to-stream-mode-from-archive-occasionally.patch