Hi,

A standby does not start the walreceiver process until the startup process finishes WAL replay. The more WAL there is to replay, the longer the delay in starting streaming replication. If the replication connection is temporarily disconnected, this delay becomes a major problem, and we are proposing a solution to avoid it.
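To give a rough sense of scale, here is a toy back-of-the-envelope model in Python. It is not PostgreSQL code: the function name, the 50 MiB/s replay rate and the backlog sizes are assumptions picked only for illustration, and it ignores connection setup time.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def streaming_resume_delay(backlog_bytes, replay_rate, receiver_waits_for_replay):
    """Seconds from standby start until walreceiver can begin streaming.

    Toy model (hypothetical, for illustration only):
    - if walreceiver waits for replay, the delay is the time the single
      startup process needs to replay the whole local WAL backlog;
    - if walreceiver is started before replay begins, the delay is zero.
    """
    if replay_rate <= 0:
        raise ValueError("replay_rate must be positive")
    if receiver_waits_for_replay:
        return backlog_bytes / replay_rate
    return 0.0


if __name__ == "__main__":
    replay_rate = 50 * MiB  # assumed replay throughput, bytes per second
    for backlog in (1 * GiB, 10 * GiB):
        waits = streaming_resume_delay(backlog, replay_rate, True)
        early = streaming_resume_delay(backlog, replay_rate, False)
        print(f"backlog={backlog // GiB} GiB: "
              f"receiver after replay={waits:.1f}s, receiver started first={early:.1f}s")
```

In this model the delay grows linearly with the amount of WAL waiting to be replayed, which is the behavior described above.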
WAL replay is likely to fall behind when the master is processing a write-heavy workload, because WAL is generated by concurrently running backends on the master, while only one startup process on the standby replays WAL records in sequence as new WAL is received from the master.

The replication connection between walsender and walreceiver may break for reasons such as a transient network issue, the standby going through a restart, etc. The delay in resuming the replication connection leads to a lack of high availability: only one copy of the WAL is available during this period.

The problem worsens when replication is configured to be synchronous. Commits on the master must wait until WAL replay is finished on the standby, walreceiver is then started, and it confirms flush of WAL up to the commit LSN. If the synchronous_commit GUC is set to remote_write, this behavior is equivalent to tacitly changing it to remote_apply until the replication connection is re-established!

Has anyone encountered such a problem with streaming replication?

We propose to address this by starting walreceiver without waiting for the startup process to finish replaying WAL. Please see the attached patchset. It can be summarized as follows:

0001 - TAP test to demonstrate the problem.

0002 - The standby startup sequence is changed such that walreceiver is started by the startup process before it begins to replay WAL.

0003 - Postmaster starts walreceiver if it finds that a walreceiver process is no longer running and the state indicates that it is operating as a standby.

This is a POC. We are looking for early feedback on whether the problem is worth solving and whether it makes sense to solve it along this route.

Hao and Asim
0001-Test-that-replay-of-WAL-logs-on-standby-does-not-aff.patch
0003-Start-WAL-receiver-when-it-is-found-not-running.patch
0002-Start-WAL-receiver-before-startup-process-replays-ex.patch