Hi,

I have replicas that have regular transient bursts of replay lag (>5 minutes). The events have the following symptoms:
- Replicas are using physical replication slot and hot_standby_feedback
- The lag recovers by itself after at most 15 minutes
- During the same timeframe, there's a query stuck in ClientWrite on the replica for ~15 minutes (despite the 30s statement_timeout). The matching client was terminated at the beginning when the server started sending results and when the server process died, the replay lag started to recover.

The 15 minutes timeout matches the default Linux TCP retransmission timeout[1]. A similar situation can be triggered by using psql pipelining patch[2].

    # On the primary, generate constant updates:
    echo "\set aid random(1, 100000 * :scale)
    UPDATE pgbench_accounts SET bid=bid+1 WHERE aid=:aid;" > update.sql
    pgbench -f update.sql -T900

    -- On a replica:
    \startpipeline
    -- At least 2 select are needed to completely saturate socket buffers
    select * from pgbench_accounts \bind \g
    select * from pgbench_accounts \bind \g
    -- Flush the commands to the server
    \flush

After that, the queries are sent to the server but the client doesn't consume the results and the backend process should be stuck in a ClientWrite state. Eventually, usually after ~10s, WAL replay becomes completely blocked (sometimes I need to redo the pipeline query if nothing happens). Looking at the backtrace, the recovery process is blocked on ResolveRecoveryConflictWithBufferPin. The conflicting buffer pin is held by the pipelined query, currently blocked trying to write the result to the socket which is completely saturated. During this time, all interrupts are ignored and WAL replay won't be able to recover until the socket becomes either writable or ends with an error.

The same situation happened on my instances where a backend process was sending results to a client that died without sending a FIN or a RST. The only way for this process to die is to reach the TCP retransmission timeout after 15m. During this time, it can conflict with the recovery as it holds a buffer pin, possibly blocking the recovery for the whole duration.

To avoid blocking recovery for an extended period of time, this patch changes client write interrupts by handling recovery conflict interrupts instead of ignoring them. Since the interrupt happens while we're likely to have partially written results on the socket, there's no easy way to keep protocol sync so the session needs to be terminated.
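
As a rough sanity check on the ~15 minute figure, here is a small sketch of the retransmission backoff arithmetic. It assumes the usual Linux defaults (tcp_retries2 = 15, a 200 ms minimum RTO, a 120 s maximum RTO) and that the RTO starts at the minimum and doubles on each retry. The function name is only for illustration, and the real RTO depends on the measured RTT, so the result is a lower bound rather than an exact value.

```shell
#!/bin/sh
# Sketch: total time before Linux gives up retransmitting on a dead peer.
# Assumed defaults (not taken from the thread): RTO min 200 ms,
# RTO max 120000 ms, net.ipv4.tcp_retries2 = 15.
# tcp_retrans_timeout_ms is a hypothetical helper name.
tcp_retrans_timeout_ms() {
    rto_ms=$1
    rto_max_ms=$2
    retries=$3
    total_ms=0
    i=0
    # One wait for the original transmission plus one per retransmission.
    while [ "$i" -le "$retries" ]; do
        total_ms=$((total_ms + rto_ms))
        # Exponential backoff, capped at the maximum RTO.
        rto_ms=$((rto_ms * 2))
        if [ "$rto_ms" -gt "$rto_max_ms" ]; then
            rto_ms=$rto_max_ms
        fi
        i=$((i + 1))
    done
    echo "$total_ms"
}

tcp_retrans_timeout_ms 200 120000 15
```

With those assumed defaults this prints 924600, i.e. about 15.4 minutes, which lines up with how long the backend stays stuck in ClientWrite.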

Setting tcp_user_timeout is also a possible mitigation for users, but my assumption of how conflict recovery is done is that it is not desirable to block recovery for an extended period of time and it is fine to be aggressive when the standby delay is exceeded.

[1]: https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html
[2]: https://commitfest.postgresql.org/51/5407/

Regards,
Anthonin Bonnefoy

v01-0001-Accept-recovery-conflict-interrupt-on-blocked-wr.patch