Accept recovery conflict interrupt on blocked writing

Anthonin Bonnefoy Mon, 13 Jan 2025 02:31:10 -0800

Hi,

I have replicas that have regular transient bursts of replay lag (>5
minutes). The events have the following symptoms:
- Replicas are using physical replication slot and hot_standby_feedback
- The lag recovers by itself after at most 15 minutes
- During the same timeframe, there's a query stuck in ClientWrite on
the replica for ~15 minutes (despite the 30s statement_timeout). The
matching client was terminated at the beginning when the server
started sending results and when the server process died, the replay
lag started to recover.


The 15 minutes timeout matches the default linux TCP retransmission
timeout[1]. A similar situation can be triggered by using psql
pipelining patch[2].

# On the primary, generate constant updates:
echo "\set aid random(1, 100000 * :scale)
UPDATE pgbench_accounts SET bid=bid+1 WHERE aid=:aid;" > update.sql
pgbench -f update.sql -T900

-- On a replica:
\startpipeline
-- At least 2 select are needed to completely saturate socket buffers
select * from pgbench_accounts \bind \g
select * from pgbench_accounts \bind \g
-- Flush the commands to the server
\flush

After that, the queries are sent to the server but the client doesn't
consume the results and the backend process should be stuck in a
ClientWrite state. Eventually, usually after ~10s, WAL replay becomes
completely blocked (sometimes I need to redo the pipeline query if
nothing happens). Looking at the backtrace, the recovery process is
blocked on ResolveRecoveryConflictWithBufferPin. The conflicting
buffer pin is held by the pipelined query, currently blocked trying to
write the result to the socket which is completely saturated. During
this time, all interrupts are ignored and WAL replay won't be able to
recover until the socket becomes either writable or ends with an
error.

The same situation happened on my instances where a backend process
was sending results to a client that died without sending a FIN or a
RST. The only way for this process to die is to reach the TCP
retransmission timeout after 15m. During this time, it can conflict
with the recovery as it holds a buffer pin, possibly blocking the
recovery for the whole duration.

To avoid blocking recovery for an extended period of time, this patch
changes client write interrupts by handling recovery conflict
interrupts instead of ignoring them. Since the interrupt happens while
we're likely to have partially written results on the socket, there's
no easy way to keep protocol sync so the session needs to be
terminated.

Setting tcp_user_timeout is also a possible mitigation for users, but
my assumption of how conflict recovery is done is that it is not
desirable to block recovery for an extended period of time and it is
fine to be aggressive when the standby delay is exceeded.

[1]: https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html
[2]: https://commitfest.postgresql.org/51/5407/

Regards,
Anthonin Bonnefoy

v01-0001-Accept-recovery-conflict-interrupt-on-blocked-wr.patch
Description: Binary data

Accept recovery conflict interrupt on blocked writing

Reply via email to