Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
important for promptly updating pg_stat_replication. When caught up, we
should impose that logic before every sleep. The one-line fix is to sleep in
WalSndLoop() only when pq_is_send_pending(), not when caught up.
On my regular development machine, src/test/subscription/t/001_rep_changes.pl
stalls for ~10s at this wait_for_catchup:
$node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");
# Restart the publisher and check the state of the subscriber which
# should be in a streaming state after catching up.
$node_publisher->stop('fast');
$node_publisher->start;
$node_publisher->wait_for_catchup('tap_sub');
That snippet emits three notable physical WAL records. There's a
Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and
an XLOG/FPI_FOR_HINT.
The buildfarm has stalled there, but it happens probably less than half the
time. Examples[1] showing the stall:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check
Here's the most-relevant walsender call tree:
WalSndLoop
XLogSendLogical (caller invokes once per loop iteration, via send_data
callback)
XLogReadRecord (caller invokes once)
ReadPageInternal (caller invokes twice in this test; more calls are
possible)
logical_read_xlog_page (caller skips when page is same as last call,
else invokes 1-2 times via state->read_page() callback, registered in
StartLogicalReplication)
WalSndWaitForWal (caller invokes once; has fast path)
The cause is a race involving the flow of reply messages (send_feedback()
messages) from logical apply worker to walsender. Here are two sequencing
patterns; the more-indented parts are what differ. Stalling pattern:
sender reads Transaction/COMMIT and sends the changes
receiver applies the changes
receiver send_feedback() reports progress up to Transaction/COMMIT
sender accepts the report
sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are
no-ops for logical rep
sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
receiver wal_receiver_status_interval elapses; receiver reports progress up
to Transaction/COMMIT
sender wakes up, accepts the report
sender calls WalSndWaitForWal(), which sends a keepalive due to
"MyWalSnd->write < sentPtr"
receiver gets keepalive, send_feedback() reports progress up to
XLOG/FPI_FOR_HINT
Non-stalling pattern (more prevalent with lower machine performance):
sender reads Transaction/COMMIT and sends the changes
sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are
no-ops for logical rep
sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
receiver applies the changes
receiver send_feedback() reports progress up to Transaction/COMMIT
sender wakes up, accepts the report
sender calls WalSndWaitForWal(), which sends a keepalive due to
"MyWalSnd->write < sentPtr"
receiver gets keepalive, send_feedback() reports progress up to
XLOG/FPI_FOR_HINT
The fix is to test "MyWalSnd->write < sentPtr" before more sleeps. The test
is unnecessary when sleeping due to pq_is_send_pending(); in that case, the
receiver is not idle and will reply before idling. I changed WalSndLoop() to
sleep only for pq_is_send_pending(). For all other sleep reasons, the sleep
will happen in WalSndWaitForWal(). Attached. I don't know whether this is
important outside of testing scenarios. I lean against back-patching, but I
will back-patch if someone thinks this qualifies as a performance bug.
Thanks,
nm
[1] I spot-checked only my animals, since I wanted to experiment on an
affected animal.
Author: Noah Misch <[email protected]>
Commit: Noah Misch <[email protected]>
When WalSndCaughtUp, sleep only in WalSndWaitForWal().
Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
< sentPtr. That is important in logical replication. When the latest
physical LSN yields no logical replication messages (a common case),
that keepalive elicits a reply, and processing the reply updates
pg_stat_replication.replay_lsn. WalSndLoop() lacks that; when
WalSndLoop() slept, replay_lsn advancement could stall until
wal_receiver_status_interval elapsed. This sometimes stalled
src/test/subscription/t/001_rep_changes.pl for up to 10s.
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/backend/replication/walsender.c
b/src/backend/replication/walsender.c
index 9e56115..d9c6359 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,8 +1428,10 @@ WalSndWaitForWal(XLogRecPtr loc)
/*
* We only send regular messages to the client for full decoded
* transactions, but a synchronous replication and walsender
shutdown
- * possibly are waiting for a later location. So we send pings
- * containing the flush location every now and then.
+ * possibly are waiting for a later location. So, before
sleeping, we
+ * send a ping containing the flush location. If the receiver is
+ * otherwise idle, this keepalive will trigger a reply.
Processing the
+ * reply will update these MyWalSnd locations.
*/
if (MyWalSnd->flush < sentPtr &&
MyWalSnd->write < sentPtr &&
@@ -2314,20 +2316,16 @@ WalSndLoop(WalSndSendDataCallback send_data)
WalSndKeepaliveIfNecessary();
/*
- * We don't block if not caught up, unless there is unsent data
- * pending in which case we'd better block until the socket is
- * write-ready. This test is only needed for the case where the
- * send_data callback handled a subset of the available data
but then
- * pq_flush_if_writable flushed it all --- we should
immediately try
- * to send more.
+ * Block if we have unsent data. Let WalSndWaitForWal() handle
any
+ * other blocking; idle receivers need its additional actions.
*/
- if ((WalSndCaughtUp && !streamingDoneSending) ||
pq_is_send_pending())
+ if (pq_is_send_pending())
{
long sleeptime;
int wakeEvents;
wakeEvents = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH |
WL_TIMEOUT |
- WL_SOCKET_READABLE;
+ WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
/*
* Use fresh timestamp, not last_processing, to reduce
the chance
@@ -2335,9 +2333,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
*/
sleeptime =
WalSndComputeSleeptime(GetCurrentTimestamp());
- if (pq_is_send_pending())
- wakeEvents |= WL_SOCKET_WRITEABLE;
-
/* Sleep until something happens or we time out */
(void) WaitLatchOrSocket(MyLatch, wakeEvents,
MyProcPort->sock, sleeptime,