Hi,

While testing the replication slot sync feature, I observed that the slotsync
worker can suddenly start emitting four log messages every 200 ms, even when
both primary and standby are idle. For example:

--------------------------
2026-02-27 23:06:46.123 JST [80789] LOG:  starting logical decoding for slot "logical_slot"
2026-02-27 23:06:46.123 JST [80789] DETAIL:  Streaming transactions committing after 0/03000140, reading WAL from 0/03000098.
2026-02-27 23:06:46.123 JST [80789] LOG:  logical decoding found consistent point at 0/03000098
2026-02-27 23:06:46.123 JST [80789] DETAIL:  There are no running transactions.
2026-02-27 23:06:46.330 JST [80789] LOG:  starting logical decoding for slot "logical_slot"
2026-02-27 23:06:46.330 JST [80789] DETAIL:  Streaming transactions committing after 0/03000140, reading WAL from 0/03000098.
2026-02-27 23:06:46.330 JST [80789] LOG:  logical decoding found consistent point at 0/03000098
2026-02-27 23:06:46.330 JST [80789] DETAIL:  There are no running transactions.
2026-02-27 23:06:46.536 JST [80789] LOG:  starting logical decoding for slot "logical_slot"
2026-02-27 23:06:46.536 JST [80789] DETAIL:  Streaming transactions committing after 0/03000140, reading WAL from 0/03000098.
2026-02-27 23:06:46.536 JST [80789] LOG:  logical decoding found consistent point at 0/03000098
2026-02-27 23:06:46.536 JST [80789] DETAIL:  There are no running transactions.
--------------------------

These messages repeat roughly every 200 ms.


I created the replication slot sync environment as follows:

--------------------------
initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput \
    --enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"

pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres"
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start
--------------------------


After that, I executed the following, and then the issue occurred:

--------------------------
SELECT pg_logical_emit_message(true, 'abc', 'xyz');

SELECT pg_replication_slot_advance('logical_slot', max(lsn)) FROM
pg_logical_slot_peek_binary_changes('logical_slot', NULL, NULL,
'proto_version', '3', 'publication_names', 'mypub', 'messages',
'true', 'binary', 'false', 'streaming', 'false');

-- Wait for the log message "newly created replication slot
-- "logical_slot" is sync-ready now" to appear in the standby's log

SELECT pg_replication_slot_advance('logical_slot', max(lsn)) FROM
pg_logical_slot_peek_binary_changes('logical_slot', NULL, NULL,
'proto_version', '3', 'publication_names', 'mypub', 'messages',
'true', 'binary', 'false', 'streaming', 'false');
--------------------------


While the issue is happening, the failover logical slot shows:

[PRIMARY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
  slot_name   | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
 logical_slot | 0/03000140  | 0/03000140

[STANDBY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
  slot_name   | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
 logical_slot | 0/03000098  | 0/03000140

confirmed_flush_lsn matches on both servers, but restart_lsn differs.


Normally, the slotsync worker updates the standby slot using the primary's slot
state. However, when confirmed_flush_lsn matches but restart_lsn does not,
the worker does not actually update the standby slot. Despite that, the current
code in update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the minimum
interval (200 ms) before retrying. In the next cycle, it again assumes
an update happened, so it keeps looping with the short sleep interval,
which produces the repeated logical decoding log messages. Based on a quick
analysis, this seems to be the root cause.
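
To make the sleep behavior described above concrete, here is a minimal
standalone sketch in C. It is not the PostgreSQL source: next_nap_ms() is
a hypothetical helper, MAX_NAP_MS and the doubling policy are assumptions
for illustration, and only the 200 ms minimum comes from the description
above.

```c
#include <assert.h>
#include <stdbool.h>

/* 200 ms is the minimum interval mentioned in this report. */
#define MIN_NAP_MS 200L
/* Assumed upper bound, chosen only for illustration. */
#define MAX_NAP_MS 30000L

/*
 * Hypothetical helper: pick the next sleep interval for the worker.
 *
 * If the last cycle reported that some slot was updated, fall back to
 * the minimum interval. Otherwise back off by doubling the interval,
 * capped at MAX_NAP_MS (the doubling is an assumption of this sketch).
 */
static long
next_nap_ms(long current_ms, bool some_slot_updated)
{
    assert(current_ms >= MIN_NAP_MS && current_ms <= MAX_NAP_MS);

    if (some_slot_updated)
        return MIN_NAP_MS;

    if (current_ms * 2 > MAX_NAP_MS)
        return MAX_NAP_MS;

    return current_ms * 2;
}
```

Under this model, a cycle that always reports "updated" keeps the
interval pinned at 200 ms indefinitely, which matches the log cadence
shown earlier. A cycle that reports "not updated" lets the interval grow.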

I think update_local_synced_slot() should return false (i.e., no update
happened) when confirmed_flush_lsn is equal but restart_lsn differs between
primary and standby. That would allow the worker to use the normal sleep
interval instead of the minimum one.
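
The proposed return value can be sketched as a small standalone C
function. This is only an illustration of the idea and not the attached
patch: the SlotLsns struct and report_slot_updated() are hypothetical
names, LSNs are modeled as plain 64-bit integers, and the treatment of
every case other than the one in this report is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the two LSNs compared in this report. */
typedef struct SlotLsns
{
    uint64_t restart_lsn;
    uint64_t confirmed_flush_lsn;
} SlotLsns;

/*
 * Hypothetical decision helper: should the sync cycle report that the
 * standby slot was updated?
 */
static bool
report_slot_updated(const SlotLsns *primary, const SlotLsns *standby)
{
    /* Identical state: nothing to update. */
    if (primary->restart_lsn == standby->restart_lsn &&
        primary->confirmed_flush_lsn == standby->confirmed_flush_lsn)
        return false;

    /*
     * The case from this report: confirmed_flush_lsn already matches and
     * only restart_lsn differs. The standby slot is not actually changed,
     * so report "no update" and let the caller use the normal interval.
     */
    if (primary->confirmed_flush_lsn == standby->confirmed_flush_lsn)
        return false;

    /* Assumption of this sketch: any other difference counts as an update. */
    return true;
}
```

With the values from the queries above (primary 0/03000140 and
0/03000140, standby 0/03000098 and 0/03000140), this sketch reports no
update.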

I've attached a PoC patch implementing this change.

Thoughts?

Regards,

-- 
Fujii Masao

Attachment: v1-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch