On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
>
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if due for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
>
> I've tried this case and am concerned that the slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on the
> primary, the primary server keeps advancing its XID in the meanwhile. On the
> standby, we ensure that the TransamVariables->nextXid is beyond the XID of
> WAL record that it's going to apply so the xmin horizon calculated by
> GetOldestSafeDecodingTransactionId() ends up always being higher than the
> slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot" and
> cleanup the slot on the standby at the end of pg_sync_replication_slots().

I think the issue occurs because, unlike the slotsync worker, the SQL API
removes temporary slots when the function ends, so it cannot hold back the
standby's catalog_xmin between calls. If transactions on the primary keep
advancing XIDs, the source slot's catalog_xmin on the primary never catches up
with the standby's nextXid, so every sync attempt fails.

We chose this behavior because we could not predict when (or if) the SQL
function might be executed again, and the creating session might persist after
promotion. Without automatic cleanup, temporary slots could be retained for a
long time.

This only affects the initial sync, when a new slot is created on the standby.
Once the slot exists, the standby's catalog_xmin stabilizes, which prevents the
issue in subsequent syncs.

I think the SQL API was mainly intended for testing and debugging purposes,
where controlled sync operations are useful. For production use, the slotsync
worker (with sync_replication_slots=on) is recommended because it handles this
problem automatically and requires minimal manual intervention. But to avoid
confusion, I think we should clearly document this distinction.

Best Regards,
Hou zj
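
To illustrate the difference described above, here is a toy model in Python. It is only a sketch and not PostgreSQL code: the function name, its parameters, and all the numbers are made up for illustration, and the real check is reduced to a single integer comparison between the remote slot's catalog_xmin and the xmin the standby picks when it creates the local slot. It assumes the consumer on the primary always trails nextXid by a fixed amount.

```python
def simulate(retain_local_slot, cycles=5, xids_per_cycle=100, consumer_lag=10):
    """Return the 1-based sync cycle that succeeds, or None if none does.

    retain_local_slot=True  roughly models the slotsync worker, which keeps
                            the temporary local slot between cycles.
    retain_local_slot=False roughly models pg_sync_replication_slots(),
                            which drops the temporary slot when it returns.
    All names and numbers here are hypothetical.
    """
    primary_next_xid = 1000
    local_catalog_xmin = None  # no local slot on the standby yet

    for cycle in range(1, cycles + 1):
        # The primary keeps receiving write transactions.
        primary_next_xid += xids_per_cycle
        # The remote slot is being consumed, but it trails nextXid.
        remote_catalog_xmin = primary_next_xid - consumer_lag
        # Assumption: the standby has replayed up to the primary's nextXid.
        standby_next_xid = primary_next_xid

        if local_catalog_xmin is None:
            # Initial sync: the new local slot starts at the standby's horizon.
            local_catalog_xmin = standby_next_xid

        if remote_catalog_xmin >= local_catalog_xmin:
            return cycle  # remote slot no longer precedes the local slot

        if not retain_local_slot:
            # SQL API: the temporary slot is cleaned up, so the next call
            # starts over with a fresh, higher horizon.
            local_catalog_xmin = None

    return None


if __name__ == "__main__":
    print("worker-like (slot retained):", simulate(retain_local_slot=True))
    print("SQL-API-like (slot dropped):", simulate(retain_local_slot=False))
```

In this model the retained slot succeeds on the second cycle, because the local horizon stays fixed while the remote catalog_xmin moves past it. The drop-and-recreate variant never succeeds as long as the primary keeps consuming XIDs.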