On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> 
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapil...@gmail.com>
> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if, for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
> 
> I've tried this case and am concerned that the slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on the
> primary, the primary server keeps advancing its XID in the meantime. On the
> standby, we ensure that TransamVariables->nextXid is beyond the XID of the WAL
> record that it's going to apply, so the xmin horizon calculated by
> GetOldestSafeDecodingTransactionId() always ends up being higher than the
> slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot" and
> clean up the slot on the standby at the end of pg_sync_replication_slots().

I think the issue occurs because, unlike the slotsync worker, the SQL API
removes temporary slots when the function ends, so it cannot hold back the
standby's catalog_xmin. If transactions on the primary keep advancing XIDs, the
source slot's catalog_xmin on the primary never catches up with the standby's
nextXid, causing the sync to fail.
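
To make the difference concrete, here is a toy model in Python. It is only a
sketch, not the actual slotsync code: the function name, the parameters, and
the fixed "lag" between the standby's nextXid and the remote slot's
catalog_xmin are all assumptions made for illustration. XID wraparound and
restart_lsn are ignored.

```python
def first_successful_cycle(keep_temp_slot, cycles=10, start_xid=1000,
                           xids_per_cycle=100, lag=50):
    """Return the first sync cycle that succeeds, or None if none does.

    Hypothetical model: keep_temp_slot=True mimics the slotsync worker
    (the temporary slot survives between cycles); keep_temp_slot=False
    mimics the SQL function (the temporary slot is dropped on return).
    """
    local_catalog_xmin = None  # no local slot yet
    for cycle in range(cycles):
        # The standby's nextXid keeps moving as it replays primary WAL.
        standby_next_xid = start_xid + cycle * xids_per_cycle
        # Assumption: the remote slot always trails by a fixed lag.
        remote_catalog_xmin = standby_next_xid - lag

        if local_catalog_xmin is None:
            # A new temporary slot starts from the current horizon.
            local_catalog_xmin = standby_next_xid

        # Sync succeeds only if the remote slot does not precede the local one.
        if remote_catalog_xmin >= local_catalog_xmin:
            return cycle

        if not keep_temp_slot:
            # The SQL API drops the temporary slot when the function ends.
            local_catalog_xmin = None
    return None


print(first_successful_cycle(keep_temp_slot=False))
print(first_successful_cycle(keep_temp_slot=True))
```

In this model the retained slot succeeds as soon as the remote catalog_xmin
passes the value fixed in the first cycle, while the dropped-and-recreated slot
chases a moving target and never succeeds within the simulated cycles.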
 
We chose this behavior because we could not predict when (or if) the SQL
function might be executed again, and the creating session might persist after
promotion. Without automatic cleanup, this could lead to temporary slots being
retained for a long time.
 
This only affects the initial sync when creating a new slot on the standby.
Once the slot exists, the standby's catalog_xmin stabilizes, preventing the
issue in subsequent syncs.
 
I think the SQL API was mainly intended for testing and debugging purposes
where controlled sync operations are useful. For production use, the slotsync
worker (with sync_replication_slots=on) is recommended because it automatically
handles this problem and requires minimal manual intervention. But to avoid
confusion, I think we should clearly document this distinction.

Best Regards,
Hou zj
