On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <[email protected]> wrote: > > Some colleagues have reported that it is possible to finish with > orphaned records in pg_replication_origin_status, as an effect of > table synchronization workers that miss some cleanup actions around > replorigin_drop_by_name() when a subscription that spawned these > workers (which themselves create origins through replorigin_create() > called in LogicalRepSyncTableStart()) is dropped. > > Please find attached a small script, courtesy of Tenglong Gu and > Daisuke Higuchi, that have been digging into all that. > > This problem is true since ce0fdbfe9722 that has added the creation of > a replication origin for table synchronization workers, as of 14 up to > HEAD. One issue with these origin states is that they are sticky: a > restart of the node that held the subscription keeps them around due > to them getting flushed in replorigin_checkpoint. If my understanding > is right, this also means that origin slots are still tracked as in > use, implying that replorigin_session_setup() would not be able to > reuse them as far as I understand. > > So, in summary (please correct me if necessary), DropSubscription() > attempts a round of cleanup for all the replication origins with a > loop over ReplicationOriginNameForLogicalRep(), and while in this case > we attempt to drop the origins created by the workers that are tracked > as orphaned, the syscache lookup of REPLORIGIDENT fails within the > DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry: > the replication origin is already gone, but its state has persisted in > memory. Then, why would the replication origin be already gone with > its state in memory not cleaned up yet? Well, the problematic test > case shows that the subscription is dropped while the spawned > tablesync workers do the initial table copy, and the replication > origin is created in the same transaction as the COPY. DROP > SUBSCRIPTION tells the tablesync workers to stop, they stop with the > COPY failing and the origin is not seen as something that exists for > the session that drops the subscription. The replication state in > memory goes out of sync. > > So it looks like to me that we are missing a code path where the > replication origin is dropped but we just ignore to reset the > replication state in memory, leading to bloat of the origin slots > available. worker.c does things differently: a small transaction is > used for the origin creation, hence we would never really see the > replication state memory going out-of-sync with the replorigin > catalog. > > One solution would be for tablesync.c to do the same as worker.c: we > could use a short transaction that sets the memory state and creates > the replication origin, then move to a second transaction for the > COPY. A second solution, slightly more complicated, is to create some > specific logic to reset the progress state of the origin that's been > created in the transaction that runs the initial COPY, which I guess > should be in the shape of a transaction abort callback. All these > symptoms are pointing out that it is not a good idea, IMO, to expect a > replication origin to exist when dropping a subscription when an > origin has been created in the context of a transaction that can take > a very long time to run, aka the initial tablesync's COPY, so using a > short transaction feels like a good thing to do here. >
Yeah, either of these ways are okay. The only minor point in creating replication origin in a separate transaction is that the other part of operations on origin still need be done at the current place, so the origin handling will look a bit separated. So, I have a slight preference towards PG_ENSURE_ERROR_CLEANUP solution where the callback should clear origin state via replorigin_state_clear. -- With Regards, Amit Kapila.
