On Tue, Apr 22, 2025 at 12:36 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> wrote: > > When analyzing some issues in another thread[1], I found a bug that fast > forward > decoding could lead to premature advancement of catalog_xmin, resulting in > required catalog data being removed by vacuum. > > The issue arises because we do not build a base snapshot when decoding changes > during fast forward mode, preventing reference to the minimum transaction ID > that remains visible in the snapshot when determining the candidate for > catalog_xmin. As a result, catalog_xmin was directly advanced to the oldest > running transaction ID found in the latest running_xacts record. > > In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot could > not > be reached during fast forward decoding, resulting in > rb->txns_by_base_snapshot_lsn being NULL. When advancing catalog_xmin, the > system attempts to refer to rb->txns_by_base_snapshot_lsn via > ReorderBufferGetOldestXmin(). However, since rb->txns_by_base_snapshot_lsn is > NULL, it defaults to directly using running->oldestRunningXid as the candidate > for catalog_xmin. For reference, see the details in > SnapBuildProcessRunningXacts. > > See the attachment for a test(0002) to prove that the catalog data that are > still required would be removed after premature catalog_xmin advancement > during > fast forward decoding. > > To fix this, I think we can allow the base snapshot to be built during fast > forward decoding, as implemented in the patch 0001 (We already built base > snapshot in fast-forward mode for logical message in logicalmsg_decode()). >
The same bug was fixed for non-fast_forward mode in commit f49a80c4. See the following code in that commit: - LogicalIncreaseXminForSlot(lsn, running->oldestRunningXid); + xmin = ReorderBufferGetOldestXmin(builder->reorder); + if (xmin == InvalidTransactionId) + xmin = running->oldestRunningXid; .... + LogicalIncreaseXminForSlot(lsn, xmin); Is my understanding correct? If so, then I think it is a miss in the commit f49a80c4 to consider fast_forward mode. Please refer commit in the commit message. The code changes look correct to me. I have some suggestions related to comments in the code. See attached. Please prepare patches for back branches as well. -- With Regards, Amit Kapila.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index 8dc467a8bb3..cc03f0706e9 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -413,10 +413,10 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) /* * If we don't have snapshot or we are just fast-forwarding, there is no * point in decoding data changes. However, it's crucial to build the base - * snapshot during fast-forward mode (as done in SnapBuildProcessChange()) - * because the snapshot's minimum visible transaction ID (xmin) must be - * referenced when determining the candidate catalog_xmin for the - * replication slot. + * snapshot during fast-forward mode (as is done in + * SnapBuildProcessChange()) because we require the snapshot's xmin when + * determining the candidate catalog_xmin for the replication slot. See + * SnapBuildProcessRunningXacts(). */ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) return; @@ -477,10 +477,10 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) /* * If we don't have snapshot or we are just fast-forwarding, there is no * point in decoding data changes. However, it's crucial to build the base - * snapshot during fast-forward mode (as done in SnapBuildProcessChange()) - * because the snapshot's minimum visible transaction ID (xmin) must be - * referenced when determining the candidate catalog_xmin for the - * replication slot. + * snapshot during fast-forward mode (as is done in + * SnapBuildProcessChange()) because we require the snapshot's xmin when + * determining the candidate catalog_xmin for the replication slot. See + * SnapBuildProcessRunningXacts(). */ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) return;