Hi all, Migrating a large database from 13 to 17 using a logical replication. Postgres - SaaS (GCP). Subscriber - 17.7. Publisher - 13.22.
Plan: 1. create publication [pub] 2. create logical replication slot [pub] 3. copy instance [pub -> sub] 4. extract the LSN of the "redo done" point from the log [sub] 5. upgrade the copied instance to version 17 [sub] 6. adjust the LSN of the replication slot (pg_replication_slot_advance) [pub] 7. create subscription [sub] There is a constant load on the Primary with inserts and updates. The problem is: After creating a Subscription, a transaction that had already been committed on the Subscriber during the restore, is applied again. As a result, getting PK violation on INSERT. If you compare LSNs, "redo done at" point is indeed before the COMMIT of the transaction. But the conflicting row already exists on the Subscriber. Meanwhile, if move *confirmed_flush_lsn* of the replication slot by 1 byte, replication goes further without issues. But again, formally this LSN is still before the COMMIT. Example: redo done at - *276/8FEE68E0*, which is before the last DML ( *276/8FEE6458*) and COMMIT (*276/8FEE6910*). Records around this COMMIT (*xid 13347755*) ordered by lsn: | lsn | older_than_redo | xid | operator | cmt | | 276/8FEE4BB0 | false | 13347755 | INSERT | | | 276/8FEE5FC0 | false | 13347755 | UPDATE | | | 276/8FEE6458 | false | 13347755 | UPDATE | | | 276/8FEE6910 | true | 13347755 | COMMIT | COMMIT 13347755 | | 276/8FEE6910 | true | 13347756 | BEGIN | BEGIN 13347756 | | 276/8FEE6910 | true | 13347756 | UPDATE | | | 276/8FEE6980 | true | 13347756 | INSERT | | (the second column is just "*lsn > '276/8FEE68E0'::pg_lsn*") And advanced by 1 byte slot position is "*276/8FEE68E1*"*.* Please help to understand what is going on: *1.* Why after copying the instance a transaction that has COMMIT LSN lower than *"Redo Done At*" point LSN appears to be already committed? *2.* Why shifting the *confirmed_flush_lsn* of the replication slot by *1 byte* fixes the problem? *3.* From the consistency perspective, how safe is this trick with advancing the replication slot position by 1 byte? Thank you, -- Pavel Suderevsky
