On Thu, 13 Feb 2025 at 15:54, vignesh C <vignes...@gmail.com> wrote:
>
> On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok.kyal....@gmail.com> wrote:
> >
> > Hi,
> >
> > Currently, we can copy an invalidated slot using the function
> > 'pg_copy_logical_replication_slot'. As per the suggestion in the
> > thread [1], we should prohibit copying of such slots.
> >
> > I have created a patch to address the issue.
>
> This patch does not fix all the copy_replication_slot scenarios
> completely, there is a very corner concurrency case where an
> invalidated slot still gets copied:
> +       /* We should not copy invalidated replication slots */
> +       if (src_isinvalidated)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                errmsg("cannot copy an invalidated replication slot")));
>
> Consider the following scenario:
> step 1) Set up streaming replication between the primary and standby nodes.
> step 2) Create a logical replication slot (test1) on the standby node.
> step 3) Have a breakpoint in InvalidatePossiblyObsoleteSlot if cause
> is RS_INVAL_WAL_LEVEL, no need to hold other invalidation causes or
> add a sleep in InvalidatePossiblyObsoleteSlot function like below:
> if (cause == RS_INVAL_WAL_LEVEL)
> {
>     while (bsleep)
>         sleep(1);
> }
> step 4) Reduce wal_level on the primary to replica and restart the primary
> node.
> step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1',
> 'test2'); -- It will wait till the lock held by
> InvalidatePossiblyObsoleteSlot is released while trying to create a
> slot.
> step 6) Increase wal_level back to logical on the primary node and
> restart the primary.
> step 7) Now allow the invalidation to happen (continue the breakpoint
> held at step 3), the replication control lock will be released and the
> invalidated slot will be copied
>
> After this:
> postgres=# SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2');
>  ?column?
> ----------
>  copy
> (1 row)
>
> -- The invalidated slot (test1) is copied successfully:
> postgres=# select * from pg_replication_slots ;
>  slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase |          inactive_since          | conflicting |  invalidation_reason   | failover | synced
> -----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+----------------------------------+-------------+------------------------+----------+--------
>  test1     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | lost       |               | f         | 2025-02-13 15:26:54.666725+05:30 | t           | wal_level_insufficient | f        | f
>  test2     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | reserved   |               | f         | 2025-02-13 15:30:30.477836+05:30 | f           |                        | f        | f
> (2 rows)
>
> -- A subsequent attempt to decode changes from the invalidated slot
> (test2) fails:
> postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
> WARNING:  detected write past chunk end in TXN 0x5e77e6c6f300
> ERROR:  logical decoding on standby requires "wal_level" >= "logical"
> on the primary
>
> -- Alternatively, the following error may occur:
> postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
> WARNING:  detected write past chunk end in TXN 0x582d1b2d6ef0
>     data
> ------------
>  BEGIN 744
>  COMMIT 744
> (2 rows)
>
> This is an edge case that can occur under specific conditions
> involving replication slot invalidation when there is a huge lag
> between primary and standby.
> There might be a similar concurrency case for wal_removed too.
>
Hi Vignesh,

Thanks for reviewing the patch.

I have tested the above scenario and was able to reproduce it. I have
fixed it in the v2 patch.
Currently we are taking a shared lock on ReplicationSlotControlLock.
This issue can be resolved if we take an exclusive lock instead.
Thoughts?

Thanks and Regards,
Shlok Kyal
v2-0001-Restrict-copying-of-invalidated-replication-slots.patch
Description: Binary data
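
To make the locking idea in this message concrete, here is a toy C model. It
is not PostgreSQL code and not the contents of the v2 patch: ToySlot,
toy_slot_lock, toy_invalidate_slot and toy_copy_slot are hypothetical names,
and a plain pthread mutex is assumed as a stand-in for holding
ReplicationSlotControlLock in exclusive mode. The sketch only shows that when
the "is it invalidated?" check and the copy run inside one critical section,
an invalidation cannot land between the two.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a replication slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct ToySlot
{
    char          name[64];
    bool          in_use;
    bool          invalidated;
    unsigned long restart_lsn;
} ToySlot;

/* Hypothetical lock; assumed to play the role of an exclusive control lock. */
static pthread_mutex_t toy_slot_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mark a slot invalidated while holding the lock. */
void
toy_invalidate_slot(ToySlot *slot)
{
    assert(slot != NULL);
    pthread_mutex_lock(&toy_slot_lock);
    slot->invalidated = true;
    pthread_mutex_unlock(&toy_slot_lock);
}

/*
 * Copy src into dst under the name dst_name.
 * The invalidation check and the copy share one critical section, so
 * toy_invalidate_slot() cannot run between them.
 * Returns 0 on success, -1 if the source is unused or invalidated
 * (dst is left untouched in that case).
 */
int
toy_copy_slot(const ToySlot *src, ToySlot *dst, const char *dst_name)
{
    int rc = -1;

    assert(src != NULL && dst != NULL && dst_name != NULL);

    pthread_mutex_lock(&toy_slot_lock);
    if (src->in_use && !src->invalidated)
    {
        *dst = *src;
        strncpy(dst->name, dst_name, sizeof(dst->name) - 1);
        dst->name[sizeof(dst->name) - 1] = '\0';
        rc = 0;
    }
    pthread_mutex_unlock(&toy_slot_lock);

    return rc;
}
```

In this model, copying a valid slot returns 0 and yields a renamed copy, while
calling toy_copy_slot() after toy_invalidate_slot() on the source returns -1.
Whether an exclusive lock is the right trade-off in the real code path is the
open question asked above.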