On Monday, February 17, 2025 7:31 PM Shlok Kyal <shlok.kyal....@gmail.com> wrote: > > On Thu, 13 Feb 2025 at 15:54, vignesh C <vignes...@gmail.com> wrote: > > > > On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok.kyal....@gmail.com> wrote: > > > > > > Hi, > > > > > > Currently, we can copy an invalidated slot using the function > > > 'pg_copy_logical_replication_slot'. As per the suggestion in the > > > thread [1], we should prohibit copying of such slots. > > > > > > I have created a patch to address the issue. > > > > This patch does not fix all the copy_replication_slot scenarios > > completely, there is a very corner concurrency case where an > > invalidated slot still gets copied: > > + /* We should not copy invalidated replication slots */ > > + if (src_isinvalidated) > > + ereport(ERROR, > > + > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("cannot copy an invalidated > > replication slot"))); > > > > Consider the following scenario: > > step 1) Set up streaming replication between the primary and standby nodes. > > step 2) Create a logical replication slot (test1) on the standby node. > > step 3) Have a breakpoint in InvalidatePossiblyObsoleteSlot if cause > > is RS_INVAL_WAL_LEVEL, no need to hold other invalidation causes or > > add a sleep in InvalidatePossiblyObsoleteSlot function like below: > > if (cause == RS_INVAL_WAL_LEVEL) > > { > > while (bsleep) > > sleep(1); > > } > > step 4) Reduce wal_level on the primary to replica and restart the primary > node. > > step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', > > 'test2'); -- It will wait till the lock held by > > InvalidatePossiblyObsoleteSlot is released while trying to create a > > slot. > > step 6) Increase wal_level back to logical on the primary node and > > restart the primary. > > step 7) Now allow the invalidation to happen (continue the breakpoint > > held at step 3), the replication control lock will be released and the > > invalidated slot will be copied > > > > After this: > > postgres=# SELECT 'copy' FROM > > pg_copy_logical_replication_slot('test1', 'test2'); ?column? > > ---------- > > copy > > (1 row) > > > > -- The invalidated slot (test1) is copied successfully: > > postgres=# select * from pg_replication_slots ; > > slot_name | plugin | slot_type | datoid | database | temporary > > | active | active_pid | xmin | catalog_xmin | restart_lsn | > > confirmed_flush_lsn | wal_status | safe_wal_size | two_phas > > e | inactive_since | conflicting | > > invalidation_reason | failover | synced > > > -----------+---------------+-----------+--------+----------+-----------+ > --------+------------+------+--------------+-------------+--------------- > ------+------------+---------------+--------- > > > --+----------------------------------+-------------+---------------------- > --+----------+-------- > > test1 | test_decoding | logical | 5 | postgres | f > > | f | | | 745 | 0/4029060 | 0/4029098 > > | lost | | f > > | 2025-02-13 15:26:54.666725+05:30 | t | > > wal_level_insufficient | f | f > > test2 | test_decoding | logical | 5 | postgres | f > > | f | | | 745 | 0/4029060 | 0/4029098 > > | reserved | | f > > | 2025-02-13 15:30:30.477836+05:30 | f | > > | f | f > > (2 rows) > > > > -- A subsequent attempt to decode changes from the invalidated slot > > (test2) fails: > > postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, > > NULL); > > WARNING: detected write past chunk end in TXN 0x5e77e6c6f300 > > ERROR: logical decoding on standby requires "wal_level" >= "logical" > > on the primary > > > > -- Alternatively, the following error may occur: > > postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, > > NULL); > > WARNING: detected write past chunk end in TXN 0x582d1b2d6ef0 > > data > > ------------ > > BEGIN 744 > > COMMIT 744 > > (2 rows) > > > > This is an edge case that can occur under specific conditions > > involving replication slot invalidation when there is a huge lag > > between primary and standby. > > There might be a similar concurrency case for wal_removed too. > > > > Hi Vignesh, > > Thanks for reviewing the patch.
Thanks for updating the patch. I have a question related to it. > > I have tested the above scenario and was able to reproduce it. I have fixed > it in > the v2 patch. > Currently we are taking a shared lock on ReplicationSlotControlLock. > This issue can be resolved if we take an exclusive lock instead. > Thoughts? It's not clear to me why increasing the lock level can solve it, could you elaborate a bit more on this ? Besides, do we need one more invalidated check in the following codes after creating the slot ? /* * Check if the source slot still exists and is valid. We regard it as * invalid if the type of replication slot or name has been changed, * or the restart_lsn either is invalid or has gone backward. (The ... Best Regards, Hou zj