On Thu, 13 Feb 2025 at 15:54, vignesh C <vignes...@gmail.com> wrote:
>
> On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok.kyal....@gmail.com> wrote:
> >
> > Hi,
> >
> > Currently, we can copy an invalidated slot using the function
> > 'pg_copy_logical_replication_slot'. As per the suggestion in the
> > thread [1], we should prohibit copying of such slots.
> >
> > I have created a patch to address the issue.
>
> This patch does not fix all the copy_replication_slot scenarios
> completely, there is a very corner concurrency case where an
> invalidated slot still gets copied:
> +       /* We should not copy invalidated replication slots */
> +       if (src_isinvalidated)
> +               ereport(ERROR,
> +
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                errmsg("cannot copy an invalidated
> replication slot")));
>
> Consider the following scenario:
> step 1) Set up streaming replication between the primary and standby nodes.
> step 2) Create a logical replication slot (test1) on the standby node.
> step 3) Have a breakpoint in InvalidatePossiblyObsoleteSlot if cause
> is RS_INVAL_WAL_LEVEL, no need to hold other invalidation causes or
> add a sleep in InvalidatePossiblyObsoleteSlot function like below:
> if (cause == RS_INVAL_WAL_LEVEL)
> {
> while (bsleep)
> sleep(1);
> }
> step 4) Reduce wal_level on the primary to replica and restart the primary 
> node.
> step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1',
> 'test2');  -- It will wait till the lock held by
> InvalidatePossiblyObsoleteSlot is released while trying to create a
> slot.
> step 6) Increase wal_level back to logical on the primary node and
> restart the primary.
> step 7) Now allow the invalidation to happen (continue the breakpoint
> held at step 3), the replication control lock will be released and the
> invalidated slot will be copied
>
> After this:
> postgres=# SELECT 'copy' FROM
> pg_copy_logical_replication_slot('test1', 'test2');
>  ?column?
> ----------
>  copy
> (1 row)
>
> -- The invalidated slot (test1) is copied successfully:
> postgres=# select * from pg_replication_slots ;
>  slot_name |    plugin     | slot_type | datoid | database | temporary
> | active | active_pid | xmin | catalog_xmin | restart_lsn |
> confirmed_flush_lsn | wal_status | safe_wal_size | two_phas
> e |          inactive_since          | conflicting |
> invalidation_reason   | failover | synced
> -----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+---------
> --+----------------------------------+-------------+------------------------+----------+--------
>  test1     | test_decoding | logical   |      5 | postgres | f
> | f      |            |      |          745 | 0/4029060   | 0/4029098
>          | lost       |               | f
>   | 2025-02-13 15:26:54.666725+05:30 | t           |
> wal_level_insufficient | f        | f
>  test2     | test_decoding | logical   |      5 | postgres | f
> | f      |            |      |          745 | 0/4029060   | 0/4029098
>          | reserved   |               | f
>   | 2025-02-13 15:30:30.477836+05:30 | f           |
>      | f        | f
> (2 rows)
>
> -- A subsequent attempt to decode changes from the invalidated slot
> (test2) fails:
> postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
> WARNING:  detected write past chunk end in TXN 0x5e77e6c6f300
> ERROR:  logical decoding on standby requires "wal_level" >= "logical"
> on the primary
>
> -- Alternatively, the following error may occur:
> postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
> WARNING:  detected write past chunk end in TXN 0x582d1b2d6ef0
>     data
> ------------
>  BEGIN 744
>  COMMIT 744
> (2 rows)
>
> This is an edge case that can occur under specific conditions
> involving replication slot invalidation when there is a huge lag
> between primary and standby.
> There might be a similar concurrency case for wal_removed too.
>

Hi Vignesh,

Thanks for reviewing the patch.

I have tested the above scenario and was able to reproduce it. I have
fixed it in the v2 patch.
Currently we are taking a shared lock on ReplicationSlotControlLock.
This issue can be resolved if we take an exclusive lock instead.
Thoughts?

Thanks and Regards,
Shlok Kyal

Attachment: v2-0001-Restrict-copying-of-invalidated-replication-slots.patch
Description: Binary data

Reply via email to