Hi, Commit f41d846 [1] introduced a race condition that allows a process to acquire and continue using an invalidated replication slot, leading to unexpected behavior.
Example Scenario: Suppose there is an orphaned logical slot that will be invalidated on the next checkpoint. Consider the following sequence of events during ReplicationSlotAcquire() - - A process (P1) attempts to acquire the slot. It checks the invalidated flag and sees that the slot is still valid, so it proceeds to set the s->active_pid to its own PID. - Before P1 can acquire the spinlock and update s->active_pid = MyProcPid, the checkpointer process invalidates the slot. - Now, P1 proceeds to set s->active_pid = MyProcPid and starts using an already invalidated slot, which should not happen. Using an invalidated slot should be prevented in the first place. In my test, this scenario triggered an assertion failure in CreateDecodingContext() - ``` TRAP: failed Assert("slot->data.invalidated == RS_INVAL_NONE"), File: "logical.c", Line: 546, PID: 1042884 postgres: nisha postgres [local] SELECT(ExceptionalCondition+0xbb)[0x649ff7696c82] postgres: nisha postgres [local] SELECT(CreateDecodingContext+0x23b)[0x649ff73bf9d3] postgres: nisha postgres [local] SELECT(+0x61fb71)[0x649ff73c3b71] postgres: nisha postgres [local] SELECT(pg_logical_slot_get_changes+0x26)[0x649ff73c3ed6] ... ``` The expected behavior was for ReplicationSlotAcquire() to detect the invalid slot and return an error instead of allowing its usage. Attached the patch v1 fixing this issue. Issue reported by: Hou Zhijie. [1] https://github.com/postgres/postgres/commit/f41d8468ddea34170fe19fdc17b5a247e7d3ac78 -- Thanks, Nisha
v1-0001-Fix-race-condition-in-ReplicationSlotAcquire-indr.patch
Description: Binary data