On Mon, Jan 15, 2024 at 1:18 PM Bertrand Drouvot
<bertranddrouvot...@gmail.com> wrote:
>
> Hi hackers,
>
> While working on [1], we discovered (thanks Alexander for the testing)
> that a conflicting active logical slot on a standby could be
> "terminated" without leading to an "obsolete" message (see [2]).
>
> Indeed, in the case of an active slot we proceed in 2 steps in
> InvalidatePossiblyObsoleteSlot():
>
> - terminate the backend holding the slot
> - report the slot as obsolete
>
> This is racy because between the two steps we release the mutex on the
> slot, which means that the slot's effective_xmin and
> effective_catalog_xmin could advance during that time (leading us to
> exit the loop).
>
> I think that holding the mutex longer is not an option (given what we'd
> have to do while holding it), so the attached proposal is instead to
> record the effective_xmin and effective_catalog_xmin that were used
> during the backend termination.
>
> [1]: https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
> [2]: https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com
>
> Looking forward to your feedback,
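To make sure we're describing the same window, here is a rough sketch
of the loop in InvalidatePossiblyObsoleteSlot() (heavily simplified,
with other invalidation causes and most details elided; not the actual
source):

/*
 * Simplified sketch of InvalidatePossiblyObsoleteSlot(), for
 * illustration only.
 */
for (;;)
{
	ReplicationSlotInvalidationCause conflict = RS_INVAL_NONE;

	SpinLockAcquire(&s->mutex);

	/*
	 * The conflict cause is recomputed from the slot's *current*
	 * effective_xmin/effective_catalog_xmin on every iteration.
	 */
	if (TransactionIdIsValid(snapshotConflictHorizon) &&
		TransactionIdPrecedesOrEquals(s->effective_xmin,
									  snapshotConflictHorizon))
		conflict = RS_INVAL_HORIZON;

	SpinLockRelease(&s->mutex);	/* mutex is dropped here */

	/*
	 * If the xmins advanced after we terminated the holder, we can
	 * land here with RS_INVAL_NONE and leave without ever marking
	 * the slot invalidated.
	 */
	if (conflict == RS_INVAL_NONE)
		break;

	if (active_pid != 0)
		(void) kill(active_pid, SIGTERM);	/* step 1: terminate */
	else
	{
		/* step 2: mark the slot invalidated ("obsolete") */
	}

	/* wait for the slot to be released, then loop around */
}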
IIUC, the issue is that we terminate the process holding the
replication slot, and the conflict cause that we recorded before
terminating the process changes in the next iteration due to the
advancement in effective_xmin and/or effective_catalog_xmin. FWIW,
test code something like [1] can help detect race issues of this
kind, right?

Some comments on the patch:

1.
         last_signaled_pid = active_pid;
+        terminated = true;
     }

Why is a separate variable needed? Isn't last_signaled_pid != 0
enough to determine that a process was terminated earlier?

2. If my understanding of the racy behavior is right, can the same
issue happen due to the advancement in restart_lsn? I'm not sure if
it can happen at all, but I think we can rely on the previous
conflict reason instead of the previous
effective_xmin/effective_catalog_xmin/restart_lsn.

3. Is there a way to reproduce this racy issue, maybe by adding some
sleeps or something? If yes, can you share it here, just for the
record, and to verify that whatever fix is provided actually works?
(See the sketch after [1] below for the kind of sleep I have in
mind.)

[1]
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..d020b038bc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;
 
 	for (;;)
 	{
@@ -1417,6 +1418,18 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			}
 		}
 
+		/*
+		 * Check if the conflict cause recorded previously, before we
+		 * terminated the process, has since changed for any reason.
+		 */
+		if (conflict_prev != RS_INVAL_NONE &&
+			last_signaled_pid != 0 &&
+			conflict_prev != conflict)
+			elog(PANIC, "conflict cause recorded before terminating process %d has been changed; previous cause: %d, current cause: %d",
+				 last_signaled_pid,
+				 conflict_prev,
+				 conflict);
+
 		/* if there's no conflict, we're done */
 		if (conflict == RS_INVAL_NONE)
 		{
@@ -1499,6 +1512,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			(void) kill(active_pid, SIGTERM);
 
 			last_signaled_pid = active_pid;
+			conflict_prev = conflict;
 		}
 
 		/* Wait until the slot is released. */
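On #3, to spell out what I mean by adding sleeps: something like the
following, injected right after the termination, might widen the
window enough to reproduce the issue (hypothetical and untested, for
illustration only):

/* hypothetical, untested: widen the race window after termination */
(void) kill(active_pid, SIGTERM);
last_signaled_pid = active_pid;

/*
 * Give the standby time to advance the slot's effective_xmin and
 * effective_catalog_xmin before the loop re-evaluates the conflict.
 */
pg_usleep(2000000L);	/* 2 seconds */

If the xmins do advance during the sleep, the next iteration should
compute RS_INVAL_NONE and exit the loop without the slot ever being
marked invalidated, which matches the reported behavior.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com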