On Fri, Jan 31, 2025 at 11:19 AM Peter Smith <smithpb2...@gmail.com> wrote: ... > On Fri, Jan 31, 2025 at 1:02 AM vignesh C <vignes...@gmail.com> wrote: > > > > On Thu, 30 Jan 2025 at 16:02, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > I have made minor changes in the attached. The main change is to raise > > > an ERROR before we broadcast to let everybody know we've modified this > > > slot. There is no point in broadcasting if the slot is unusable. > > > > > > > - Updated the test files to match the new error message. > > > > > > > > > > - /ERROR: cannot alter invalid replication slot > > > "vacuum_full_inactiveslot"/ > > > + /ERROR: can no longer access replication slot > > > "vacuum_full_inactiveslot"/ > > > > > > - "can no longer get changes from replication slot > > > \"shared_row_removal_activeslot\"" > > > + "can no longer access replication slot > > > \"shared_row_removal_activeslot\"" > > > > > > With the above changes, we made the ERROR message generic which was > > > required to raise it from the central place. This looks reasonable to > > > me. The other option that occurred to me is to write it as: "can no > > > longer use replication slot "vacuum_full_inactiveslot" to access WAL > > > or modify it" but I find the one used currently in the patch better as > > > this is longer and may need modification in future if we start using > > > slot for something else. > > > > One of the test was failing because MyReplicationSlot was not set to s > > before erroring out which resulted in: > > TRAP: failed Assert("s->data.persistency == RS_TEMPORARY"), File: > > "slot.c", Line: 777, PID: 280121 > > postgres: primary: walsender vignesh [local] > > START_REPLICATION(ExceptionalCondition+0xbb)[0x578612cd9289] > > postgres: primary: walsender vignesh [local] > > START_REPLICATION(ReplicationSlotCleanup+0x12b)[0x578612a32348] > > postgres: primary: walsender vignesh [local] > > START_REPLICATION(WalSndErrorCleanup+0x5e)[0x578612a41995] > > > > Fixed it by setting MyReplicationSlot just before erroring out so that > > WalSndErrorCleanup function will have access to it. I also moved the > > error reporting above as there is no need to check for slot is active > > for pid in case of invalidated slots. The attached patch has the > > changes for the same. > > > > 3. > + /* We made this slot active, so it's ours now. */ > + MyReplicationSlot = s; > + > + /* Invalid slots can't be modified or used before accessing the WAL. */ > + if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE) > + ereport(ERROR, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("can no longer access replication slot \"%s\"", > + NameStr(s->data.name)), > + errdetail("This replication slot has been invalidated due to \"%s\".", > + SlotInvalidationCauses[s->data.invalidated])); > + > > This change looked dubious. I understand that it was done to avoid a > TRAP, but I am not sure if this is the correct fix. At the point of > that ereport ERROR this slot doesn't yet belong to us because we are > still determining *can* we acquire this slot or not, so I felt it > doesn't seem correct to have the MyReplicationSlot code/comment ("it's > ours now") come before the ERROR. > > Furthermore, having the code/comment "We made this slot active, so > it's ours now" ahead of the other code/comment "... but it's already > active in another process..." is contradictory. >
Hi Nisha. Further to my previous post, I was experimenting with an alternate fix for the TAP test error reported by Vignesh. Please see my diffs -- apply this atop the v4-0001 patch. Basically, I have just moved the invalidation ereport to be earlier in the code. IIUC, the TRAP might have been due to the v4-0001 still allowing the slot active_pid being modified to MyProcPid prematurely even when we did not end up acquiring the invalidated slot. Note, I also added a LWLockRelease to match the existing/adjacent ereport. Anyway, please consider it. The recovery and subscription TAP test are working for me. ====== Kind Regards, Peter Smith. Fujitsu Australia
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 74f7d56..29f0af9 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -565,6 +565,19 @@ retry: name))); } + /* Invalid slots can't be modified or used before accessing the WAL. */ + if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE) + { + LWLockRelease(ReplicationSlotControlLock); + + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("can no longer access replication slot \"%s\"", + NameStr(s->data.name)), + errdetail("This replication slot has been invalidated due to \"%s\".", + SlotInvalidationCauses[s->data.invalidated])); + } + /* * This is the slot we want; check if it's active under some other * process. In single user mode, we don't need this check. @@ -589,18 +602,6 @@ retry: active_pid = MyProcPid; LWLockRelease(ReplicationSlotControlLock); - /* We made this slot active, so it's ours now. */ - MyReplicationSlot = s; - - /* Invalid slots can't be modified or used before accessing the WAL. */ - if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE) - ereport(ERROR, - errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("can no longer access replication slot \"%s\"", - NameStr(s->data.name)), - errdetail("This replication slot has been invalidated due to \"%s\".", - SlotInvalidationCauses[s->data.invalidated])); - /* * If we found the slot but it's already active in another process, we * wait until the owning process signals us that it's been released, or @@ -628,6 +629,9 @@ retry: /* Let everybody know we've modified this slot */ ConditionVariableBroadcast(&s->active_cv); + /* We made this slot active, so it's ours now. */ + MyReplicationSlot = s; + /* * The call to pgstat_acquire_replslot() protects against stats for a * different slot, from before a restart or such, being present during