On Fri, Jan 31, 2025 at 11:19 AM Peter Smith <smithpb2...@gmail.com> wrote:
...
> On Fri, Jan 31, 2025 at 1:02 AM vignesh C <vignes...@gmail.com> wrote:
> >
> > On Thu, 30 Jan 2025 at 16:02, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > I have made minor changes in the attached. The main change is to raise
> > > an ERROR before we broadcast to let everybody know we've modified this
> > > slot. There is no point in broadcasting if the slot is unusable.
> > >
> > > >   - Updated the test files to match the new error message.
> > > >
> > >
> > > -   /ERROR:  cannot alter invalid replication slot 
> > > "vacuum_full_inactiveslot"/
> > > +   /ERROR:  can no longer access replication slot 
> > > "vacuum_full_inactiveslot"/
> > >
> > > - "can no longer get changes from replication slot
> > > \"shared_row_removal_activeslot\""
> > > + "can no longer access replication slot 
> > > \"shared_row_removal_activeslot\""
> > >
> > > With the above changes, we made the ERROR message generic which was
> > > required to raise it from the central place. This looks reasonable to
> > > me. The other option that occurred to me is to write it as: "can no
> > > longer use replication slot "vacuum_full_inactiveslot" to access WAL
> > > or modify it" but I find the one used currently in the patch better as
> > > this is longer and may need modification in future if we start using
> > > slot for something else.
> >
> > One of the test was failing because MyReplicationSlot was not set to s
> > before erroring out which resulted in:
> > TRAP: failed Assert("s->data.persistency == RS_TEMPORARY"), File:
> > "slot.c", Line: 777, PID: 280121
> > postgres: primary: walsender vignesh [local]
> > START_REPLICATION(ExceptionalCondition+0xbb)[0x578612cd9289]
> > postgres: primary: walsender vignesh [local]
> > START_REPLICATION(ReplicationSlotCleanup+0x12b)[0x578612a32348]
> > postgres: primary: walsender vignesh [local]
> > START_REPLICATION(WalSndErrorCleanup+0x5e)[0x578612a41995]
> >
> > Fixed it by setting MyReplicationSlot just before erroring out so that
> > WalSndErrorCleanup function will have access to it. I also moved the
> > error reporting above as there is no need to check for slot is active
> > for pid in case of invalidated slots. The attached patch has the
> > changes for the same.
> >
>
> 3.
> + /* We made this slot active, so it's ours now. */
> + MyReplicationSlot = s;
> +
> + /* Invalid slots can't be modified or used before accessing the WAL. */
> + if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("can no longer access replication slot \"%s\"",
> +    NameStr(s->data.name)),
> + errdetail("This replication slot has been invalidated due to \"%s\".",
> +   SlotInvalidationCauses[s->data.invalidated]));
> +
>
> This change looked dubious. I understand that it was done to avoid a
> TRAP, but I am not sure if this is the correct fix. At the point of
> that ereport ERROR this slot doesn't yet belong to us because we are
> still determining *can* we acquire this slot or not, so I felt it
> doesn't seem correct to have the MyReplicationSlot code/comment ("it's
> ours now") come before the ERROR.
>
> Furthermore, having the code/comment "We made this slot active, so
> it's ours now" ahead of the other code/comment "... but it's already
> active in another process..." is contradictory.
>

Hi Nisha.

Further to my previous post, I was experimenting with an alternate fix
for the TAP test error reported by Vignesh.

Please see my diffs --  apply this atop the v4-0001 patch.

Basically, I have just moved the invalidation ereport to be earlier in
the code. IIUC, the TRAP might have been due to the v4-0001 still
allowing the slot active_pid being modified to MyProcPid prematurely
even when we did not end up acquiring the invalidated slot.

Note, I also added a LWLockRelease to match the existing/adjacent ereport.

Anyway, please consider it. The recovery and subscription TAP test are
working for me.

======
Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 74f7d56..29f0af9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -565,6 +565,19 @@ retry:
                                                name)));
        }
 
+       /* Invalid slots can't be modified or used before accessing the WAL. */
+       if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
+       {
+               LWLockRelease(ReplicationSlotControlLock);
+
+               ereport(ERROR,
+                               
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                               errmsg("can no longer access replication slot 
\"%s\"",
+                                          NameStr(s->data.name)),
+                               errdetail("This replication slot has been 
invalidated due to \"%s\".",
+                                                 
SlotInvalidationCauses[s->data.invalidated]));
+       }
+
        /*
         * This is the slot we want; check if it's active under some other
         * process.  In single user mode, we don't need this check.
@@ -589,18 +602,6 @@ retry:
                active_pid = MyProcPid;
        LWLockRelease(ReplicationSlotControlLock);
 
-       /* We made this slot active, so it's ours now. */
-       MyReplicationSlot = s;
-
-       /* Invalid slots can't be modified or used before accessing the WAL. */
-       if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
-               ereport(ERROR,
-                               
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-                               errmsg("can no longer access replication slot 
\"%s\"",
-                                          NameStr(s->data.name)),
-                               errdetail("This replication slot has been 
invalidated due to \"%s\".",
-                                                 
SlotInvalidationCauses[s->data.invalidated]));
-
        /*
         * If we found the slot but it's already active in another process, we
         * wait until the owning process signals us that it's been released, or
@@ -628,6 +629,9 @@ retry:
        /* Let everybody know we've modified this slot */
        ConditionVariableBroadcast(&s->active_cv);
 
+       /* We made this slot active, so it's ours now. */
+       MyReplicationSlot = s;
+
        /*
         * The call to pgstat_acquire_replslot() protects against stats for a
         * different slot, from before a restart or such, being present during

Reply via email to