On Tue, Aug 5, 2025 at 3:11 AM shveta malik <shveta.ma...@gmail.com> wrote:
>
> On Tue, Aug 5, 2025 at 5:14 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Mon, Aug 4, 2025 at 3:38 AM shveta malik <shveta.ma...@gmail.com> wrote:
> > >
> > > On Sat, Aug 2, 2025 at 4:53 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > > wrote:
> > > >
> > > > On Thu, Jul 31, 2025 at 5:00 AM Hayato Kuroda (Fujitsu)
> > > > <kuroda.hay...@fujitsu.com> wrote:
> > > > >
> > > > > Dear Sawada-san,
> > > > >
> > > > > > I thought we could fix this issue by checking the number of in-use
> > > > > > logical slots while holding ReplicationSlotControlLock and
> > > > > > LogicalDecodingControlLock, but it seems we need to deal with 
> > > > > > another
> > > > > > race condition too between backends and startup processes at the end
> > > > > > of recovery.
> > > > > >
> > > > > > Currently the backend skips controlling logical decoding status if 
> > > > > > the
> > > > > > server is in recovery (by checking RecoveryInProgress()), but it's
> > > > > > possible that a backend process tries to drop a logical slot after 
> > > > > > the
> > > > > > startup process calling UpdateLogicalDecodingStatusEndOfRecovery() 
> > > > > > and
> > > > > > before accepting writes.
> > > > >
> > > > > Right. I also verified on local and found that
> > > > > ReplicationSlotDropAcquired()->DisableLogicalDecodingIfNecessary() 
> > > > > sometimes
> > > > > skips to modify the status because RecoveryInProgress is still false.
> > > > >
> > > > > > In this case, the backend ends up not
> > > > > > disabling logical decoding and it remains enabled. I think we would
> > > > > > somehow need to delay the logical decoding status change in this
> > > > > > period until the recovery completes.
> > > > >
> > > > > My primitive idea was to 1) keep startup acquiring the lock till end 
> > > > > of recovery
> > > > > and 2) DisableLogicalDecodingIfNecessary() acquires lock before 
> > > > > checking the
> > > > > recovery status, but it could not work well. Not sure but 
> > > > > WaitForProcSignalBarrier()
> > > > > stucked if the process acquired LogicalDecodingControlLock lock....
> > > >
> > > > I think that it's not realistic to keep holding a lwlock until the
> > > > recovery actually completes because we perform a checkpoint after
> > > > that.
> > > >
> > > > In the latest version patch I attached, I introduce a flag on shared
> > > > memory to delay any logical decoding status change until the recovery
> > > > completes. The implementation got more complex than I expected but I
> > > > don't have a better idea. I'm open to other approaches. Also, I
> > > > incorporated all comments I got so far[1][2][3] and updated the
> > > > documentation.
> > > >
> > >
> > > Yes, it is slightly complex, I will put more thoughts into it. That
> > > said, I do have a related scenario in mind concerning the recent fix,
> > > where we might still end up with an incorrect effective_wal_level
> > > after promotion.
> > >
> > > Say primary has 'wal_level'=replica and standby has
> > > 'wal_level'=logical. Since there are no slots on standby
> > > 'effective_wal_level' will still be replica. Now I created a slot both
> > > on primary and standby making 'effective_wal_level'=logical. Now, when
> > > the standby is promoted and the slot is dropped immediately after
> > > UpdateLogicalDecodingStatusEndOfRecovery() releases the lock, we still
> > > expect the effective_wal_level on the promoted standby (now the
> > > primary) to remain logical, since its configured 'wal_level' is
> > > logical and it has become the primary. But I think that may not be the
> > > case because 
> > > 'DisableLogicalDecodingIfNecessary-->start_logical_decoding_status_change()'
> > > does not consider original wal_level on promoted standby in
> > > retrial-attempt. I feel 'retry' should be above ' wal_level ==
> > > WAL_LEVEL_LOGICAL' check in below code snippet:
> > >
> > > +static bool
> > > +start_logical_decoding_status_change(bool new_status)
> > > +{
> > > + /*
> > > + * On the primary with 'logical' WAL level, we can skip logical decoding
> > > + * status change as it's always enabled. On standbys, we need to check 
> > > the
> > > + * status on shared memory propagated from the primary and might handle
> > > + * status change delay.
> > > + */
> > > + if (!RecoveryInProgress() && wal_level == WAL_LEVEL_LOGICAL)
> > > + return false;
> > > +
> > > +retry:
> > > +
> > >
> > > Please note that I could not reproduce this scenario because as soon
> > > as I put sleep or injection-point in
> > > UpdateLogicalDecodingStatusEndOfRecovery(), I hit some ProcSignal
> > > Barriers issue i.e. it never completes even when sleep is over. I get
> > > this: 'LOG: still waiting for backend with PID 162838 to accept
> > > ProcSignalBarrier'.
> >
> > Thank you for the comment! I think you're right. That check should be
> > done after 'retry'. WIll incorporate the change in the next version
> > patch.
> >
>
> Thanks.
>

Thank you for reviewing the patch!

> 1)
>
> start_logical_decoding_status_change():
>
> + LWLockRelease(LogicalDecodingControlLock);
> +
> + /* Mark the state transition is in-progress */
> + LogicalDecodingCtl->transition_in_progress = true;
>
> I think we should set transition_in_progress before releasing lock,
> else it may hit a race condition between create and drop slot and can
> end up having a slot but with logical decoding disabled.

Ugh, you're right. It should be protected by the lwlock.

>
> 2)
> CheckLogicalDecodingRequirements has this change:
>
> - if (wal_level < WAL_LEVEL_LOGICAL)
> + if (wal_level < WAL_LEVEL_REPLICA)
>   ereport(ERROR,
>   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> - errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
> + errmsg("logical decoding requires \"wal_level\" >= \"replica\"")));
>
>
> But we already have same wal_level check  in CheckSlotRequirements:
>
>         if (wal_level < WAL_LEVEL_REPLICA)
>                 ereport(ERROR,
>
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                                  errmsg("replication slots can only be
> used if \"wal_level\" >= \"replica\"")));
>
> Thus the change in CheckLogicalDecodingRequirements for 'wal_level <
> WAL_LEVEL_REPLICA' will never be reached. Is it needed?

No, I agree to remove it.

I've attached the updated version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: v6-0001-Enable-logical-decoding-dynamically-based-on-logi.patch
Description: Binary data

Reply via email to