Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

shveta malik Mon, 12 Aug 2024 03:10:29 -0700

On Mon, Aug 12, 2024 at 3:36 PM shveta malik <shveta.ma...@gmail.com> wrote:
>
> On Fri, Aug 9, 2024 at 2:39 PM Hayato Kuroda (Fujitsu)
> <kuroda.hay...@fujitsu.com> wrote:
> >
> > Dear Amit, Shveta, Hou,
> >
> > Thanks for giving many comments! I've updated the patch.
> >
> > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
> > >   }
> > >   PG_CATCH();
> > >   {
> > > + /*
> > > + * Reset the origin data to prevent the advancement of origin progress
> > > + * if the transaction failed to apply.
> > > + */
> > > + replorigin_session_origin = InvalidRepOriginId;
> > > + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > > + replorigin_session_origin_timestamp = 0;
> > >
> > > Can't we call replorigin_reset() instead here?
> >
> > I didn't use the function because arguments of calling function looked 
> > strange,
> > but ideally I can. Fixed.
> >
> > > + /*
> > > + * Register a callback to reset the origin state before aborting the
> > > + * transaction in ShutdownPostgres(). This is to prevent the advancement
> > > + * of origin progress if the transaction failed to apply.
> > > + */
> > > + before_shmem_exit(replorigin_reset, (Datum) 0);
> > >
> > > I think we need this despite resetting the origin-related variables in
> > > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
> >
> > There are two reasons to add a shmem-exit callback. One is to support a 
> > FATAL,
> > another one is to support the case that user does the shutdown request while
> > applying changes. In this case, I think ShutdownPostgres() can be called so 
> > that
> > the session origin may advance.
>
> Agree that we need the 'reset' during shutdown flow as well. Details at [1]
>
> > However, I think we cannot use 
> > PG_ENSURE_ERROR_CLEANUP()/PG_END_ENSURE_ERROR_CLEANUP
> > macros here. According to codes, it assumes that any before-shmem callbacks 
> > are
> > not registered within the block because the cleanup function is registered 
> > and canceled
> > within the macro. LogicalRepApplyLoop() can register the function when
> > it handles COMMIT PREPARED message so it breaks the rule.
>
> Yes, on reanalyzing, we can not use PG_ENSURE_ERROR_CLEANUP in this
> flow due to the limitation of cancel_before_shmem_exit() that it can
> cancel only the last registered callback, while in our flow we have
> other callbacks also registered after we register our reset one.
>
> [1]
> Shutdown analysis:
>
> I did a test where we make apply worker wait for say 0sec right after


Correction here: 0sec -->10sec

> it updates 'replorigin_session_origin_lsn' in
> apply_handle_prepare_internal() (say it at code-point1). During this
> wait, we triggered a subscriber shutdown.Under normal circumstances,
> everything works fine: after the wait, the apply worker processes the
> SIGTERM (via LogicalRepApplyLoop-->ProcessInterrupts()) only after the
> prepare phase is complete, meaning the PREPARE LSN is flushed, and the
> origin LSN is correctly advanced in EndPrepare() before the worker
> shuts down. But, if we insert a LOG statement between code-point1 and
> EndPrepare(), the apply worker processes the SIGTERM during the LOG
> operation, as errfinish() triggers CHECK_FOR_INTERRUPTS at the end,
> which causes the origin LSN to be incorrectly advanced during
> shutdown. And thus the subsequent COMMIT PREPARED on the publisher
> results in ERROR on subscriber; as the 'PREPARE' is lost on the
> subscriber and is not resent by the publisher. ERROR:  prepared
> transaction with identifier "pg_gid_16403_757" does not exist
>
> A similar problem can also occur without introducing any additional
> LOG statements, but by simply setting log_min_messages=debug5. This
> causes the apply worker to output a few DEBUG messages upon receiving
> a shutdown signal (after code-point1) before it reaches EndPrepare().
> As a result, it ends up processing the SIGTERM (during logging)and
> invoking AbortOutOfAnyTransaction(), which incorrectly advances the
> origin LSN.
>
> thanks
> Shveta

Re: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

Reply via email to