On Mon, Aug 12, 2024 at 3:36 PM shveta malik <shveta.ma...@gmail.com> wrote: > > On Fri, Aug 9, 2024 at 2:39 PM Hayato Kuroda (Fujitsu) > <kuroda.hay...@fujitsu.com> wrote: > > > > Dear Amit, Shveta, Hou, > > > > Thanks for giving many comments! I've updated the patch. > > > > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos) > > > } > > > PG_CATCH(); > > > { > > > + /* > > > + * Reset the origin data to prevent the advancement of origin progress > > > + * if the transaction failed to apply. > > > + */ > > > + replorigin_session_origin = InvalidRepOriginId; > > > + replorigin_session_origin_lsn = InvalidXLogRecPtr; > > > + replorigin_session_origin_timestamp = 0; > > > > > > Can't we call replorigin_reset() instead here? > > > > I didn't use the function because arguments of calling function looked > > strange, > > but ideally I can. Fixed. > > > > > + /* > > > + * Register a callback to reset the origin state before aborting the > > > + * transaction in ShutdownPostgres(). This is to prevent the advancement > > > + * of origin progress if the transaction failed to apply. > > > + */ > > > + before_shmem_exit(replorigin_reset, (Datum) 0); > > > > > > I think we need this despite resetting the origin-related variables in > > > PG_CATCH block to handle FATAL error cases, right? If so, can we use > > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()? > > > > There are two reasons to add a shmem-exit callback. One is to support a > > FATAL, > > another one is to support the case that user does the shutdown request while > > applying changes. In this case, I think ShutdownPostgres() can be called so > > that > > the session origin may advance. > > Agree that we need the 'reset' during shutdown flow as well. Details at [1] > > > However, I think we cannot use > > PG_ENSURE_ERROR_CLEANUP()/PG_END_ENSURE_ERROR_CLEANUP > > macros here. According to codes, it assumes that any before-shmem callbacks > > are > > not registered within the block because the cleanup function is registered > > and canceled > > within the macro. LogicalRepApplyLoop() can register the function when > > it handles COMMIT PREPARED message so it breaks the rule. > > Yes, on reanalyzing, we can not use PG_ENSURE_ERROR_CLEANUP in this > flow due to the limitation of cancel_before_shmem_exit() that it can > cancel only the last registered callback, while in our flow we have > other callbacks also registered after we register our reset one. > > [1] > Shutdown analysis: > > I did a test where we make apply worker wait for say 0sec right after
Correction here: 0sec -->10sec > it updates 'replorigin_session_origin_lsn' in > apply_handle_prepare_internal() (say it at code-point1). During this > wait, we triggered a subscriber shutdown.Under normal circumstances, > everything works fine: after the wait, the apply worker processes the > SIGTERM (via LogicalRepApplyLoop-->ProcessInterrupts()) only after the > prepare phase is complete, meaning the PREPARE LSN is flushed, and the > origin LSN is correctly advanced in EndPrepare() before the worker > shuts down. But, if we insert a LOG statement between code-point1 and > EndPrepare(), the apply worker processes the SIGTERM during the LOG > operation, as errfinish() triggers CHECK_FOR_INTERRUPTS at the end, > which causes the origin LSN to be incorrectly advanced during > shutdown. And thus the subsequent COMMIT PREPARED on the publisher > results in ERROR on subscriber; as the 'PREPARE' is lost on the > subscriber and is not resent by the publisher. ERROR: prepared > transaction with identifier "pg_gid_16403_757" does not exist > > A similar problem can also occur without introducing any additional > LOG statements, but by simply setting log_min_messages=debug5. This > causes the apply worker to output a few DEBUG messages upon receiving > a shutdown signal (after code-point1) before it reaches EndPrepare(). > As a result, it ends up processing the SIGTERM (during logging)and > invoking AbortOutOfAnyTransaction(), which incorrectly advances the > origin LSN. > > thanks > Shveta