On Thu, Aug 8, 2024 at 6:08 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Thu, Aug 8, 2024 at 3:41 PM Zhijie Hou (Fujitsu) > <houzj.f...@fujitsu.com> wrote: > > > > On Thursday, August 8, 2024 6:01 PM shveta malik <shveta.ma...@gmail.com> > > wrote: > > > > > > On Thu, Aug 8, 2024 at 12:03 PM Amit Kapila <amit.kapil...@gmail.com> > > > wrote: > > > > > > > > On Thu, Aug 8, 2024 at 10:37 AM Hayato Kuroda (Fujitsu) > > > > <kuroda.hay...@fujitsu.com> wrote: > > > > > > > > > ... > > > > > > > > > > An easiest fix is to reset session replication origin before calling > > > > > the RecordTransactionAbort(). I think this can happen when 1) > > > > > LogicalRepApplyLoop() raises an ERROR or 2) apply worker exits. > > > Attached patch can fix the issue on HEAD. > > > > > > > > > > > > > Few comments: > > > > ============= > > > > * > > > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos) > > > > } > > > > PG_CATCH(); > > > > { > > > > + /* > > > > + * Reset the origin data to prevent the advancement of origin > > > > + progress > > > > + * if the transaction failed to apply. > > > > + */ > > > > + replorigin_session_origin = InvalidRepOriginId; > > > > + replorigin_session_origin_lsn = InvalidXLogRecPtr; > > > > + replorigin_session_origin_timestamp = 0; > > > > > > > > Can't we call replorigin_reset() instead here? > > > > > > > > * > > > > + /* > > > > + * Register a callback to reset the origin state before aborting the > > > > + * transaction in ShutdownPostgres(). This is to prevent the > > > > + advancement > > > > + * of origin progress if the transaction failed to apply. > > > > + */ > > > > + before_shmem_exit(replorigin_reset, (Datum) 0); > > > > > > > > I think we need this despite resetting the origin-related variables in > > > > PG_CATCH block to handle FATAL error cases, right? If so, can we use > > > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()? > > > > > > +1 > > > > > > Basic tests work fine on this patch. Just thinking out loud, > > > SetupApplyOrSyncWorker() is called for table-sync worker as well and IIUC > > > tablesync worker does not deal with 2PC txns. So do we even need to > > > register > > > replorigin_reset() for tablesync worker as well? If we may hit such an > > > issue in > > > general, then perhaps we need it in table-sync worker otherwise not. It > > > needs some investigation. Thoughts? > > > > I think this is a general issue that can occur not only due to 2PC. IIUC, > > this > > problem should arise if any ERROR arises after setting the > > replorigin_session_origin_lsn but before the CommitTransactionCommand is > > completed. If so, I think we should register it for tablesync worker as > > well. > > > > As pointed out earlier, won't using PG_ENSURE_ERROR_CLEANUP() instead > of PG_CATCH() be enough?
Yes, I think it should suffice. IIUC, we are going to change 'replorigin_session_origin_lsn' only in start_apply() and not before that, and thus ensuring its reset during any ERROR or FATAL in start_apply() is good enough. And I guess we don't want this origin-reset to be called during regular shutdown, isn't it? But registering it through before_shmem_exit() will make the reset-function to be called during normal shutdown as well. And to answer my previous question (as Hou-San also pointed out), we do need it in table-sync worker as well. So a change in start_apply will make sure the fix is valid for both apply and tablesync worker. thanks Shveta