On Thu, Aug 8, 2024 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> wrote: > > On Thursday, August 8, 2024 6:01 PM shveta malik <shveta.ma...@gmail.com> > wrote: > > > > On Thu, Aug 8, 2024 at 12:03 PM Amit Kapila <amit.kapil...@gmail.com> > > wrote: > > > > > > On Thu, Aug 8, 2024 at 10:37 AM Hayato Kuroda (Fujitsu) > > > <kuroda.hay...@fujitsu.com> wrote: > > > > > > > ... > > > > > > > > An easiest fix is to reset session replication origin before calling > > > > the RecordTransactionAbort(). I think this can happen when 1) > > > > LogicalRepApplyLoop() raises an ERROR or 2) apply worker exits. > > Attached patch can fix the issue on HEAD. > > > > > > > > > > Few comments: > > > ============= > > > * > > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos) > > > } > > > PG_CATCH(); > > > { > > > + /* > > > + * Reset the origin data to prevent the advancement of origin > > > + progress > > > + * if the transaction failed to apply. > > > + */ > > > + replorigin_session_origin = InvalidRepOriginId; > > > + replorigin_session_origin_lsn = InvalidXLogRecPtr; > > > + replorigin_session_origin_timestamp = 0; > > > > > > Can't we call replorigin_reset() instead here? > > > > > > * > > > + /* > > > + * Register a callback to reset the origin state before aborting the > > > + * transaction in ShutdownPostgres(). This is to prevent the > > > + advancement > > > + * of origin progress if the transaction failed to apply. > > > + */ > > > + before_shmem_exit(replorigin_reset, (Datum) 0); > > > > > > I think we need this despite resetting the origin-related variables in > > > PG_CATCH block to handle FATAL error cases, right? If so, can we use > > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()? > > > > +1 > > > > Basic tests work fine on this patch. Just thinking out loud, > > SetupApplyOrSyncWorker() is called for table-sync worker as well and IIUC > > tablesync worker does not deal with 2PC txns. So do we even need to register > > replorigin_reset() for tablesync worker as well? If we may hit such an > > issue in > > general, then perhaps we need it in table-sync worker otherwise not. It > > needs some investigation. Thoughts? > > I think this is a general issue that can occur not only due to 2PC. IIUC, this > problem should arise if any ERROR arises after setting the > replorigin_session_origin_lsn but before the CommitTransactionCommand is > completed. If so, I think we should register it for tablesync worker as well. >
As pointed out earlier, won't using PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH() be enough? -- With Regards, Amit Kapila.