On Thursday, August 8, 2024 6:01 PM shveta malik <shveta.ma...@gmail.com> wrote: > > On Thu, Aug 8, 2024 at 12:03 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > > > > On Thu, Aug 8, 2024 at 10:37 AM Hayato Kuroda (Fujitsu) > > <kuroda.hay...@fujitsu.com> wrote: > > > > > ... > > > > > > An easiest fix is to reset session replication origin before calling > > > the RecordTransactionAbort(). I think this can happen when 1) > > > LogicalRepApplyLoop() raises an ERROR or 2) apply worker exits. > Attached patch can fix the issue on HEAD. > > > > > > > Few comments: > > ============= > > * > > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos) > > } > > PG_CATCH(); > > { > > + /* > > + * Reset the origin data to prevent the advancement of origin > > + progress > > + * if the transaction failed to apply. > > + */ > > + replorigin_session_origin = InvalidRepOriginId; > > + replorigin_session_origin_lsn = InvalidXLogRecPtr; > > + replorigin_session_origin_timestamp = 0; > > > > Can't we call replorigin_reset() instead here? > > > > * > > + /* > > + * Register a callback to reset the origin state before aborting the > > + * transaction in ShutdownPostgres(). This is to prevent the > > + advancement > > + * of origin progress if the transaction failed to apply. > > + */ > > + before_shmem_exit(replorigin_reset, (Datum) 0); > > > > I think we need this despite resetting the origin-related variables in > > PG_CATCH block to handle FATAL error cases, right? If so, can we use > > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()? > > +1 > > Basic tests work fine on this patch. Just thinking out loud, > SetupApplyOrSyncWorker() is called for table-sync worker as well and IIUC > tablesync worker does not deal with 2PC txns. So do we even need to register > replorigin_reset() for tablesync worker as well? If we may hit such an issue > in > general, then perhaps we need it in table-sync worker otherwise not. It > needs some investigation. Thoughts?
I think this is a general issue that can occur not only due to 2PC. IIUC, this problem should arise if any ERROR arises after setting the replorigin_session_origin_lsn but before the CommitTransactionCommand is completed. If so, I think we should register it for tablesync worker as well. Best Regards, Hou zj