Re: Assertion failure in SnapBuildInitialSnapshot()

Masahiko Sawada Tue, 25 Nov 2025 10:58:19 -0800

On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada <[email protected]> 
> wrote:
> >
> > On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada
> > <[email protected]> wrote:
> > >
> > > On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <[email protected]>
> > wrote:
> > > >
> > > > On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> > > > > >
> > > > >
> > > > > I have been thinking if there a way to avoid holding
> > > > > ReplicationSlotControlLock exclusively in
> > > > > ReplicationSlotsComputeRequiredXmin() because that could cause lock
> > contention when many slots exist and advancements occur frequently.
> > > > >
> > > > > Given that the bug arises from a race condition between slot
> > > > > creation and concurrent slot xmin computation, I think another way
> > > > > is that, we acquire the ReplicationSlotControlLock exclusively
> > > > > only during slot creation to do the initial update of the slot
> > > > > xmin. In ReplicationSlotsComputeRequiredXmin(), we still hold the
> > > > > ReplicationSlotControlLock in shared mode until the global slot
> > > > > xmin is updated in ProcArraySetReplicationSlotXmin(). This
> > > > > approach prevents concurrent computations and updates of new xmin
> > > > > horizons by other backends during the initial slot xmin update 
> > > > > process,
> > while it still permits concurrent calls to
> > ReplicationSlotsComputeRequiredXmin().
> > > > >
> > > >
> > > > Yeah, this seems to work.
> > >
> > > +1
> >
> > Given that the computation of xmin and catalog_xmin among all slots could
> > be executed concurrently, could the following scenario happen where
> > procArray->replication_slot_xmin and
> > procArray->replication_slot_catalog_xmin are retreat to a non-invalid
> > XID?
> >
> > 1. Suppose the initial value procArray->replication_slot_catalog_xmin is 50.
> > 2. Process-A updates its owned slot's catalog_xmin to 100, and computes the
> > new catalog_xmin as 100 while holding ReplicationSlotControlLock in a shared
> > mode in ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
> > procArray's catalog_xmin value yet.
> > 3. Process-B updates its owned slot's catalog_xmin to 150, and computes the
> > new catalog_xmin as 150.
> > 4. Process-B updates the procArray->replication_slot_catalog_xmin to 150.
> > 5. Process-A updates the procArray->repilcation_slot_catalog_xmin to 100,
> > which was 150.
>
> After further investigation, I think that steps 3 and 4 cannot occur because
> Process-B must have already encountered the catalog_xmin maintained by
> Process-A, either 50 or 100. Consequently, Process-B will refrain from 
> updating
> the catalog_xmin to a more recent value, such as 150.


Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes the
new catalog_xmin as 100 because process-B slot still has
effective_catalog_xmin = 100.
3. Process-B updates effective_catalog_xmin to 150, and computes the
new catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates procArray->replication_slot_catalog_xmin to 100.

>
> >
> > It might be worth adding an assertion to ProcArraySetReplicationSlotXmin(),
> > checking if the new xmin and catalog_xmin values are either >= the current
> > values or an InvalidTransactionId.
>
> I considered this scenario and identified a potential exception in the
> copy_replication_slot(). This function uses a two-phase copy process, the
> original restart_lsn is directly copied to the new slot during the first 
> phase.
> However, the original slot.restart_lsn might advance between phases.
> Consequently, the newly created slot initially uses the outdated restart_lsn,
> which could cause the procArray->replication_slot_catalog_xmin to retreat. I
> think this behavior isn't harmful, as explained in the comments, because the 
> new
> restart_lsn will be updated in the created slot during the second phase.

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Assertion failure in SnapBuildInitialSnapshot()

Reply via email to