On Tuesday, January 7, 2025 3:05 PM Masahiko Sawada <sawada.m...@gmail.com> 
wrote:

Hi,

> On Mon, Jan 6, 2025 at 10:40 PM Zhijie Hou (Fujitsu)
> <houzj.f...@fujitsu.com> wrote:
> >
> > On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada
> <sawada.m...@gmail.com> wrote:
> >
> > Hi,
> >
> > >
> > > On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu)
> <houzj.f...@fujitsu.com>
> > > wrote:
> > > >
> > > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> > > <sawada.m...@gmail.com> wrote:
> > > >
> > > > >
> > > > > I have one comment on the 0001 patch:
> > > >
> > > > Thanks for the comments!
> > > >
> > > > >
> > > > > +       /*
> > > > > +        * The changes made by this and later transactions are still
> > > > > non-removable
> > > > > +        * to allow for the detection of update_deleted conflicts
> > > > > + when
> > > > > applying
> > > > > +        * changes in this logical replication worker.
> > > > > +        *
> > > > > +        * Note that this info cannot directly protect dead tuples 
> > > > > from
> > > being
> > > > > +        * prematurely frozen or removed. The logical replication
> launcher
> > > > > +        * asynchronously collects this info to determine whether to
> > > > > + advance
> > > > > the
> > > > > +        * xmin value of the replication slot.
> > > > > +        *
> > > > > +        * Therefore, FullTransactionId that includes both the
> > > > > transaction ID and
> > > > > +        * its epoch is used here instead of a single Transaction ID.
> This is
> > > > > +        * critical because without considering the epoch, the
> transaction
> > > ID
> > > > > +        * alone may appear as if it is in the future due to 
> > > > > transaction
> ID
> > > > > +        * wraparound.
> > > > > +        */
> > > > > +       FullTransactionId oldest_nonremovable_xid;
> > > > >
> > > > > The last paragraph of the comment mentions that we need to use
> > > > > FullTransactionId to properly compare XIDs even after the XID
> > > > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > > > prevents XIDs from being wraparound, no? I mean that workers'
> > > > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > > > xmin) are never away from more than 2^31 XIDs.
> > > >
> > > > I think the issue is that the launcher may create the replication slot
> > > > after the apply worker has already set the 'oldest_nonremovable_xid'
> > > > because the launcher are doing that asynchronously. So, Before the
> > > > slot is created, there's a window where transaction IDs might wrap
> > > > around. If initially the apply worker has computed a candidate_xid
> > > > (755) and the xid wraparound before the launcher creates the slot,
> > > > causing the new current xid to be (740), then the old
> > > > candidate_xid(755) looks like a xid in the future, and the launcher
> > > > could advance the xmin to 755 which cause the dead tuples to be
> removed
> > > prematurely.
> > > > (We are trying to reproduce this to ensure that it's a real issue and
> > > > will share after finishing)
> > >
> > > The slot's first xmin is calculated by
> > > GetOldestSafeDecodingTransactionId(false). The initial computed
> > > cancidate_xid could be newer than this xid?
> >
> > I think the issue occurs when the slot is created after an XID wraparound. 
> > As
> a
> > result, GetOldestSafeDecodingTransactionId() returns the current XID
> > (after wraparound), which appears older than the computed candidate_xid
> (e.g.,
> > oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
> > issue in [1]. What do you think ?
> 
> I agree that the scenario Nisha shared could happen with the current
> patch. On the other hand, I think that if slot's initial xmin is
> always newer than or equal to the initial computed non-removable-xid
> (i.e., the oldest of workers' oldest_nonremovable_xid values), we can
> always use slot's first xmin. And I think it might be true while I'm
> concerned the fact that worker's oldest_nonremoable_xid and the slot's
> initial xmin is calculated differently (GetOldestActiveTransactionId()
> and GetOldestSafeDecodingTransactionId(), respectively). That way,
> subsequent comparisons between slot's xmin and computed candidate_xid
> won't need to take care of the epoch. IOW, the worker's
> non-removable-xid values effectively are not used until the slot is
> created.

I might be missing something, so could you please elaborate a bit more on this
idea?

Initially, I thought you meant delaying the initialization of slot.xmin until
after the worker computes the oldest_nonremovable_xid. However, I think the
same issue would occur with this approach as well [1], with the difference
being that the slot would directly use a future XID as xmin, which seems
inappropriate to me.

Or do you mean opposite that we delay the initialization of
oldest_nonremovable_xid after the creation of the slot ?

[1]
> So, Before the slot is created, there's a window where transaction IDs might
> wrap around. If initially the apply worker has computed a candidate_xid (755)
> and the xid wraparound before the launcher creates the slot, causing the new
> current xid to be (740), then the old candidate_xid(755) looks like a xid in
> the future, and the launcher could advance the xmin to 755 which cause the
> dead tuples to be removed prematurely.

Best Regards,
Hou zj

Reply via email to