On Tuesday, January 7, 2025 3:05 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
Hi, > On Mon, Jan 6, 2025 at 10:40 PM Zhijie Hou (Fujitsu) > <houzj.f...@fujitsu.com> wrote: > > > > On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada > <sawada.m...@gmail.com> wrote: > > > > Hi, > > > > > > > > On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu) > <houzj.f...@fujitsu.com> > > > wrote: > > > > > > > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada > > > <sawada.m...@gmail.com> wrote: > > > > > > > > > > > > > > I have one comment on the 0001 patch: > > > > > > > > Thanks for the comments! > > > > > > > > > > > > > > + /* > > > > > + * The changes made by this and later transactions are still > > > > > non-removable > > > > > + * to allow for the detection of update_deleted conflicts > > > > > + when > > > > > applying > > > > > + * changes in this logical replication worker. > > > > > + * > > > > > + * Note that this info cannot directly protect dead tuples > > > > > from > > > being > > > > > + * prematurely frozen or removed. The logical replication > launcher > > > > > + * asynchronously collects this info to determine whether to > > > > > + advance > > > > > the > > > > > + * xmin value of the replication slot. > > > > > + * > > > > > + * Therefore, FullTransactionId that includes both the > > > > > transaction ID and > > > > > + * its epoch is used here instead of a single Transaction ID. > This is > > > > > + * critical because without considering the epoch, the > transaction > > > ID > > > > > + * alone may appear as if it is in the future due to > > > > > transaction > ID > > > > > + * wraparound. > > > > > + */ > > > > > + FullTransactionId oldest_nonremovable_xid; > > > > > > > > > > The last paragraph of the comment mentions that we need to use > > > > > FullTransactionId to properly compare XIDs even after the XID > > > > > wraparound happens. But once we set the oldest-nonremovable-xid it > > > > > prevents XIDs from being wraparound, no? I mean that workers' > > > > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its > > > > > xmin) are never away from more than 2^31 XIDs. > > > > > > > > I think the issue is that the launcher may create the replication slot > > > > after the apply worker has already set the 'oldest_nonremovable_xid' > > > > because the launcher are doing that asynchronously. So, Before the > > > > slot is created, there's a window where transaction IDs might wrap > > > > around. If initially the apply worker has computed a candidate_xid > > > > (755) and the xid wraparound before the launcher creates the slot, > > > > causing the new current xid to be (740), then the old > > > > candidate_xid(755) looks like a xid in the future, and the launcher > > > > could advance the xmin to 755 which cause the dead tuples to be > removed > > > prematurely. > > > > (We are trying to reproduce this to ensure that it's a real issue and > > > > will share after finishing) > > > > > > The slot's first xmin is calculated by > > > GetOldestSafeDecodingTransactionId(false). The initial computed > > > cancidate_xid could be newer than this xid? > > > > I think the issue occurs when the slot is created after an XID wraparound. > > As > a > > result, GetOldestSafeDecodingTransactionId() returns the current XID > > (after wraparound), which appears older than the computed candidate_xid > (e.g., > > oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the > > issue in [1]. What do you think ? > > I agree that the scenario Nisha shared could happen with the current > patch. On the other hand, I think that if slot's initial xmin is > always newer than or equal to the initial computed non-removable-xid > (i.e., the oldest of workers' oldest_nonremovable_xid values), we can > always use slot's first xmin. And I think it might be true while I'm > concerned the fact that worker's oldest_nonremoable_xid and the slot's > initial xmin is calculated differently (GetOldestActiveTransactionId() > and GetOldestSafeDecodingTransactionId(), respectively). That way, > subsequent comparisons between slot's xmin and computed candidate_xid > won't need to take care of the epoch. IOW, the worker's > non-removable-xid values effectively are not used until the slot is > created. I might be missing something, so could you please elaborate a bit more on this idea? Initially, I thought you meant delaying the initialization of slot.xmin until after the worker computes the oldest_nonremovable_xid. However, I think the same issue would occur with this approach as well [1], with the difference being that the slot would directly use a future XID as xmin, which seems inappropriate to me. Or do you mean opposite that we delay the initialization of oldest_nonremovable_xid after the creation of the slot ? [1] > So, Before the slot is created, there's a window where transaction IDs might > wrap around. If initially the apply worker has computed a candidate_xid (755) > and the xid wraparound before the launcher creates the slot, causing the new > current xid to be (740), then the old candidate_xid(755) looks like a xid in > the future, and the launcher could advance the xmin to 755 which cause the > dead tuples to be removed prematurely. Best Regards, Hou zj