Re: Conflict detection for update_deleted in logical replication

shveta malik Fri, 16 May 2025 03:13:39 -0700

On Fri, May 16, 2025 at 12:17 PM Amit Kapila <[email protected]> wrote:
>
> On Fri, Apr 25, 2025 at 10:08 AM shveta malik <[email protected]> wrote:
> >
> > On Thu, Apr 24, 2025 at 6:11 PM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> >
> > > > Few comments for patch004:
> > > > Config.sgml:
> > > > 1)
> > > > +       <para>
> > > > +        Maximum duration (in milliseconds) for which conflict
> > > > +        information can be retained for conflict detection by the apply worker.
> > > > +        The default value is <literal>0</literal>, indicating that conflict
> > > > +        information is retained until it is no longer needed for detection
> > > > +        purposes.
> > > > +       </para>
> > > >
> > > > IIUC, the above is not entirely accurate. Suppose the subscriber manages to
> > > > catch up and sets oldest_nonremovable_xid to 100, which is then updated in
> > > > slot. After this, the apply worker takes a nap and begins a new xid update cycle.
> > > > Now, let’s say the next candidate_xid is 200, but this time the subscriber fails
> > > > to keep up and exceeds max_conflict_retention_duration. As a result, it sets
> > > > oldest_nonremovable_xid to InvalidTransactionId, and the launcher skips
> > > > updating the slot’s xmin.
> > >
> > > If the time exceeds the max_conflict_retention_duration, the launcher would
> > > invalidate the slot, instead of skipping updating it. So the conflict info
> > > (e.g., dead tuples) would not be retained anymore.
> > >
> >
> > The launcher will not invalidate the slot until all subscriptions have
> > stopped conflict_info retention. So the dead-tuple info for a
> > particular oldest_xmin of a particular apply worker could be retained
> > for much longer than the configured duration. If other apply workers
> > are actively working (catching up with the primary), they should keep
> > advancing the xmin of the shared slot. But if the xmin of the shared
> > slot stays the same for, say, 15min+15min+15min across 3 apply workers
> > (assuming they mark themselves with stop_conflict_retention one after
> > another and the slot's xmin has not been advanced), then the first apply
> > worker to mark itself with stop_conflict_retention still has access to
> > the oldest_xmin's data for 45 mins instead of 15 mins (where
> > max_conflict_retention_duration=15 mins). Please let me know if my
> > understanding is wrong.
> >
>
> IIUC, the current code will stop updating the slot even if one of the
> apply workers has set stop_conflict_info_retention. The other apply
> workers will keep on maintaining their oldest_nonremovable_xid without
> advancing the slot. If this is correct, then what behavior do we
> expect here instead?


I think this is not the current behaviour.

> Do we want the slot to keep advancing as long as any worker is
> actively maintaining oldest_nonremovable_xid?

In fact, this is the current behaviour of the v30 patch.
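
To make that concrete, here is a toy model in Python (not code from the
patch). It assumes the launcher sets the shared slot's xmin to the minimum
oldest_nonremovable_xid among the workers that are still retaining conflict
info, never moves xmin backwards, and invalidates the slot only once every
worker has stopped retaining. The names ApplyWorker, compute_slot_xmin and
INVALID_XID are made up for illustration, and XID wraparound is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-in for InvalidTransactionId (assumption: a plain integer sentinel).
INVALID_XID = 0


@dataclass
class ApplyWorker:
    """Hypothetical per-subscription state, reduced to what the model needs."""
    name: str
    oldest_nonremovable_xid: int  # INVALID_XID once the worker stops retention


def compute_slot_xmin(workers: List[ApplyWorker],
                      current_xmin: int) -> Optional[int]:
    """Return the new xmin for the shared slot, or None to invalidate it.

    Assumed rule: only workers that still retain conflict info take part,
    and the slot is invalidated only when none of them do.
    """
    active = [w.oldest_nonremovable_xid
              for w in workers
              if w.oldest_nonremovable_xid != INVALID_XID]
    if not active:
        return None  # every worker has stopped retention
    # Assumption: xmin only moves forward (wraparound ignored in this sketch).
    return max(current_xmin, min(active))
```

With three workers all at xid 100, marking them stopped one after another
leaves the slot's xmin at 100 until the last one stops, which is the
15min+15min+15min case discussed earlier in the thread.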

> To some extent, this
> matches with the cases where the user has set retain_conflict_info for
> some subscriptions but not for others.
>
> If so, how will users eventually know for which tables they can expect
> to reliably detect update_deleted? One possibility is that users can
> check which apply workers have stopped maintaining
> oldest_nonremovable_xid via the pg_stat_subscription view and then see the
> tables corresponding to those subscriptions.

Yes, it is a possibility, but I feel it will be too much to monitor
from the user's perspective.

> Also, what will we do as
> part of the resolutions in the apply workers where
> stop_conflict_info_retention is set? Shall we simply LOG that we can't
> resolve this conflict and continue till the user takes some action, or
> simply error out in such cases?

We can LOG. Erroring out will again prevent the subscriber from
proceeding, and the subscriber reached this state in the first place
because it fell behind, which led to stop_conflict_retention=true. Even
if we go with erroring out, I am not sure what action users could
take in this situation. The subscriber is still lagging, and if the user
recreates the slot as a solution, the apply worker will soon go to the
'stop_conflict_retention=true' state again, provided the subscriber is
still not able to catch up.
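
A rough sketch of the difference between the two options, again as a toy
Python model and not patch code. The function run_apply_worker, the
needs_dead_tuple_info flag and the log message text are all hypothetical;
the only point is that with "error" the worker stops at the first change it
cannot check, while with "log" it keeps applying.

```python
from typing import List, Tuple


def run_apply_worker(changes: List[Tuple[int, bool]],
                     retention_stopped: bool,
                     on_unresolvable: str) -> Tuple[int, List[str]]:
    """Apply changes in order and return (changes applied, log lines).

    changes: (lsn, needs_dead_tuple_info) pairs; the flag marks a change
             whose conflict check would need retained dead tuples.
    on_unresolvable: "log" to report and continue, "error" to stop.
    """
    applied = 0
    log: List[str] = []
    for lsn, needs_dead_tuple_info in changes:
        if needs_dead_tuple_info and retention_stopped:
            if on_unresolvable == "error":
                log.append(f"ERROR: cannot detect update_deleted at {lsn}")
                # The worker stops here; after a restart it would reach the
                # same change again, so the lag does not shrink.
                break
            log.append(f"LOG: cannot detect update_deleted at {lsn}")
        applied += 1
    return applied, log
```

For changes [(1, False), (2, True), (3, False)] with retention stopped, the
"log" policy applies all three changes and the "error" policy applies only
the first.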

thanks
Shveta
