RE: Fix slot synchronization with two_phase decoding enabled

Zhijie Hou (Fujitsu) Thu, 08 May 2025 19:13:46 -0700

On Thu, May 8, 2025 at 6:04 PM Zhijie Hou (Fujitsu) wrote:
> 
> On Tue, May 6, 2025 at 7:22 PM Zhijie Hou (Fujitsu) wrote:
> 
> >
> > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> > >
> > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada <[email protected]> wrote:
> > > >
> > > > While I cannot be entirely certain of my analysis, I believe the 
> > > > root cause might be related to the backward movement of the 
> > > > confirmed_flush LSN. The following scenario seems possible:
> > > >
> > > > 1. The walsender enables the two_phase and sets two_phase_at 
> > > > (which should be the same as confirmed_flush).
> > > > 2. The slot's confirmed_flush regresses for some reason.
> > > > 3. The slotsync worker retrieves the remote slot information and 
> > > > enables two_phase for the local slot.
> > > >
> > >
> > > Yes, this is possible. Here is my theory as to how it can happen 
> > > in the current case. In the failed test, after the primary has 
> > > prepared a transaction, the transaction won't be replicated to the 
> > > subscriber as two_phase was not enabled for the slot. However, 
> > > subsequent keepalive messages can send the latest WAL location to 
> > > the subscriber and get the confirmation of the same from the 
> > > subscriber without its origin being moved. Now, after we restart 
> > > the apply worker (due to disable/enable for a subscription), it 
> > > will use the previous origin_lsn to temporarily move back the 
> > > confirmed flush LSN as explained in one of the previous emails in another 
> > > thread [1].
> > > During this temporary movement of confirm flush LSN, the slotsync 
> > > worker fetches the two_phase_at and confirm_flush_lsn values, 
> > > leading to the assertion failure. We see this issue intermittently 
> > > because it depends on the timing of the slotsync worker's request
> > > to fetch the slot's value.
> >
> > Based on this theory, I can reproduce the BF failure in the 040 
> > tap-test on HEAD after applying the 0001 patch. This is achieved by 
> > using the injection point to stop the walsender from sending a 
> > keepalive before receiving the old origin position from the apply 
> > worker, ensuring the confirmed_flush consistently moves backward 
> > before slotsync.
> >
> > Additionally, I've reproduced the duplicate data issue on HEAD 
> > without slotsync using the attached script (after applying the injection 
> > point patch).
> > This issue arises if we immediately disable the subscription after 
> > the confirm_flush_lsn moves backward, preventing the walsender from 
> > advancing the confirm_flush_lsn.
> >
> > In this case, if a prepared transaction exists before two_phase_at, 
> > then after re-enabling the subscription, it will replicate that 
> > prepared transaction when decoding the PREPARE record and replicate 
> > that again when decoding the COMMIT PREPARED record. In such cases, 
> > the apply worker keeps reporting the error:
> >
> > ERROR: transaction identifier "pg_gid_16387_755" is already in use.
> >
> > Apart from the above, we're investigating whether the same issue can 
> > occur in back-branches and will share the results once ready.
> 
> I reproduced the duplicate data issue on PG17 as well using the 
> attached shell script. Since PG17 doesn't allow altering the two_phase 
> option, I created a subscription with two_phase=on and copy_data=on. I 
> prepared a transaction before the table synchronization was ready, at 
> a time when the slot's two_phase hadn't been set to true. This setup 
> can result in the prepared transaction being replicated twice once the 
> apply worker restarts and confirmed_flush_lsn moves backward.
> 
> To ensure the origin position is initialized during table sync, I 
> inserted some data before the prepared transaction. I added injection 
> points (0001) to control the table sync worker's progress, allowing the 
> apply worker to replicate some changes and update the origin position
> while table sync was ongoing.


The above reproduction of the issue indicates that it has been present since at
least PG15, when the two_phase subscription option was introduced. I am
currently investigating whether the issue also occurs without the two_phase
option. If it does, the fix will need to be applied to all supported branches.
I will share the results once they are available.
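
For readers following the thread, the duplicate replication described above can
be illustrated with a small toy model. This is only a sketch under stated
assumptions, not PostgreSQL code: the Slot class, the decode() function, the
integer LSNs and the two skip rules are simplifications invented for
illustration. It assumes that a PREPARE is sent at prepare time when it is not
behind confirmed_flush, and that a transaction prepared before two_phase_at is
sent again in full at COMMIT PREPARED time.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    # Hypothetical, simplified stand-in for a logical replication slot.
    confirmed_flush: int  # LSN from which decoding output is sent
    two_phase_at: int     # LSN at which two_phase decoding was enabled
    two_phase: bool = True


def decode(slot, wal):
    """Return the messages a toy decoder would send for `wal`.

    `wal` is a list of (lsn, kind, gid) tuples, where kind is either
    "PREPARE" or "COMMIT PREPARED". The rules below are assumptions
    made for this sketch, not the server's actual logic.
    """
    sent = []
    prepare_lsn = {}
    for lsn, kind, gid in wal:
        if kind == "PREPARE":
            prepare_lsn[gid] = lsn
            # Assumed rule 1: send the PREPARE right away unless it lies
            # before the point the subscriber has already confirmed.
            if slot.two_phase and lsn >= slot.confirmed_flush:
                sent.append(("PREPARE", gid))
        elif kind == "COMMIT PREPARED":
            if lsn < slot.confirmed_flush:
                continue
            # Assumed rule 2: a transaction prepared before two_phase_at is
            # presumed not to have been sent yet, so it is sent in full now.
            if prepare_lsn[gid] < slot.two_phase_at:
                sent.append(("PREPARE", gid))
            sent.append(("COMMIT PREPARED", gid))
    return sent


wal = [(100, "PREPARE", "gid1"), (300, "COMMIT PREPARED", "gid1")]

# Normal case: confirmed_flush has not moved behind two_phase_at.
normal = decode(Slot(confirmed_flush=200, two_phase_at=200), wal)

# Problem case: confirmed_flush has moved backward, before the PREPARE.
regressed = decode(Slot(confirmed_flush=50, two_phase_at=200), wal)
```

In this model the normal case sends the prepared transaction once, while the
regressed case sends it twice (once at PREPARE and once at COMMIT PREPARED),
which corresponds to the "already in use" error reported by the apply worker.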

Best Regards,
Hou zj
