On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.f...@fujitsu.com> wrote:
>
> On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> >
> > Yes, this is possible. Here is my theory as to how it can happen in the
> > current case. In the failed test, after the primary has prepared a
> > transaction, the transaction won't be replicated to the subscriber as
> > two_phase was not enabled for the slot. However, subsequent keepalive
> > messages can send the latest WAL location to the subscriber and get the
> > confirmation of the same from the subscriber without its origin being
> > moved. Now, after we restart the apply worker (due to disable/enable for
> > a subscription), it will use the previous origin_lsn to temporarily move
> > back the confirmed flush LSN as explained in one of the previous emails
> > in another thread [1]. During this temporary movement of confirm flush
> > LSN, the slotsync worker fetches the two_phase_at and confirm_flush_lsn
> > values, leading to the assertion failure. We see this issue
> > intermittently because it depends on the timing of slotsync worker's
> > request to fetch the slot's value.
>
> Based on this theory, I can reproduce the BF failure in the 040 tap-test on
> HEAD after applying the 0001 patch. This is achieved by using the injection
> point to stop the walsender from sending a keepalive before receiving the old
> origin position from the apply worker, ensuring the confirmed_flush
> consistently moves backward before slotsync.
>
> Additionally, I've reproduced the duplicate data issue on HEAD without
> slotsync using the attached script (after applying the injection point
> patch). This issue arises if we immediately disable the subscription after
> the confirm_flush_lsn moves backward, preventing the walsender from
> advancing the confirm_flush_lsn.
>

Script contents:

psql -d postgres -p $port_primary -c "create extension injection_points;SELECT injection_points_attach('process-replies', 'wait');"
psql -d postgres -p $port_subscriber -c "alter subscription sub set (two_phase =on); alter subscription sub enable ;"
sleep 1
psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

I think what you said in the above paragraph is happening here. How can the
walsender move the confirm_flush_lsn backward while it is waiting at the
injection point? I think I am missing something here. It would be good if you
could add a few comments to your scripts.

-- 
With Regards,
Amit Kapila.
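
One way to check whether confirm_flush_lsn really moves backward while the
script runs is to poll the slot on the primary. The following is only a
minimal sketch and is not part of the attached script. It assumes that the
slot has the same name as the subscription ("sub") and that $port_primary is
set as in the script above; the helper names lsn_to_int, lsn_moved_backward
and watch_confirmed_flush are made up for illustration.

```shell
# Sketch only: report when a slot's confirmed_flush_lsn moves backward.
# Assumptions (not from the thread): the slot name passed in, and the
# port_primary variable from the reproduction script.

# Convert an LSN of the form "X/Y" (two hex numbers) to one integer.
# Good enough for realistic LSNs; a very large high half would overflow
# the shell's signed 64-bit arithmetic.
lsn_to_int() (
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
)

# Exit status 0 when the second LSN is lower than the first.
lsn_moved_backward() {
    [ "$(lsn_to_int "$2")" -lt "$(lsn_to_int "$1")" ]
}

# Poll pg_replication_slots once per second and print each backward step.
watch_confirmed_flush() {
    slot=$1
    prev=""
    while sleep 1; do
        cur=$(psql -d postgres -p "$port_primary" -Atc \
            "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = '$slot'")
        [ -z "$cur" ] && continue
        if [ -n "$prev" ] && lsn_moved_backward "$prev" "$cur"; then
            echo "confirmed_flush_lsn moved backward: $prev -> $cur"
        fi
        prev=$cur
    done
}
```

Running `watch_confirmed_flush sub` in a second terminal before starting the
script would show whether, and at which step, the value goes backward. A
one-second poll can miss a short-lived regression, so a negative result from
this sketch would not rule the behavior out.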