On Tue, May 6, 2025 at 12:26 AM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > On Mon, May 5, 2025 at 3:59 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada <sawada.m...@gmail.com> > > wrote: > > > > > > While I cannot be entirely certain of my analysis, I believe the root > > > cause might be related to the backward movement of the confirmed_flush > > > LSN. The following scenario seems possible: > > > > > > 1. The walsender enables the two_phase and sets two_phase_at (which > > > should be the same as confirmed_flush). > > > 2. The slot's confirmed_flush regresses for some reason. > > > 3. The slotsync worker retrieves the remote slot information and > > > enables two_phase for the local slot. > > > > > > > Yes, this is possible. Here is my theory as to how it can happen in > > the current case. In the failed test, after the primary has prepared a > > transaction, the transaction won't be replicated to the subscriber as > > two_phase was not enabled for the slot. However, subsequent keepalive > > messages can send the latest WAL location to the subscriber and get > > the confirmation of the same from the subscriber without its origin > > being moved. Now, after we restart the apply worker (due to > > disable/enable for a subscription), it will use the previous > > origin_lsn to temporarily move back the confirmed flush LSN as > > explained in one of the previous emails in another thread [1]. During > > this temporary movement of confirm flush LSN, the slotsync worker > > fetches the two_phase_at and confirm_flush_lsn values, leading to the > > assertion failure. We see this issue intermittently because it depends > > on the timing of slotsync worker's request to fetch the slot's value. > > > > If this theory is correct, then we need something on the lines of what > > Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward) to > > fix it. > > This theory seems plausible. According to the thread you shared, we > didn't address the issue of backward movement in the slot's > confirmed_flush LSN. I think that we should carefully consider whether > this behavior is acceptable in the first place. If we decide to > maintain this behavior, we would need to modify the slotsync to deal > with such scenarios (note we have already ensured that the synced > slot's confirmed_flush cannot regress). However, I'm concerned about a > case like where the server crashes immediately after persisting the > retreated confirmed_flush LSN (along with restart_lsn or xmin updates) > to disk. In such a case, wouldn't the walsender potentially send > duplicate data? >
Yes that could happen but last time when we tried we couldn't reproduce such a case. However, we will try some more to see if we can reproduce a problem even without slotsync. Having said that, we will also try to reproduce the failure with slotsync in the test where BF is showing failure to confirm the above theory. > If so, we would need a patch like Vignesh proposed. > Right, I feel that is the right thing to do, even if we could prove that it can lead to failure in slotsync. -- With Regards, Amit Kapila.