On Thu, Apr 3, 2025 at 11:08 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
> > <houzj.f...@fujitsu.com> wrote:
> > >
> > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:
> > >
> > > >
> > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
> > > > <houzj.f...@fujitsu.com> wrote:
> > > >
> > > > Thank you for the explanation! I agree that the issue happens in these
> > > > cases.
> > > >
> > > > As another idea, I wonder if we could somehow defer to make the synced
> > > > slot as 'sync-ready' until we can ensure that the slot doesn't have
> > > > any transactions that are prepared before the point of enabling
> > > > two_phase. For example, when the slotsync worker fetches the remote
> > > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
> > > > slot's two_phase becomes true or the local slot is newly created with
> > > > enabling two_phase, and then it makes the slot 'sync-ready' once it
> > > > confirmed that the slot's restart_lsn passed LSN-1. Does it work?
> > >
> > > Thanks for the idea!
> > >
> > > We considered a similar approach in [1] to confirm there is no prepared
> > > transactions before two_phase_at, but the issue is that when the
> > > two_phase flag is switched from 'false' to 'true' (as in the case with
> > > (copy_data=true, failover=true, two_phase=true)). In this case, the slot
> > > may have already been marked as sync-ready before the two_phase flag is
> > > enabled, as slotsync is unaware of potential future changes to the
> > > two_phase flag.
> >
> > This can happen because when copy_data is true, tablesync can take a
> > long time to complete the sync and in the meantime, slot without a
> > two_phase flag would have been synced to standby. Such a slot would be
> > marked as sync-ready even if we follow the calculation proposed by
> > Sawada-san. Note that we enable two_phase once all the tables are in
> > ready state (See run_apply_worker() and comments atop worker.c
> > (TWO_PHASE TRANSACTIONS)).
>
> Right. It doesn't make sense to make the slot not-sync-ready and then
> back to sync-ready.
>
> While I agree with the approach for HEAD and it seems difficult to
> find a solution, I'm concerned that disallowing to use both failover
> and two_phase in a minor release would affect users much. Users who
> are already using that combination might end up needing to re-think
> their system architecture. So I'm trying to narrow down use cases
> where we're going to prohibit or to find workarounds.
>
> If we agree with the fix for HEAD, we can push the fix for HEAD first,
> which would be better to be done sooner as it needs to bump the
> catversion. We can discuss the ideas and workarounds for v17 later.
>
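
For reference, the sequence discussed above (a slot marked sync-ready
before two_phase is enabled) can be sketched roughly as below. This is
only an illustration, not PostgreSQL code: SlotSketch, RemoteSketch,
sync_cycle_sketch(), late_two_phase_scenario() and all LSN values are
made up, and it assumes that "passed LSN-1" means restart_lsn >= LSN-1
and that a sync-ready slot is never demoted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

/* Hypothetical view of the remote (primary) slot; not a real struct. */
typedef struct RemoteSketch
{
    bool two_phase;
    Lsn  restart_lsn;
    Lsn  confirmed_flush_lsn;
} RemoteSketch;

/* Hypothetical view of the synced slot on the standby. */
typedef struct SlotSketch
{
    bool two_phase;
    bool sync_ready;
    Lsn  restart_lsn;
    Lsn  wait_lsn;                  /* "LSN-1"; 0 means nothing to wait for */
} SlotSketch;

/* One slotsync cycle following the proposed calculation. */
static void
sync_cycle_sketch(SlotSketch *local, const RemoteSketch *remote,
                  bool newly_created)
{
    assert(local != NULL && remote != NULL);

    /* Remember LSN-1 when two_phase turns on, or on creation with it. */
    if (remote->two_phase && (newly_created || !local->two_phase))
        local->wait_lsn = remote->confirmed_flush_lsn;

    local->two_phase = remote->two_phase;
    local->restart_lsn = remote->restart_lsn;

    /* Only promote; an already sync-ready slot is left as it is. */
    if (!local->sync_ready && local->restart_lsn >= local->wait_lsn)
        local->sync_ready = true;
}

/* copy_data=true, failover=true, two_phase=true, with slow tablesync. */
static SlotSketch
late_two_phase_scenario(void)
{
    RemoteSketch remote = {.two_phase = false, .restart_lsn = 100,
                           .confirmed_flush_lsn = 120};
    SlotSketch   local = {0};

    /* Tablesync still running: the slot is synced without two_phase. */
    sync_cycle_sketch(&local, &remote, true);

    /* All tables ready: two_phase gets enabled at a later LSN. */
    remote.two_phase = true;
    remote.restart_lsn = 150;
    remote.confirmed_flush_lsn = 300;
    sync_cycle_sketch(&local, &remote, false);

    return local;
}
```

With these made-up values the slot returned by
late_two_phase_scenario() is sync-ready with two_phase enabled while
its restart_lsn (150) is still behind LSN-1 (300). A slot created with
two_phase already enabled does wait for LSN-1 under the same rule.
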

Thanks, I'll push the patch for HEAD and then keep thinking about
whether there is a better way to deal with the problem in v17. BTW, the
problem in v17 can happen only in a much narrower set of cases, as
explained in the emails above.

--
With Regards,
Amit Kapila.