At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier <mich...@paquier.xyz> wrote in
> On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> > It has been difficult to get a generic repro, but the way we reproduce
> > is through our test suite. To give more details, we are running tests
> > in which we constantly failover and promote standbys. The issue
> > surfaces after we have gone through a few promotions, which occur
> > every few hours or so (not really important, but to give context).
>
> Hmm.  Could you describe exactly the failover scenario you are using?
> Is the test using a set of cascading standbys linked to the promoted
> one?  Are the standbys recycled from the promoted nodes with pg_rewind
> or created from scratch with a new base backup taken from the
> freshly-promoted primary?  I have been looking more at this thread
> through the day but I don't see a remaining issue.  It could be
> perfectly possible that we are missing a piece related to the handling
> of those new overwrite contrecords in some cases, like in a rewind.

True. That is a significant hint about what happened at the time.

- Are there only two hosts in the replication set? I'm concerned about
  whether it is a cascading set or not.

- Exactly what are you performing at every failover? In particular, do
  the steps include pg_rewind, and do you copy pg_wal and/or archive
  files between the failover hosts?

> > I am adding some additional debugging to see if I can draw a better
> > picture of what is happening. Will also give aborted_contrec_reset_3.patch
> > a go, although I suspect it will not handle the specific case we are dealing
> > with.
>
> Yeah, this is not going to change things much if you are still seeing
> an issue.  This patch does not change the logic, aka it just
> simplifies the tracking of the continuation record data, resetting it
> when a complete record has been read.  Saying that, getting rid of the
> dependency on StandbyMode because we cannot promote in the middle of a
> record is nice (my memories around that were a bit blurry but even
> recovery_target_lsn would not recover in the middle of a continuation
> record), and this is not a bug so there is limited reason to backpatch
> this part of the change.

Agreed. In the first place, my "repro" (or the test case) is a bit too
intricate to happen in the real field.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center