At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier <mich...@paquier.xyz> wrote:
> On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> > It has been difficult to get a generic repro, but the way we reproduce
> > it is through our test suite. To give more details, we are running tests
> > in which we constantly fail over and promote standbys. The issue
> > surfaces after we have gone through a few promotions, which occur
> > every few hours or so (not really important, but to give context).
> 
> Hmm.  Could you describe exactly the failover scenario you are using?
> Is the test using a set of cascading standbys linked to the promoted
> one?  Are the standbys recycled from the promoted nodes with pg_rewind
> or created from scratch with a new base backup taken from the
> freshly-promoted primary?  I have been looking more at this thread
> through the day but I don't see a remaining issue.  It could be
> perfectly possible that we are missing a piece related to the handling
> of those new overwrite contrecords in some cases, like in a rewind.
> 
> > I am adding some additional debugging to see if I can draw a better
> > picture of what is happening. Will also give aborted_contrec_reset_3.patch
> > a go, although I suspect it will not handle the specific case we are dealing
> > with.
> 
> Yeah, this is not going to change things much if you are still seeing
> an issue.  This patch does not change the logic, aka it just

True. That is a significant hint about what happened at the time.

- Are there only two hosts in the replication set?  I am wondering
  whether it is a cascading setup or not.

- What exactly do you do at each failover?  In particular, do the
  steps include pg_rewind, and do you copy pg_wal and/or archive
  files between the failover hosts?

> simplifies the tracking of the continuation record data, resetting it
> when a complete record has been read.  Saying that, getting rid of the
> dependency on StandbyMode because we cannot promote in the middle of a
> record is nice (my memories around that were a bit blurry but even
> recovery_target_lsn would not recover in the middle of a continuation
> record), and this is not a bug so there is limited reason to backpatch
> this part of the change.

Agreed.  In the first place, my "repro" (or the test case) is a bit too
intricate to happen in the field.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

