Re: Assertion failure in WaitForWALToBecomeAvailable state machine

Noah Misch Mon, 12 Sep 2022 20:23:20 -0700

On Mon, Sep 12, 2022 at 05:58:08PM +0530, Bharath Rupireddy wrote:
> On Mon, Sep 12, 2022 at 12:30 PM Michael Paquier <[email protected]> wrote:
> > On Sat, Sep 10, 2022 at 07:52:01AM +0530, Bharath Rupireddy wrote:
> > > Just for the records - there's another report of the assertion failure
> > > at [1], many thanks to Kyotaro-san for providing consistent
> > > reproducible steps.
> > >
> > > [1] - 
> > > https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com


I plan to use that message's patch, because it guarantees WALRCV_STOPPED at
the code location being changed.  Today, in the unlikely event of
!WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code
proceeds without waiting for WALRCV_STOPPED.  I think WALRCV_STOPPING can't
happen there.  Even if WALRCV_WAITING also can't happen, it's a simplicity win
to remove the need to think about that.  Dilip Kumar and Nathan Bossart also
stated support for that design.

If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I
perhaps should back-patch the change to released versions.  Does anyone know
whether one or both can happen?

> > FWIW, I
> > find better the approach taken by Horiguchi-san in [1] to reset the
> > state before we attempt to read WAL from the archives *or* pg_wal,
> > after we know that the last source has failed, where we know that we
> > are not streaming yet (but recovery may be requested soon).
> >
> > [1]: 
> > https://www.postgresql.org/message-id/[email protected]
> 
> I think the fix at [1] is wrong. It basically resets
> InstallXLogFileSegmentActive for both XLOG_FROM_ARCHIVE and
> XLOG_FROM_PG_WAL, remember that we don't need WAL files preallocation
> and recycling just for archive. Moreover, if we were to reset just for
> archive there, it's too aggressive, meaning for every WAL record, we
> take ControlFileLock in exclusive mode to reset it.
> 
> IMO, doing it once when we switch the source to archive is the correct
> way because it avoids frequent ControlFileLock acquisitions and makes
> the fix more intuitive. And we have a comment saying why we reset
> InstallXLogFileSegmentActive, if required we can point to the commit
> cc2c7d6 in the comments.

The startup process maintains the invariant that archive recovery and
streaming recovery don't overlap in time; one stops before the other starts.
As long as it achieves that, I'm not aware of a cause to care whether the flag
reset happens when streaming ends vs. when archive recovery begins.  If the
startup process develops a defect such that it ceases to maintain that
invariant, "when archive recovery begins" does have the
advantage-or-disadvantage of causing a failure in non-assert builds.  The
other way gets only an assertion failure.  Michael Paquier or Bharath
Rupireddy, did you envision benefits other than that one?

Re: Assertion failure in WaitForWALToBecomeAvailable state machine

Reply via email to