On Mon, Sep 12, 2022 at 05:58:08PM +0530, Bharath Rupireddy wrote: > On Mon, Sep 12, 2022 at 12:30 PM Michael Paquier <mich...@paquier.xyz> wrote: > > On Sat, Sep 10, 2022 at 07:52:01AM +0530, Bharath Rupireddy wrote: > > > Just for the records - there's another report of the assertion failure > > > at [1], many thanks to Kyotaro-san for providing consistent > > > reproducible steps. > > > > > > [1] - > > > https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com
I plan to use that message's patch, because it guarantees WALRCV_STOPPED at the code location being changed. Today, in the unlikely event of !WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code proceeds without waiting for WALRCV_STOPPED. I think WALRCV_STOPPING can't happen there. Even if WALRCV_WAITING also can't happen, it's a simplicity win to remove the need to think about that. Dilip Kumar and Nathan Bossart also stated support for that design. If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I perhaps should back-patch the change to released versions. Does anyone know whether one or both can happen? > > FWIW, I > > find better the approach taken by Horiguchi-san in [1] to reset the > > state before we attempt to read WAL from the archives *or* pg_wal, > > after we know that the last source has failed, where we know that we > > are not streaming yet (but recovery may be requested soon). > > > > [1]: > > https://www.postgresql.org/message-id/20220214.171428.735280610520122187.horikyota....@gmail.com > > I think the fix at [1] is wrong. It basically resets > InstallXLogFileSegmentActive for both XLOG_FROM_ARCHIVE and > XLOG_FROM_PG_WAL, remember that we don't need WAL files preallocation > and recycling just for archive. Moreover, if we were to reset just for > archive there, it's too aggressive, meaning for every WAL record, we > take ControlFileLock in exclusive mode to reset it. > > IMO, doing it once when we switch the source to archive is the correct > way because it avoids frequent ControlFileLock acquisitions and makes > the fix more intuitive. And we have a comment saying why we reset > InstallXLogFileSegmentActive, if required we can point to the commit > cc2c7d6 in the comments. The startup process maintains the invariant that archive recovery and streaming recovery don't overlap in time; one stops before the other starts. As long as it achieves that, I'm not aware of a cause to care whether the flag reset happens when streaming ends vs. when archive recovery begins. If the startup process develops a defect such that it ceases to maintain that invariant, "when archive recovery begins" does have the advantage-or-disadvantage of causing a failure in non-assert builds. The other way gets only an assertion failure. Michael Paquier or Bharath Rupireddy, did you envision benefits other than that one?