At Wed, 27 Apr 2022 14:16:01 +0900, Michael Paquier <mich...@paquier.xyz> wrote in > On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote: > > On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote: > >> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart > >> <nathandboss...@gmail.com> wrote in > >>> I suspect we'll start seeing this problem more often once end-of-recovery > >>> checkpoints are removed [0]. Would you mind creating a commitfest entry > >>> for this thread? I didn't see one. > >> > >> I'm not sure the patch makes any change here, because restart points > >> don't run while crash recovery, since no checkpoint records seen > >> during a crash recovery. Anyway the patch doesn't apply anymore so > >> rebased, but only the one for master for the lack of time for now. > > > > Thanks for the new patch! Yeah, it wouldn't affect crash recovery, but > > IIUC Robert's patch also applies to archive recovery. > > > >> + /* Update pg_control */ > >> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); > >> + > >> /* > >> * Remember the prior checkpoint's redo ptr for > >> * UpdateCheckPointDistanceEstimate() > >> */ > >> PriorRedoPtr = ControlFile->checkPointCopy.redo; > > > > nitpick: Why move the LWLockAcquire() all the way up here? > > Yeah, that should not be necessary. InitWalRecovery() is the only > place outside the checkpointer that would touch this field, but that > happens far too early in the startup sequence to matter with the > checkpointer.
Yes it is not necessary. I just wanted to apparently ensure not to access ControlFile outside ControlFileLoc. So I won't be opposed to reverting it since, as you say, it is *actuall* safe.. > >> + Assert (PriorRedoPtr < RedoRecPtr); > > > > I think this could use a short explanation. > > That's just to make sure that the current redo LSN is always older > than the one prior that. It does not seem really necessary to me to > add that. Just after we call UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr). Don't we really need any safeguard against giving a wrap-arounded (in other words, treamendously large) value to the function? Actually it doesn't seem to happen now, but I don't confident that that never ever happens. That being said, I'm a minority here, so removed it. > >> + /* > >> + * Aarchive recovery has ended. Crash recovery ever after should > >> + * always recover to the end of WAL > >> + */ > > s/Aarchive/Archive/. Oops! Fixed. > >> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr; > >> + ControlFile->minRecoveryPointTLI = 0; > >> + > >> + /* also update local copy */ > >> + LocalMinRecoveryPoint = InvalidXLogRecPtr; > >> + LocalMinRecoveryPointTLI = 0; > > > > Should this be handled by the code that changes the control file state to > > DB_IN_PRODUCTION instead? It looks like this is ordinarily done in the > > next checkpoint. It's not clear to me why it is done this way. > > Anyway, that would be the work of the end-of-recovery checkpoint > requested at the end of StartupXLOG() once a promotion happens or of > the checkpoint requested by PerformRecoveryXLogAction() in the second > case, no? So, I don't quite see why we need to update Eventually the work is done by StartupXLOG(). So we don't need to do that at all even in CreateCheckPoint(). If we expect that the end-of-recovery checkpoint clears it, that won't happen if if the last restart point takes so long time that the end-of-recovery checkpoint request is ignored. If DB_IN_ARCHIVE_RECOVERY ended while the restart point is running, it is highly possible that the end-of-recovery checkpoint trigger is ignored. In that case the values are cleard at the next checkpoint. In short, if we want to reset them at so-called end-of-recovery checkpoint, we should do that also in CreateRecoveryPoint. So, it is not removed in this version. > minRecoveryPoint and minRecoveryPointTLI in the control file here, as > much as this does not have to be part of the end-of-recovery code > that switches the control file to DB_IN_PRODUCTION. > > - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && > - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) > - { > 7ff23c6 has removed the last call to CreateCheckpoint() outside the > checkpointer, meaning that there is one less concurrent race to worry > about, but I have to admit that this change, to update the control Sure. In very early stage the reasoning to rmove the code was it. And the rason for proposing to back-patch the same to older versions is basing on the further investigation, and I'm not fully confident that for the earlier versions. > file's checkPoint and checkPointCopy even if we don't check after > ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the > code less robust in ~14. So I am questioning whether a backpatch > is actually worth the risk here. Agreed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center