Re: PANIC during crash recovery of a recently promoted standby

Pavan Deolasee Mon, 14 May 2018 00:46:00 -0700

On Fri, May 11, 2018 at 8:39 PM, Alvaro Herrera <alvhe...@2ndquadrant.com>
wrote:


> Michael Paquier wrote:
> > On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:
> > > I propose that we should always clear the minRecoveryPoint after
> promotion
> > > to ensure that crash recovery always run to the end if a just-promoted
> > > standby crashes before completing its first regular checkpoint. A WIP
> patch
> > > is attached.
> >
> > I have been playing with your patch and upgraded the test to check as
> > well for cascading standbys.  We could use that in the final patch.
> > That's definitely something to add in the recovery test suite, and the
> > sleep phases should be replaced by waits on replay and/or flush.
> >
> > Still, that approach looks sensitive to me.  A restart point could be
> > running while the end-of-recovery record is inserted, so your patch
> > could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
> > point would happily update again the control file's minRecoveryPoint to
> > lastCheckPointEndPtr because it would see that the former is older than
> > lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
> > you could still crash on invalid pages?
>
> Yeah, I had this exact comment, but I was unable to come up with a test
> case that would cause a problem.
>

Looks like I didn't understand Alvaro's comment when he mentioned it to me
off-list. But I now see what Michael and Alvaro mean and that indeed seems
like a problem. I was thinking that the test for (ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated
after the standby is promoted. While that's true for a DB_IN_PRODUCTION,  the
RestartPoint may finish after we have written end-of-recovery record, but
before we're in production and thus the minRecoveryPoint may again be set.


>
> > I need to think a bit more about that stuff, but one idea would be to
> > use a special state in the control file to mark it as ending recovery,
> > this way we would control race conditions with restart points.
>
> Hmm.  Can we change the control file in released branches?  (It should
> be possible to make the new server understand both old and new formats,
> but I think this is breaking new ground and it looks easy to introduce
> more bugs there.)
>
>
Can't we just remember that in shared memory state instead of writing to
the control file? So if we've already performed end-of-recovery, we don't
update the minRecoveryPoint when RestartPoint completes.

Thanks,
Pavan

-- 
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PANIC during crash recovery of a recently promoted standby

Reply via email to