On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc...@gmail.com> wrote:
> >
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> > Latest checkpoint's TimeLineID: 4
> > Latest checkpoint's PrevTimeLineID: 4
> > ...
> > Min recovery ending loc's timeline: 5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1:
> > https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but helps
> pg_rewind to be self-reliant i.e. avoiding external actor to detect
> the error and issue checkpoint the new primary to be able to
> successfully run pg_rewind on the old primary and repair it to use it
> as a new standby.

That's what I had suggested as a "further improvement" option in the
last paragraph :)

But I think agreement on this more basic solution would still be good
(even if I add the automatic checkpointing in this thread). Given that
we currently explicitly misinform the user of pg_rewind, I think this
is a bug that should be considered for backpatching, and the simpler
"fail if detected" patch is probably the only thing we could backpatch.

Thanks for taking a look,
James Coleman
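
For anyone who wants to check a server for the condition described
above before running pg_rewind, here is a minimal sketch that inspects
pg_controldata output. It is not the attached patch (that lives in
pg_rewind itself), and it rests on assumptions: the function names are
hypothetical, the field labels are the ones quoted above, and the
comparison (minimum recovery timeline ahead of the latest checkpoint's
timeline) is only one reading of the control file state shown there.

```python
def parse_controldata(text):
    """Turn 'label: value' lines from pg_controldata output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as timestamps may
        # contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def promoted_without_checkpoint(text):
    """Hypothetical check (an assumption, not the patch's logic).

    Returns True when the minimum recovery ending location is on a newer
    timeline than the latest checkpoint, i.e. the server was promoted but
    has not yet completed a checkpoint on the new timeline.
    """
    fields = parse_controldata(text)
    checkpoint_tli = int(fields["Latest checkpoint's TimeLineID"])
    min_recovery_tli = int(fields["Min recovery ending loc's timeline"])
    return min_recovery_tli > checkpoint_tli


# Control file state quoted in the report above.
JUST_PROMOTED = """\
Latest checkpoint's TimeLineID:       4
Latest checkpoint's PrevTimeLineID:   4
Min recovery ending loc's timeline:   5
"""

# Illustrative state after a checkpoint has completed on timeline 5
# (values are made up for the example).
AFTER_CHECKPOINT = """\
Latest checkpoint's TimeLineID:       5
Latest checkpoint's PrevTimeLineID:   4
Min recovery ending loc's timeline:   0
"""
```

With the first sample the check returns True, which is the case where
the new primary needs a checkpoint before pg_rewind can be trusted to
pick the right timeline; with the second it returns False.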