Hi Roman! Thanks for raising the issue. I think the root cause is that many systems assume that a higher timeline ID means a more recent timeline write. This invariant does not hold. It does not even mean a more recent timeline start. The "latest timeline" effectively means a "random timeline".
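For illustration: suppose one instance gets promoted and creates timeline 2, leaving a history file like this (contents reconstructed from the LSNs in the error quoted below, so treat it as a sketch):

    $ cat 00000002.history
    1	0/3023538	no recovery target specified

This records that timeline 2 forked off timeline 1 at 0/3023538. If the other instance keeps writing on timeline 1 past that point (the latest checkpoint at 0/303FF90 in the error below shows exactly this), then timeline 1 holds newer data than timeline 2, despite having the lower ID.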
> On 17 Jan 2025, at 06:05, Roman Eskin <r.es...@arenadata.io> wrote:
>
> 5. Switch back instance_1 and instance_2 to the original
> configuration. And here, when we try to start instance_2 as Replica,
> we'll get a FATAL:
> "FATAL: requested timeline 2 is not a child of this server's history
> DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
> history of the requested timeline, the server forked off from that
> timeline at 0/3023538."

I think here you can just specify the target timeline explicitly for the standby instance_1, and it will continue recovery from instance_2 (see the config sketch at the end of this message).

Having said that, I must admit that we observe something similar approximately twice a week; we have tried several fixes, but still have to live with it. In our case we have a "resetup" cron job, which automatically rebuilds the replica from a backup if Postgres cannot start recovery for some hours. So in our case this looks like an extra 3 hours of standby downtime. I'm not sure whether this is a result of pgconsul not setting the target timeline or some other error...

Persisting the recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems, every extra _timeout_ is a source of complexity, uncertainty and despair.

Thanks!

Best regards, Andrey Borodin.
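PS: A minimal sketch of what I mean by specifying the target timeline, assuming the standby should follow timeline 2 (the timeline ID here is taken from the error message above; adapt it to the actual history):

    # On the standby, in postgresql.conf (or postgresql.auto.conf):
    # Pin the timeline to follow instead of relying on the default 'latest',
    # which just picks the numerically highest timeline it can find.
    recovery_target_timeline = '2'

With this set, after a restart the standby should follow timeline 2's history rather than whatever timeline happens to look "latest".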