Hi Roman! Thanks for raising the issue. I think the root cause is that many systems assume that a higher timeline ID means a more recent timeline write. This invariant does not hold. It does not even mean a more recent timeline start. The "latest timeline" effectively means a "random timeline".
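For illustration: suppose one instance gets promoted and creates timeline 2, leaving a history file like this (contents reconstructed from the LSNs in the error quoted below, so treat it as a sketch):

    $ cat 00000002.history
    1	0/3023538	no recovery target specified

This records that timeline 2 forked off timeline 1 at 0/3023538. If the other instance keeps writing on timeline 1 past that point (the latest checkpoint at 0/303FF90 in the error below shows exactly this), then timeline 1 holds newer data than timeline 2, despite having the lower ID.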
> On 17 Jan 2025, at 06:05, Roman Eskin <r.es...@arenadata.io> wrote:
>
> 5. Switch back instance_1 and instance_2 to the original
> configuration. And here, when we try to start instance_2 as Replica,
> we'll get a FATAL:
> "FATAL: requested timeline 2 is not a child of this server's history
> DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
> history of the requested timeline, the server forked off from that
> timeline at 0/3023538."

I think here you can just specify the target timeline explicitly for the standby instance_1, and it will continue recovery from instance_2 (see the config sketch at the end of this message).

Having said that, I must admit that we observe something similar approximately twice a week; we have tried several fixes, but still have to live with it. In our case we have a "resetup" cron job, which automatically rebuilds the replica from a backup if Postgres cannot start recovery for some hours. So in our case this looks like an extra 3 hours of standby downtime. I'm not sure whether this is a result of pgconsul not setting the target timeline or some other error...

Persisting the recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems, every extra _timeout_ is a source of complexity, uncertainty and despair.

Thanks!

Best regards, Andrey Borodin.
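PS: A minimal sketch of what I mean by specifying the target timeline, assuming the standby should follow timeline 2 (the timeline ID here is taken from the error message above; adapt it to the actual history):

    # On the standby, in postgresql.conf (or postgresql.auto.conf):
    # Pin the timeline to follow instead of relying on the default 'latest',
    # which just picks the numerically highest timeline it can find.
    recovery_target_timeline = '2'

With this set, after a restart the standby should follow timeline 2's history rather than whatever timeline happens to look "latest".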