Hi,

The FATAL error "recovery ended before configured recovery target was reached", introduced by the commit at [1] in PG 14, causes the standby to go down after it has already spent a good amount of time in recovery. The WAL required to reach the recovery target may take a while to arrive at the standby from the archive location, and having the standby fail with the FATAL error in the meantime isn't good.

Instead, how about we make the standby wait for a configurable amount of time (a new GUC) so that it can keep looking for the required WAL? If it gets the required WAL within the wait time, it reaches the recovery target and no FATAL error is raised. If it doesn't, the timeout expires and the standby fails with the FATAL error, as it does today. The GUC could be set to roughly the average time it takes for WAL to travel from the primary to the archive location plus from the archive location to the standby. The default would be 0, i.e. disabled.
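To make the proposed behavior concrete, here is a rough sketch of the retry loop in Python. It is only an illustration of the idea, not the C code in the patch: the names `wait_for_required_wal`, `fetch_required_wal`, `retry_timeout_s` and `retry_interval_s` are made up, and the fixed polling interval is an assumption.

```python
import time
from typing import Callable


class RecoveryTargetNotReached(Exception):
    """Stands in for the server's FATAL error in this sketch."""


def wait_for_required_wal(
    fetch_required_wal: Callable[[], bool],   # hypothetical: True once the WAL is available
    retry_timeout_s: float,                   # hypothetical GUC value; 0 disables retrying
    retry_interval_s: float = 1.0,            # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Keep looking for the required WAL until the timeout expires."""
    # The first attempt always happens, exactly as without the timeout.
    if fetch_required_wal():
        return

    deadline = clock() + retry_timeout_s
    while clock() < deadline:
        # Never sleep past the deadline.
        sleep(min(retry_interval_s, max(0.0, deadline - clock())))
        if fetch_required_wal():
            return

    # Timeout expired (or retrying is disabled): fail as the server does today.
    raise RecoveryTargetNotReached(
        "recovery ended before configured recovery target was reached"
    )
```

With a timeout of 0 the loop body never runs, so the behavior is the same as today: one failed attempt leads directly to the error.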
I'm attaching a WIP patch. I've tested it on my dev system and the recovery regression tests are passing with it. I will provide a better version later, probably with a test case.

Thoughts?

[1]
commit dc788668bb269b10a108e87d14fefd1b9301b793
Author: Peter Eisentraut <pe...@eisentraut.org>
Date:   Wed Jan 29 15:43:32 2020 +0100

    Fail if recovery target is not reached

    Before, if a recovery target is configured, but the archive ended
    before the target was reached, recovery would end and the server would
    promote without further notice.  That was deemed to be pretty wrong.
    With this change, if the recovery target is not reached, it is a fatal
    error.

    Based-on-patch-by: Leif Gunnar Erlandsen <l...@lako.no>
    Reviewed-by: Kyotaro Horiguchi <horikyota....@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b65383...@lako.no

Regards,
Bharath Rupireddy.
v1-0001-add-retry-mechanism-with-a-GUC-before-failing-the.patch