> On 21 Jan 2025, at 16:47, Roman Eskin <r.es...@arenadata.io> wrote:
> 
>> 
>> Persisting recovery signal file for some _timeout_ seems super dangerous to 
>> me. In distributed systems every extra _timeout_ is a source of complexity, 
>> uncertainty and despair.
> 
> The approach is not about persisting the signal files for some timeout. 
> Currently the files are removed in StartupXLOG() before 
> writeTimeLineHistory() and PerformRecoveryXLogAction() are called. The 
> suggestion is to move the file removal after PerformRecoveryXLogAction() 
> inside StartupXLOG().

Sending a node into a repeated promote-fail cycle without resolving the root
cause seems like an even less appealing idea.
If something prevented promotion, why should we retry by this particular method?

Even in the case of a transient failure like the one you described (power
loss), it does not sound like a very good idea to retry promotion after the
node comes back online. The user will get an unexpected split-brain.


Best regards, Andrey Borodin.
