Re: Synchronous commit behavior during network outage

Andrey Borodin Thu, 01 Jul 2021 23:40:04 -0700

> On 2 July 2021, at 10:59, Jeff Davis <[email protected]> wrote:
> 
> On Wed, 2021-06-30 at 17:28 +0500, Andrey Borodin wrote:
>>> My patch also covers the backend termination case. Is there a
>>> reason
>>> you left that case out?
>> 
>> Yes, backend termination is used by HA tool before rewinding the
>> node.
> 
> Can't you just disable sync rep first (using ALTER SYSTEM SET
> synchronous_standby_names=''), which will unstick the backend, and then
> terminate it?
If the failover happens because the node is unresponsive, we cannot simply turn
off sync rep. We need some spare connections for that (the number of stuck
backends will skyrocket during a network partition). We need available
descriptors and some memory to fork a new backend. We will need to re-read the
config. And we need time to try all of this.
In some failures we may lack some of these.
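
To make the dependency chain concrete, here is a rough sketch of the sequence suggested above, written as the ordered SQL an HA tool would have to issue. It only builds the statements and does not connect to anything. The function name and the choice to target all other client backends are assumptions for illustration, not what any particular HA tool does.

```python
# Sketch only: builds the SQL for "disable sync rep, then terminate".
# Nothing here connects to a server; the helper name is hypothetical.

def unstick_and_terminate_plan():
    """Return the ordered statements an HA tool would need to run."""
    return [
        # Needs a free connection slot and a freshly forked backend.
        # ALTER SYSTEM cannot run inside a transaction block.
        "ALTER SYSTEM SET synchronous_standby_names = ''",
        # The new value only takes effect after the config is re-read.
        "SELECT pg_reload_conf()",
        # Assumption: terminate every other client backend once unstuck.
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        "WHERE backend_type = 'client backend' AND pid <> pg_backend_pid()",
    ]


if __name__ == "__main__":
    for step, sql in enumerate(unstick_and_terminate_plan(), start=1):
        print(f"{step}. {sql}")
```

Every step in that list assumes the node can still accept a connection and re-read its config, which is exactly what may be missing during a failure.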

Partial degradation is already a hard task. Without the ability to easily
terminate a running Postgres, the HA tool will often resort to SIGKILL.

> 
> If you don't handle the termination case, then there's still a chance
> for the transaction to become visible to other clients before it's
> replicated.
Termination is an admin command; admins know what they are doing.
Cancellation is part of the user protocol.

BTW, can we have two GUCs, so that HA tool developers can decide for themselves
which guarantees they provide?

> 
>> There is one more caveat we need to fix: we should prevent instant
>> recovery from happening.
> 
> That can already be done with the restart_after_crash GUC.

Oh, I didn't know about it; we will use it. Thanks!


Best regards, Andrey Borodin.