Re: Synchronous commit behavior during network outage

Andrey Borodin Wed, 30 Jun 2021 05:28:47 -0700

> On 29 June 2021, at 23:35, Jeff Davis <[email protected]> wrote:
> 
> On Tue, 2021-06-29 at 11:48 +0500, Andrey Borodin wrote:
>>> On 29 June 2021, at 03:56, Jeff Davis <[email protected]> wrote:
>>> 
>>> The patch may be somewhat controversial, so I'll wait for feedback
>>> before documenting it properly.
>> 
>> The patch seems similar to [0]. But I like your wording :)
>> I'd be happy if we go with any version of this idea.
> 
> Thank you, somehow I missed that one, we should combine the CF entries.
> 
> My patch also covers the backend termination case. Is there a reason
> you left that case out?
Yes, the HA tool terminates backends before rewinding the node.
Initially I treated termination as PANIC and got a ton of core dumps
during failover drills.

There is one more caveat we need to fix: we should prevent instant recovery
from happening. The HA tool must know that our process was restarted.
Consider the following scenario:
1. Node A is primary with sync rep.
2. A is cut off by a network partition; elsewhere, node B is promoted.
3. All backends of A are stuck in sync rep until the HA tool discovers that A
is the failed node.
4. One backend crashes with a segfault in some buggy extension, or OOM, or
whatever.
5. The Postgres server does crash recovery without restarting the postmaster,
making local-but-not-replicated data visible.

We should prevent step 5 as well, just as we prevent cancels. The HA tool will
discover the postmaster failure and will recheck with the coordination system
whether it can bring Postgres up locally.
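
The scenario can be pictured with a toy model. This is only a sketch in Python, not PostgreSQL code: the class ToyPrimary, its fields, and run_scenario are hypothetical names made up for illustration. The flag is named after the existing restart_after_crash setting by analogy; whether that setting is the right mechanism for the fix is an assumption, not something stated in this message.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrimary:
    """Toy model of a primary with sync rep (hypothetical, not PostgreSQL code)."""
    restart_after_crash: bool = True
    postmaster_running: bool = True
    visible: set = field(default_factory=set)   # commits other sessions can see
    waiting: set = field(default_factory=set)   # flushed locally, no standby ack yet

    def commit(self, xid: int, standby_reachable: bool) -> None:
        if standby_reachable:
            self.visible.add(xid)    # acknowledged by the standby
        else:
            self.waiting.add(xid)    # backend is stuck in the sync rep wait

    def backend_crash(self) -> None:
        """One backend dies (segfault, OOM, ...)."""
        if not self.restart_after_crash:
            # Postmaster exits; the HA tool sees the failure and decides.
            self.postmaster_running = False
            return
        # In-place crash recovery replays local WAL, so commits that were
        # never replicated become visible (step 5 of the scenario).
        self.visible |= self.waiting
        self.waiting.clear()


def run_scenario(restart_after_crash: bool) -> ToyPrimary:
    node_a = ToyPrimary(restart_after_crash=restart_after_crash)
    node_a.commit(1, standby_reachable=True)    # before the partition
    node_a.commit(2, standby_reachable=False)   # during the partition
    node_a.backend_crash()
    return node_a
```

With run_scenario(True), transaction 2 ends up in visible although it never reached a standby. With run_scenario(False), the postmaster stays down and transaction 2 stays hidden until the HA tool has rechecked with the coordination system.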

Thanks!

Best regards, Andrey Borodin.