On 10/08/2010 12:05 PM, Dimitri Fontaine wrote:
> Markus Wanner <mar...@bluegap.ch> writes:
>> ..and a whole lot of manual work, that's prone to error for something
>> that could easily be automated
>
> So, the master just crashed, first standby is dead and second ain't in
> sync. What's the easy and automated way out? Sorry, I need a hand here.
Thinking this through, I'm realizing that this can potentially work automatically with three nodes in both cases. Each node needs to keep track of whether or not it is (or became) the master - and when (a Lamport timestamp, perhaps, not necessarily wall clock time; see the first sketch in the PS below).

A new master might continue to commit new transactions after a fail-over, without the old master being able to record that fact (because it's down). This means there's a different requirement after a full-cluster crash (i.e. master failure and no up-to-date standby available): with the timeout, you absolutely need the former master to come back up again for zero data loss, no matter what your quorum_commit setting was. To be able to tell automatically who the most recent master was, you need to query the state of all other nodes, because any of them could be a more recent master. If that's not possible (or not feasible, because the replacement part isn't currently available), you are at risk of data loss.

With the given three node scenario, the zero data loss guarantee only holds as long as either at least one node that is in sync keeps running, or you can recover the former master after a full-cluster crash.

When waiting forever, you only need one of the k nodes to come back up again. You also need to query other nodes to find out which of the N nodes form the k, but being able to recover (N - k + 1) nodes is sufficient to figure that out. So any (k - 1) nodes may fail, even permanently, at any point in time, and you are still not at risk of losing data (nor at risk of losing availability, BTW; the second sketch in the PS walks through that arithmetic). I'm still of the opinion that that's by far the easier and clearer guarantee.

Also note that with higher values of N this gets more and more important, because the chance of being able to recover all N nodes after a full crash shrinks as N increases (while the time required to do so grows). But maybe the current sync rep feature doesn't need to target setups with that many nodes.

I certainly agree that either way is complicated to implement. With Postgres-R, I'm clearly going the way that's able to satisfy large numbers of nodes.

Thanks for an interesting discussion. And for respectful disagreement.

Regards

Markus Wanner
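
PS: here's a rough sketch, in toy C (nothing like actual PostgreSQL code; node states and function names are made up for illustration), of the epoch bookkeeping I have in mind: every node remembers the epoch at which it last acted as master, bumped Lamport-style on promotion. After a full-cluster crash, the highest epoch among the recovered nodes identifies the most recent master - but only if every node could be queried.

#include <stdio.h>
#include <stdbool.h>

#define N_NODES 3

typedef struct
{
    int  node_id;
    long master_epoch;   /* Lamport counter, bumped on promotion */
    bool reachable;      /* could we query this node after the crash? */
} NodeState;

/* Promotion: the new master adopts an epoch higher than any it has seen. */
static void
promote(NodeState *node, long highest_seen_epoch)
{
    node->master_epoch = highest_seen_epoch + 1;
}

/*
 * Pick the recovery master: the node with the highest epoch.  Returns -1
 * if any node is unreachable, because the missing node might have been a
 * more recent master -- exactly the data loss risk described above.
 */
static int
pick_recovery_master(NodeState *nodes, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++)
    {
        if (!nodes[i].reachable)
            return -1;      /* cannot rule out a more recent master */
        if (best < 0 || nodes[i].master_epoch > nodes[best].master_epoch)
            best = i;
    }
    return (best >= 0) ? nodes[best].node_id : -1;
}

int
main(void)
{
    NodeState nodes[N_NODES] = {
        {1, 0, true}, {2, 0, true}, {3, 0, true}
    };

    promote(&nodes[0], 0);    /* node 1 becomes master (epoch 1) */
    promote(&nodes[1], 1);    /* fail-over: node 2 takes over (epoch 2) */

    /* full-cluster crash, all three nodes recovered: prints 2 */
    printf("recovery master: %d\n", pick_recovery_master(nodes, N_NODES));

    /* same crash, but node 2 is gone for good: prints -1, possible data loss */
    nodes[1].reachable = false;
    printf("recovery master: %d\n", pick_recovery_master(nodes, N_NODES));

    return 0;
}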
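
PPS: and the k of N arithmetic, same disclaimer (a toy model, not an implementation): a commit is acknowledged once k of the N nodes hold it, so an acknowledged commit survives as long as any one of those k nodes can be recovered. By the pigeonhole principle, any (N - k + 1) recovered nodes must include at least one of the k, which is why up to (k - 1) permanent failures cannot lose an acknowledged commit.

#include <stdio.h>
#include <stdbool.h>

#define N 5

/* An acknowledged commit survives iff at least one acker is recovered. */
static bool
commit_survives(const bool has_commit[], const bool recovered[], int n)
{
    for (int i = 0; i < n; i++)
        if (has_commit[i] && recovered[i])
            return true;
    return false;
}

int
main(void)
{
    /* k = 3 of N = 5: nodes 0..2 acknowledged the commit */
    bool has_commit[N] = {true, true, true, false, false};

    /* (k - 1) = 2 permanent failures: nodes 0 and 1 never come back */
    bool recovered[N] = {false, false, true, true, true};
    printf("survives 2 failures: %s\n",
           commit_survives(has_commit, recovered, N) ? "yes" : "no");

    /* a third failure is one too many: node 2 is lost as well */
    recovered[2] = false;
    printf("survives 3 failures: %s\n",
           commit_survives(has_commit, recovered, N) ? "yes" : "no");

    return 0;
}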