On 10/08/2010 12:05 PM, Dimitri Fontaine wrote:
> Markus Wanner <mar...@bluegap.ch> writes:
>> ..and a whole lot of manual work, that's prone to error for something
>> that could easily be automated
>
> So, the master just crashed, first standby is dead and second ain't in
> sync. What's the easy and automated way out? Sorry, I need a hand here.
Thinking this through, I'm realizing that this can potentially work automatically with three nodes in both cases. Each node needs to keep track of whether or not it is (or became) the master - and when (a Lamport timestamp, perhaps, not necessarily wall clock time; see the first sketch in the PS below).

A new master might continue to commit new transactions after a fail-over, without the old master being able to record that fact (because it's down). This means there's a different requirement after a full-cluster crash (i.e. master failure and no up-to-date standby available): with the timeout, you absolutely need the former master to come back up again for zero data loss, no matter what your quorum_commit setting was. To be able to tell automatically who the most recent master was, you need to query the state of all other nodes, because any of them could be a more recent master. If that's not possible (or not feasible, because the replacement part isn't currently available), you are at risk of data loss.

With the given three node scenario, the zero data loss guarantee only holds as long as either at least one node that is in sync keeps running, or you can recover the former master after a full-cluster crash.

When waiting forever, you only need one of the k nodes to come back up again. You also need to query other nodes to find out which of the N nodes form the k, but being able to recover (N - k + 1) nodes is sufficient to figure that out. So any (k - 1) nodes may fail, even permanently, at any point in time, and you are still not at risk of losing data (nor at risk of losing availability, BTW; the second sketch in the PS walks through that arithmetic). I'm still of the opinion that that's by far the easier and clearer guarantee.

Also note that with higher values of N this gets more and more important, because the chance of being able to recover all N nodes after a full crash shrinks as N increases (while the time required to do so grows). But maybe the current sync rep feature doesn't need to target setups with that many nodes.

I certainly agree that either way is complicated to implement. With Postgres-R, I'm clearly going the way that's able to satisfy large numbers of nodes.

Thanks for an interesting discussion. And for respectful disagreement.

Regards

Markus Wanner
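
PS: here's a rough sketch, in toy C (nothing like actual PostgreSQL code; node states and function names are made up for illustration), of the epoch bookkeeping I have in mind: every node remembers the epoch at which it last acted as master, bumped Lamport-style on promotion. After a full-cluster crash, the highest epoch among the recovered nodes identifies the most recent master - but only if every node could be queried.

#include <stdio.h>
#include <stdbool.h>

#define N_NODES 3

typedef struct
{
    int  node_id;
    long master_epoch;   /* Lamport counter, bumped on promotion */
    bool reachable;      /* could we query this node after the crash? */
} NodeState;

/* Promotion: the new master adopts an epoch higher than any it has seen. */
static void
promote(NodeState *node, long highest_seen_epoch)
{
    node->master_epoch = highest_seen_epoch + 1;
}

/*
 * Pick the recovery master: the node with the highest epoch.  Returns -1
 * if any node is unreachable, because the missing node might have been a
 * more recent master -- exactly the data loss risk described above.
 */
static int
pick_recovery_master(NodeState *nodes, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++)
    {
        if (!nodes[i].reachable)
            return -1;      /* cannot rule out a more recent master */
        if (best < 0 || nodes[i].master_epoch > nodes[best].master_epoch)
            best = i;
    }
    return (best >= 0) ? nodes[best].node_id : -1;
}

int
main(void)
{
    NodeState nodes[N_NODES] = {
        {1, 0, true}, {2, 0, true}, {3, 0, true}
    };

    promote(&nodes[0], 0);    /* node 1 becomes master (epoch 1) */
    promote(&nodes[1], 1);    /* fail-over: node 2 takes over (epoch 2) */

    /* full-cluster crash, all three nodes recovered: prints 2 */
    printf("recovery master: %d\n", pick_recovery_master(nodes, N_NODES));

    /* same crash, but node 2 is gone for good: prints -1, possible data loss */
    nodes[1].reachable = false;
    printf("recovery master: %d\n", pick_recovery_master(nodes, N_NODES));

    return 0;
}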
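
PPS: and the k of N arithmetic, same disclaimer (a toy model, not an implementation): a commit is acknowledged once k of the N nodes hold it, so an acknowledged commit survives as long as any one of those k nodes can be recovered. By the pigeonhole principle, any (N - k + 1) recovered nodes must include at least one of the k, which is why up to (k - 1) permanent failures cannot lose an acknowledged commit.

#include <stdio.h>
#include <stdbool.h>

#define N 5

/* An acknowledged commit survives iff at least one acker is recovered. */
static bool
commit_survives(const bool has_commit[], const bool recovered[], int n)
{
    for (int i = 0; i < n; i++)
        if (has_commit[i] && recovered[i])
            return true;
    return false;
}

int
main(void)
{
    /* k = 3 of N = 5: nodes 0..2 acknowledged the commit */
    bool has_commit[N] = {true, true, true, false, false};

    /* (k - 1) = 2 permanent failures: nodes 0 and 1 never come back */
    bool recovered[N] = {false, false, true, true, true};
    printf("survives 2 failures: %s\n",
           commit_survives(has_commit, recovered, N) ? "yes" : "no");

    /* a third failure is one too many: node 2 is lost as well */
    recovered[2] = false;
    printf("survives 3 failures: %s\n",
           commit_survives(has_commit, recovered, N) ? "yes" : "no");

    return 0;
}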