This is where things start getting subtle. If Cassandra's failure detector knows ahead of time that not enough replicas are available, that is the only time we truly fail a write, and nothing will be written anywhere. But if a write starts during the window where a node has failed but we don't know it yet, then it will return TimedOutException.
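In client code the two cases show up as different exceptions, and only the timeout case leaves the outcome unknown. Here is a minimal sketch of the retry-on-timeout idea, using the modern DataStax Python driver (cassandra-driver) rather than the Thrift API this thread is about; the contact point, keyspace, and table are made up for illustration:

from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # contact point: illustration only
session = cluster.connect("demo_ks")    # hypothetical keyspace

# Quorum write: with RF=3, the coordinator needs acks from 2 replicas.
insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",  # hypothetical table
    consistency_level=ConsistencyLevel.QUORUM)

def quorum_write(user_id, name, attempts=3):
    for _ in range(attempts):
        try:
            session.execute(insert, (user_id, name))
            return True      # a quorum acknowledged: the write is complete
        except Unavailable:
            # The failure detector already knew a quorum was down, so the
            # write was rejected outright and nothing was written anywhere.
            raise
        except WriteTimeout:
            # Outcome unknown: the write may have reached 0, 1, or 2
            # replicas.  A plain INSERT/UPDATE is safe to retry (last
            # timestamp wins), so try again until a quorum acknowledges it.
            continue
    return False
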
Getting a TimedOutException is commonly called a "failed write," but that is incorrect -- the write is in progress; we just can't guarantee it's been replicated to the desired number of replicas.

It's important to note that even in this situation, quorum reads + writes provide strong consistency. ("Strong consistency" is defined as "after an update completes, any subsequent access will return the updated value.") Quorum reads will be unable to complete as well until enough machines come back to satisfy the quorum, which is the same number as needed to finish the write. So either the original writer retrying, or the first reader, will cause the write to be completed, after which we're on familiar ground.

Consider the simplest non-trivial quorum, where we are replicating to nodes X, Y, and Z. For the case we are interested in, the original quorum write attempt must time out, so 2 of the 3 replicas (Y and Z) are temporarily unavailable. The write is applied to one replica (X), and the client gets a TimedOutException. The write has not failed, and it has not succeeded; it is in progress (and the client should retry, because it doesn't know for sure that it was applied anywhere at all).

While Y and Z stay down, quorum reads will be rejected. When they come back up*, a read could achieve a quorum as {X, Y}, {X, Z}, or {Y, Z}.

{Y, Z} is the more interesting case because neither has the new write yet. The client will get the old version back, which is fine according to our contract, since the write is still in progress. Read repair will see the new version on X and send it to Y and Z. As soon as it gets to one of those, the original write is complete, and all subsequent reads will see the new version.

{X, Y} and {X, Z} are equivalent: one node with the write, and one without. The read will recognize that X's version needs to be sent to the node that doesn't have it yet, and the write will be complete. This read and all subsequent ones will see the write. (The third replica will be brought up to date asynchronously via read repair.)

*If only one comes back up, then you of course only have the {X, Y} or {X, Z} case.

The important guarantee this gives you is that once one quorum read sees the new value, all others will too. You can't see the newest version, then see an older version on a subsequent read, which is the characteristic of non-strong consistency (and which you can see in Cassandra, temporarily, with lower ConsistencyLevels).

On Tue, Feb 22, 2011 at 10:22 PM, tijoriwala.ritesh
<tijoriwala.rit...@gmail.com> wrote:
>
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X
> (W > X) nodes?
>
> Thanks,
> Ritesh
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com