This is where things start getting subtle.

If Cassandra's failure detector knows ahead of time that not enough
replicas are available to satisfy the requested quorum, that is the
only time we truly fail a write, and nothing will be written anywhere.
But if a write starts during the window where a node has failed but we
don't know it yet, then it will return TimedOutException.
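
To make the distinction concrete, here is a minimal sketch of the two
paths -- a true up-front rejection versus a timeout -- using a toy
in-memory "cluster" in Python rather than a real client.  The names
(quorum_write, UnavailableError, TimedOutError) are mine, not
Cassandra's API:

    class UnavailableError(Exception):
        """Failure detector said no up front; nothing was written anywhere."""

    class TimedOutError(Exception):
        """Write was sent, but quorum acknowledgements never arrived."""

    def quorum_write(replicas, believed_up, actually_up, key, value, quorum=2):
        # Path 1: the failure detector already knows too few replicas are up.
        # This is the only true failure: reject, and nothing is written anywhere.
        if sum(believed_up.values()) < quorum:
            raise UnavailableError("rejected up front; nothing written")
        # Otherwise send the mutation to every replica the detector believes is up.
        acks = 0
        for name, store in replicas.items():
            if believed_up[name] and actually_up[name]:
                store[key] = value          # mutation applied on this replica
                acks += 1
        # Path 2: a replica died inside the detection window.  The write may
        # now be sitting on some replicas, but we can't claim a quorum.
        if acks < quorum:
            raise TimedOutError("applied on %d replica(s); in progress, not failed" % acks)
        return "quorum write complete"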

This is commonly called a "failed write" but that is incorrect -- the
write is in progress, but we can't guarantee it's been replicated to
the desired number of replicas.

It's important to note that even in this situation, quorum reads +
writes provide strong consistency.  ("Strong consistency" is defined
as "after an update completes, any subsequent access will return the
updated value.")  Quorum reads will likewise be unable to complete
until enough machines come back up to satisfy the quorum, which is the
same number needed to finish the write.  So either the original
writer's retry or the first read will cause the write to be completed,
after which we're on familiar ground.
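
The reason this works is just arithmetic: with N=3 and a quorum of 2,
any two 2-node subsets of {X, Y, Z} must share at least one node, so a
quorum read always touches at least one replica that a completed
quorum write reached.  A throwaway check of that overlap (my own
illustration, nothing Cassandra-specific):

    from itertools import combinations

    replicas = {"X", "Y", "Z"}
    quorum = 2   # for N = 3, quorum = N // 2 + 1

    # Every possible write quorum overlaps every possible read quorum.
    for write_q in combinations(replicas, quorum):
        for read_q in combinations(replicas, quorum):
            assert set(write_q) & set(read_q), "quorums must intersect"
    print("every 2-of-3 read quorum intersects every 2-of-3 write quorum")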

Consider the simplest non-trivial quorum, where we are replicating to
nodes X, Y, and Z.  For the case we are interested in, the original
quorum write attempt must time out, so 2 of the 3 replicas (Y and Z)
are temporarily unavailable. The write is applied to one replica (X),
and the client gets a TimedOutException.  The write has not failed
and has not succeeded; it is in progress (and the client should retry,
because it doesn't know for sure that it was applied anywhere at all).
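
Since the client can't tell whether the timed-out write landed on
zero, one, or two replicas, the usual pattern is simply to retry it.
A rough sketch, reusing the TimedOutError from the toy above and a
hypothetical session.write() standing in for whatever client call you
actually use:

    import time

    def write_with_retry(session, key, value, attempts=3, backoff=0.1):
        # TimedOutError means "in progress", not "failed", so retrying is the
        # right response; ideally the retry reuses the same client-supplied
        # timestamp so it can't clobber a newer concurrent write.
        for attempt in range(attempts):
            try:
                return session.write(key, value)
            except TimedOutError:
                time.sleep(backoff * (2 ** attempt))   # simple exponential backoff
        raise TimedOutError("still unsure after %d attempts; treat as in progress" % attempts)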

While Y and Z stay down, quorum reads will be rejected.

When they come back up*, a read could achieve a quorum as {X, Y} or
{X, Z} or {Y, Z}.

{Y, Z} is the more interesting case because neither has the new write
yet.  The client will get the old version back, which is fine
according to our contract since the write is still in progress.  Read
repair will see the new version on X and send it to Y and Z.  As soon
as it gets to one of those, the original write is complete, and all
subsequent reads will see the new version.

{X, Y} and {X, Z} are equivalent: one node with the write, and one
without.  The read will recognize that X's version is the newest and
send it to the node in the quorum that is missing it, and the write
will be complete.  This read and all subsequent ones will see the
write.  (The third replica will be brought up to date asynchronously
via read repair.)

*If only one comes back up, then you of course only have the {X, Y} or
{X, Z} case.
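
Here is the same X/Y/Z scenario as a toy simulation (again, names and
structure are mine, not the real read path): each replica stores a
(value, timestamp) pair, a quorum read returns the newest version it
sees among the contacted nodes, and read repair pushes that version
to any stale node it contacted.  (The real read path can also repair
replicas outside the quorum in the background, which is how the
{Y, Z} read above eventually propagates X's version; this toy only
repairs the nodes it contacted.)

    def quorum_read(replicas, contacted, key):
        # Collect each contacted replica's (value, timestamp); missing = (None, -1).
        versions = {n: replicas[n].get(key, (None, -1)) for n in contacted}
        newest = max(versions.values(), key=lambda v: v[1])
        # Read repair: push the newest version back to any stale contacted replica.
        for n, v in versions.items():
            if v[1] < newest[1]:
                replicas[n][key] = newest
        return newest[0]

    # The scenario above: only X applied the write before the timeout.
    replicas = {"X": {"k": ("new", 2)}, "Y": {"k": ("old", 1)}, "Z": {"k": ("old", 1)}}

    print(quorum_read(replicas, ["Y", "Z"], "k"))   # "old": fine, write still in progress
    print(quorum_read(replicas, ["X", "Y"], "k"))   # "new": Y is repaired, which
                                                    # completes the original write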

The important guarantee this gives you is that once one quorum read
sees the new value, all others will too.  You can't see the newest
version, then see an older version on a subsequent read, which is the
hallmark of non-strong consistency (and which you can see in
Cassandra, temporarily, with lower ConsistencyLevels).
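
Continuing the toy above (reusing quorum_read and the combinations
import from the earlier sketches): after the {X, Y} read repaired Y,
two of the three replicas hold the new version, so every possible
2-node quorum now intersects {X, Y} and must return it -- reads can't
go backwards:

    for q in combinations(["X", "Y", "Z"], 2):
        assert quorum_read(replicas, list(q), "k") == "new"
    print("every quorum read now sees the new value")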

On Tue, Feb 22, 2011 at 10:22 PM, tijoriwala.ritesh
<tijoriwala.rit...@gmail.com> wrote:
>
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X (W >
> X) nodes?
>
> Thanks,
> Ritesh



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
