Have not read the whole thing, just the timeline. A couple of issues... At t8 the request would not start because the CL number of nodes is not available, so the write would not be written to node X. The client would get an UnavailableException. In response it should connect to a new coordinator and try again.
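To make that concrete, the client side of "catch the failure, connect to another coordinator, and try again" looks roughly like this against the Thrift API. The hosts, keyspace, and the doInsert() helper are placeholders I've made up for this mail, not anything from the original timeline:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class QuorumWriteWithFailover {

    // Any node can coordinate the request; these hosts are placeholders.
    private static final String[] HOSTS = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };

    public static void main(String[] args) throws Exception {
        for (String host : HOSTS) {
            TTransport transport = new TFramedTransport(new TSocket(host, 9160));
            try {
                transport.open();
                Cassandra.Client client =
                        new Cassandra.Client(new TBinaryProtocol(transport));
                client.set_keyspace("MyKeyspace");

                doInsert(client, ConsistencyLevel.QUORUM);
                return; // a quorum of replicas acknowledged the write
            } catch (UnavailableException e) {
                // Fewer than CL replicas were up from this coordinator's view;
                // the write was never started, so retrying elsewhere is safe.
            } catch (TimedOutException e) {
                // The coordinator started the write but fewer than CL replicas
                // acknowledged in time. It may still be applied on some
                // replicas, so retry with the same timestamp and value.
            } finally {
                transport.close();
            }
        }
        throw new RuntimeException("write did not complete at QUORUM on any coordinator");
    }

    // Placeholder for the real column build + client.insert()/batch_mutate() call.
    private static void doInsert(Cassandra.Client client, ConsistencyLevel cl)
            throws TException {
        // build the ColumnParent/Column for the row and call client.insert(...)
    }
}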
At t12, if RR is enabled for the request, the read is sent to all UP endpoints for the key. Once CL requests have returned (including the data / non-digest request), the responses are reconciled and a synchronous (to the read request) RR round is initiated. Once all the requests have responded they are compared again and an async RR process is kicked off. So in a worst-case scenario two rounds of RR are possible: one to make sure the correct data is returned for the request, and another to make sure that all UP replicas agree, as it may not be the case that all UP replicas were involved in completing the request.

So as written, at t8 the write would have failed and not been stored on any node, and the write at t7 would not be lost.

I think the crux of this example is the failure mode at t8. I'm assuming Alice is connected to node X:

1) If X is disconnected before the write starts, it will not start any write that requires QUORUM CL. The write fails with an UnavailableException.
2) If X disconnects from the network *after* sending the write messages, and all messages are successfully actioned (including a local write), the request will fail with a TimedOutException as fewer than CL nodes will respond.
3) If X disconnects from the cluster after sending the messages, and the messages it sends are lost but the local write succeeds, the request will fail with a TimedOutException as fewer than CL nodes will respond.

In all these cases the request is considered to have failed. The client should connect to another node and try again. In the case of a timeout, the operation was not completed at the CL you asked for; in the case of unavailable, the operation was not started.

It can look like the RR conflict resolution is a little naive here, but it's less simple when you consider another scenario. The write at t8 failed at QUORUM, and in your deployment the client cannot connect to another node in the cluster, so your code drops the CL down to ONE and gets the write done. You are happy that any nodes in Alice's partition see her write, and that those in Ben's partition see his. When things get back to normal you want the most recent write to be what clients consistently see, not the most popular value (there's a toy sketch of that at the bottom of this mail). The Consistency section here http://wiki.apache.org/cassandra/ArchitectureOverview says the same: it's the most recent value. I tend to think of consistency as all clients getting the same response to the same query.

Not sure if I've made things clearer, feel free to poke holes in my logic :)

Hope that helps.

Aaron

On 23 Apr 2011, at 09:02, Edward Capriolo wrote:

> On Fri, Apr 22, 2011 at 4:31 PM, Milind Parikh <milindpar...@gmail.com> wrote:
>> Is there a chance of getting manual conflict resolution in Cassandra?
>> Please see attachment for why this is important in some cases.
>>
>> Regards
>> Milind
>>
>>
>
> I think about this often. LDAP servers like SunOne have pluggable
> conflict resolution. I could see the read-repair algorithm being
> pluggable.
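PS: the toy sketch I mentioned above, to illustrate "most recent value wins". The class and values here are made up for this mail, not Cassandra internals; the point is that comparing by timestamp gives the same winner no matter which replica's copy you look at first, so clients converge on the same response once the replicas have been repaired.

// Toy illustration of "most recent value wins" reconciliation.
public final class LastWriteWins {

    static final class Version {
        final String value;
        final long timestamp; // client-supplied, typically microseconds since epoch

        Version(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Reconcile two replica copies of the same column: the higher timestamp wins.
    // (Cassandra breaks exact timestamp ties deterministically, not shown here.)
    static Version reconcile(Version a, Version b) {
        return a.timestamp >= b.timestamp ? a : b;
    }

    public static void main(String[] args) {
        Version older = new Version("written during the partition, earlier", 100L);
        Version newer = new Version("written during the partition, later", 200L);

        // The order of comparison does not matter; both calls pick the later write.
        System.out.println(reconcile(older, newer).value);
        System.out.println(reconcile(newer, older).value);
    }
}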