I haven't read the whole thing, just the timeline. A couple of issues...

At t8 the request would not start because the CL number of nodes is not 
available, so the write would not be written to node X. The client would get an 
UnavailableException. In response it should connect to a new coordinator and 
try again. 
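To be concrete about "would not start": I think of it as a pre-flight check on 
the coordinator, something like the sketch below. The class and method names 
are made up for illustration, they are not Cassandra's actual code.

import java.util.List;

class UnavailableException extends Exception {}

class WritePreflight {
    // For RF=3, QUORUM needs replicationFactor/2 + 1 = 2 acknowledgements.
    static int blockFor(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    static void assertWriteCanStart(List<String> liveReplicas, int replicationFactor)
            throws UnavailableException {
        if (liveReplicas.size() < blockFor(replicationFactor)) {
            // Nothing is written anywhere, not even on the coordinator itself.
            throw new UnavailableException();
        }
    }

    public static void main(String[] args) {
        try {
            // Only node X is up out of RF=3: QUORUM needs 2, so the write never starts.
            assertWriteCanStart(List.of("nodeX"), 3);
        } catch (UnavailableException e) {
            System.out.println("write rejected before it started; retry on another coordinator");
        }
    }
}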

At t12, if RR is enabled for the request, the read is sent to all UP endpoints 
for the key. Once CL responses have returned (including the data / non-digest 
response) the responses are reconciled and a synchronous (to the read request) 
RR round is initiated. 

Once all the endpoints have responded, the responses are compared again and an 
async RR process is kicked off. So it seems that in the worst case two rounds of 
RR are possible: one to make sure the correct data is returned for the request, 
and another to make sure that all UP replicas agree, as it may not be the case 
that all UP replicas were involved in completing the request. 
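To make the two possible rounds concrete, this is roughly how I picture the 
coordinator's read path. It is heavily simplified, the names are mine, and the 
second round is inlined here rather than actually running asynchronously.

import java.util.Comparator;
import java.util.List;

class ReadPathSketch {
    record Response(String endpoint, String value, long timestamp) {}

    static String readWithReadRepair(List<Response> allResponses, int consistencyLevel) {
        // Wait for CL responses (the first CL entries stand in for that here).
        List<Response> firstCl = allResponses.subList(0, consistencyLevel);
        Response winner = mostRecent(firstCl);

        // Round 1: synchronous repair of the replicas that answered the request,
        // so the value returned to the client is the correct one.
        repair(firstCl, winner);

        // Round 2: once every UP endpoint has answered, compare again and repair
        // the rest. In a real system this would be async; it is inlined for brevity.
        Response finalWinner = mostRecent(allResponses);
        repair(allResponses, finalWinner);

        return winner.value();
    }

    static Response mostRecent(List<Response> responses) {
        return responses.stream()
                        .max(Comparator.comparingLong(Response::timestamp))
                        .orElseThrow();
    }

    static void repair(List<Response> responses, Response winner) {
        for (Response r : responses)
            if (r.timestamp() < winner.timestamp())
                System.out.println("repairing " + r.endpoint() + " -> " + winner.value());
    }
}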

So as written, at t8 the write would have failed and not been stored on any 
nodes, and the write at t7 would not be lost.  

I think the crux of this example is the failure mode at t8. I'm assuming Alice 
is connected to node X:

1) If X is disconnected before the write starts, it will not start any write 
that requires QUORUM CL. The write fails with an Unavailable error. 
2) If X disconnects from the network *after* sending the write messages, and 
all messages are successfully actioned (including a local write), the request 
will fail with a TimedOutException as < CL nodes will respond. 
3) If X disconnects from the cluster after sending the messages, and the 
messages it sends are lost but the local write succeeds, the request will fail 
with a TimedOutException as < CL nodes will respond. 

In all these cases the request is considered to have failed, and the client 
should connect to another node and try again. In the case of a timeout the 
operation was not completed to the CL you asked for; in the case of unavailable 
the operation was not started.
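The client-side handling I have in mind looks something like this sketch. The 
client interface is hypothetical; in practice the exceptions come from the 
Thrift API and the failover is usually done by your client library.

import java.util.List;

class RetrySketch {
    static class UnavailableException extends Exception {}
    static class TimedOutException extends Exception {}

    interface CassandraClient {
        void insert(String key, String value) throws UnavailableException, TimedOutException;
    }

    static void writeWithFailover(List<CassandraClient> coordinators, String key, String value)
            throws Exception {
        Exception last = null;
        for (CassandraClient coordinator : coordinators) {
            try {
                coordinator.insert(key, value);
                return; // succeeded at the requested CL
            } catch (UnavailableException e) {
                // The operation was never started; safe to retry on another coordinator.
                last = e;
            } catch (TimedOutException e) {
                // The operation may have been applied on fewer than CL replicas;
                // retrying on another coordinator is still the right move.
                last = e;
            }
        }
        throw last; // no coordinator could complete the write at this CL
    }
}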

It can look like the RR conflict resolution is a little naive here, but it's 
less simple when you consider another scenario. The write at t8 failed at 
QUORUM, and in your deployment the client cannot connect to another node in the 
cluster, so your code drops the CL down to ONE and gets the write done. You are 
happy that any nodes in Alice's partition see her write, and that those in Ben's 
partition see his. When things get back to normal you want the most recent 
write to be what clients consistently see, not the most popular value. The 
Consistency section here http://wiki.apache.org/cassandra/ArchitectureOverview 
says the same: it's the most recent value.
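That "most recent" reconciliation is just a comparison on the client-supplied 
column timestamps, something like the sketch below (names made up for 
illustration, not Cassandra's internals).

import java.util.Comparator;
import java.util.List;

class LastWriteWins {
    record Column(String value, long timestampMicros) {}

    // Given conflicting versions of the same column from Alice's and Ben's
    // partitions, the one with the highest timestamp wins, regardless of how
    // many replicas currently hold each version.
    static Column reconcile(List<Column> versions) {
        return versions.stream()
                       .max(Comparator.comparingLong(Column::timestampMicros))
                       .orElseThrow();
    }

    public static void main(String[] args) {
        Column alice = new Column("alice's write", 1_303_500_000_000_000L);
        Column ben   = new Column("ben's write",   1_303_500_000_000_100L); // 100 microseconds later
        System.out.println(reconcile(List.of(alice, ben)).value()); // prints ben's write
    }
}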

I tend to think of Consistency as all clients getting the same response to the 
same query.  
   
Not sure if I've made things clearer, feel free to poke holes in my logic :)

Hope that helps.
Aaron
 

On 23 Apr 2011, at 09:02, Edward Capriolo wrote:

> On Fri, Apr 22, 2011 at 4:31 PM, Milind Parikh <milindpar...@gmail.com> wrote:
>> Is there a chance of getting manual conflict resolution in Cassandra?
>> Please see attachment for why this is important in some cases.
>> 
>> Regards
>> Milind
>> 
>> 
> 
> I think about this often. LDAP servers like SunOne have pluggable
> conflict resolution. I could see the read-repair algorithm being
> pluggable.
