Re: How does Cassandra handle failure during synchronous writes

Ritesh Tijoriwala Wed, 23 Feb 2011 18:01:42 -0800

Thanks Narendra. This is exactly what I was looking for. So the read will
return with old value but at the same time, repair will occur and next reads
will return "new value". But the new value was never written successfully in
the first place as Quorum was never achieved. Isn't that semantically
incorrect?
Taking configuration of cluster size = 3 and RF = 3 as you described with
Read/Write CL = Quorum,


0. Current value for some key K = W.
1. Client writes K = X. Unfortunately, due to intermittent network error,
writes cannot be done successfully on quorum nodes (say node 2 or node 3).
Node 1 has written successfully the value of X for K. Hence, a failure is
returned to the client. If this X gets written for some unknown reason
behind the scene, how the client is suppose to know this? This sounds like a
major design flaw. For e.g. consider withdrawing $500 from account B. If
client is told that withdrawal cannot succeed, he will try again just to
find out that his account is in overdraft state even though the consistency
level he was using is Read/Write Consistent with Quorum.

On step 2 after 1, when client asks for K, I agree that W should be returned
but at the same time, I don't know if silently propagating the failed value
to rest of the nodes is the right behavior.

Thanks,
Ritesh


On Wed, Feb 23, 2011 at 4:47 PM, Narendra Sharma
<narendra.sha...@gmail.com>wrote:

> Remember the simple rule. Column with highest timestamp is the one that
> will be considered correct EVENTUALLY. So consider following case:
>
> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
> QUORUM
> a. QUORUM in this case requires 2 nodes. Write failed with successful write
> to only 1 node say node1.
> b. Read with CL = QUORUM. If read hits node2 and node3, old data will be
> returned with read repair triggered in background. On next read you will get
> the data that was written to node1.
> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
> was written to node1 will be returned.
>
> HTH!
>
> Thanks,
> Naren
>
>
>
> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> Hi Anthony,
>> I am not talking about the case of CL ANY. I am talking about the case
>> where your consistency level is  R + W > N and you want to write to W nodes
>> but only succeed in writing to X ( where X < W) nodes and hence fail the
>> write to the client.
>>
>> thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John <chirayit...@gmail.com>wrote:
>>
>>> Ritesh,
>>>
>>> At CL ANY - if all endpoints are down - a HH is written. And it is a
>>> successful write - not a failed write.
>>>
>>> Now that does not guarantee a READ of the value just written - but that
>>> is a risk that you take when you use the ANY CL!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>>
>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>>> tijoriwala.rit...@gmail.com> wrote:
>>>
>>>> hi Anthony,
>>>> While you stated the facts right, I don't see how it relates to the
>>>> question I ask. Can you elaborate specifically what happens in the case I
>>>> mentioned above to Dave?
>>>>
>>>> thanks,
>>>> Ritesh
>>>>
>>>>
>>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John <chirayit...@gmail.com>wrote:
>>>>
>>>>> Seems to me that the explanations are getting incredibly complicated -
>>>>> while I submit the real issue is not!
>>>>>
>>>>> Salient points here:-
>>>>> 1. To be guaranteed data consistency - the writes and reads have to be
>>>>> at Quorum CL or more
>>>>> 2. Any W/R at lesser CL means that the application has to handle the
>>>>> inconsistency, or has to be tolerant of it
>>>>> 3. Writing at "ANY" CL - a special case - means that writes will always
>>>>> go through (as long as any node is up), even if the destination nodes are
>>>>> not up. This is done via hinted handoff. But this can result in 
>>>>> inconsistent
>>>>> reads, and yes that is a problem but refer to pt-2 above
>>>>> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
>>>>> handle that case where a particular node is down and the write needs to be
>>>>> replicated to it. But this will not cause inconsistent R as the hinted
>>>>> handoff (in this case) only applies after Quorum is met - so a Quorum R is
>>>>> not dependent on the down node being up, and having got the hint.
>>>>>
>>>>> Hope I state this appropriately!
>>>>>
>>>>> HTH,
>>>>>
>>>>> -JA
>>>>>
>>>>>
>>>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>>
>>>>>> > Read repair will probably occur at that point (depending on your
>>>>>> config), which would cause the newest value to propagate to more 
>>>>>> replicas.
>>>>>>
>>>>>> Is the newest value the "quorum" value which means it is the old value
>>>>>> that will be written back to the nodes having "newer non-quorum" value or
>>>>>> the newest value is the real new value? :) If later, than this seems 
>>>>>> kind of
>>>>>> odd to me and how it will be useful to any application. A bug?
>>>>>>
>>>>>> Thanks,
>>>>>> Ritesh
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell <d...@meebo-inc.com>wrote:
>>>>>>
>>>>>>> Ritesh,
>>>>>>>
>>>>>>> You have seen the problem. Clients may read the newly written value
>>>>>>> even though the client performing the write saw it as a failure. When 
>>>>>>> the
>>>>>>> client reads, it will use the correct number of replicas for the chosen 
>>>>>>> CL,
>>>>>>> then return the newest value seen at any replica. This "newest value" 
>>>>>>> could
>>>>>>> be the result of a failed write.
>>>>>>>
>>>>>>> Read repair will probably occur at that point (depending on your
>>>>>>> config), which would cause the newest value to propagate to more 
>>>>>>> replicas.
>>>>>>>
>>>>>>> R+W>N guarantees serial order of operations: any read at CL=R that
>>>>>>> occurs after a write at CL=W will observe the write. I don't think this
>>>>>>> property is relevant to your current question, though.
>>>>>>>
>>>>>>> Cassandra has no mechanism to "roll back" the partial write, other
>>>>>>> than to simply write again. This may also fail.
>>>>>>>
>>>>>>> Best,
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.rit...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Hi Dave,
>>>>>>>> Thanks for your input. In the steps you mention, what happens when
>>>>>>>> client tries to read the value at step 6? Is it possible that the 
>>>>>>>> client may
>>>>>>>> see the new value? My understanding was if R + W > N, then client will 
>>>>>>>> not
>>>>>>>> see the new value as Quorum nodes will not agree on the new value. If 
>>>>>>>> that
>>>>>>>> is the case, then its alright to return failure to the client. 
>>>>>>>> However, if
>>>>>>>> not, then it is difficult to program as after every failure, you as an
>>>>>>>> client are not sure if failure is a pseudo failure with some side 
>>>>>>>> effects or
>>>>>>>> real failure.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ritesh
>>>>>>>>
>>>>>>>> <quote author='Dave Revell'>
>>>>>>>>
>>>>>>>> Ritesh,
>>>>>>>>
>>>>>>>> There is no commit protocol. Writes may be persisted on some
>>>>>>>> replicas even
>>>>>>>> though the quorum fails. Here's a sequence of events that shows the
>>>>>>>> "problem:"
>>>>>>>>
>>>>>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>>>>>> been
>>>>>>>> detected
>>>>>>>> 2. A client writes with consistency > 1
>>>>>>>> 3. The write goes to all replicas, all replicas except R persist the
>>>>>>>> write
>>>>>>>> to disk
>>>>>>>> 4. Replica R never responds
>>>>>>>> 5. Failure is returned to the client, but the new value is still in
>>>>>>>> the
>>>>>>>> cluster, on all replicas except R.
>>>>>>>>
>>>>>>>> Something very similar could happen for CL QUORUM.
>>>>>>>>
>>>>>>>> This is a conscious design decision because a commit protocol would
>>>>>>>> constitute tight coupling between nodes, which goes against the
>>>>>>>> Cassandra
>>>>>>>> philosophy. But unfortunately you do have to write your app with
>>>>>>>> this case
>>>>>>>> in mind.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>>>>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> > I wanted to get details on how does cassandra do synchronous
>>>>>>>> writes to W
>>>>>>>> > replicas (out of N)? Does it do a 2PC? If not, how does it deal
>>>>>>>> with
>>>>>>>> > failures of of nodes before it gets to write to W replicas? If the
>>>>>>>> > orchestrating node cannot write to W nodes successfully, I guess
>>>>>>>> it will
>>>>>>>> > fail the write operation but what happens to the completed writes
>>>>>>>> on X (W
>>>>>>>> > >
>>>>>>>> > X) nodes?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Ritesh
>>>>>>>> > --
>>>>>>>> > View this message in context:
>>>>>>>> >
>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>>>>>> > Sent from the cassandra-u...@incubator.apache.org mailing list
>>>>>>>> archive at
>>>>>>>> > Nabble.com.
>>>>>>>> >
>>>>>>>>
>>>>>>>> </quote>
>>>>>>>> Quoted from:
>>>>>>>>
>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How does Cassandra handle failure during synchronous writes

Reply via email to