> Remember the simple rule. The column with the highest timestamp is the one
> that will be considered correct EVENTUALLY. So consider the following case:
I am sorry, but that will return inconsistent results even at QUORUM.
Timestamps have nothing to do with this; a timestamp is just an
application-provided artifact and could be anything.

> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
> was written to node1 will be returned.

In this case, N1 will be identified as a discrepancy and the change will be
discarded via read repair.

On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma
<narendra.sha...@gmail.com> wrote:

> Remember the simple rule. The column with the highest timestamp is the one
> that will be considered correct EVENTUALLY. So consider the following case:
>
> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
> QUORUM.
> a. QUORUM in this case requires 2 nodes. The write failed, with a
> successful write to only 1 node, say node1.
> b. Read with CL = QUORUM. If the read hits node2 and node3, old data will
> be returned, with read repair triggered in the background. On the next
> read you will get the data that was written to node1.
> c. Read with CL = QUORUM. If the read hits node1 and node2/node3, new data
> that was written to node1 will be returned.
>
> HTH!
>
> Thanks,
> Naren
>
> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala
> <tijoriwala.rit...@gmail.com> wrote:
>
>> Hi Anthony,
>> I am not talking about the case of CL ANY. I am talking about the case
>> where your consistency level is R + W > N and you want to write to W
>> nodes but only succeed in writing to X (where X < W) nodes and hence
>> fail the write to the client.
>>
>> Thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John <chirayit...@gmail.com> wrote:
>>
>>> Ritesh,
>>>
>>> At CL ANY, if all endpoints are down, a hint (hinted handoff) is
>>> written, and it is a successful write, not a failed write.
>>>
>>> Now that does not guarantee a READ of the value just written, but that
>>> is a risk that you take when you use the ANY CL!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala
>>> <tijoriwala.rit...@gmail.com> wrote:
>>>
>>>> Hi Anthony,
>>>> While you state the facts correctly, I don't see how they relate to
>>>> the question I asked. Can you elaborate specifically on what happens
>>>> in the case I mentioned above to Dave?
>>>>
>>>> Thanks,
>>>> Ritesh
>>>>
>>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>
>>>>> It seems to me that the explanations are getting incredibly
>>>>> complicated, while I submit the real issue is not!
>>>>>
>>>>> Salient points here:
>>>>> 1. To be guaranteed data consistency, the writes and reads have to
>>>>> be at QUORUM CL or more.
>>>>> 2. Any W/R at a lesser CL means that the application has to handle
>>>>> the inconsistency, or has to be tolerant of it.
>>>>> 3. Writing at ANY CL (a special case) means that writes will always
>>>>> go through (as long as any node is up), even if the destination
>>>>> nodes are not up. This is done via hinted handoff. But this can
>>>>> result in inconsistent reads, and yes that is a problem, but refer
>>>>> to pt. 2 above.
>>>>> 4. At QUORUM CL R/W, after quorum is met, hinted handoffs are used
>>>>> to handle the case where a particular node is down and the write
>>>>> needs to be replicated to it. But this will not cause inconsistent
>>>>> reads, as the hinted handoff (in this case) only applies after
>>>>> quorum is met, so a quorum read is not dependent on the down node
>>>>> being up and having got the hint.
>>>>>
>>>>> Hope I state this appropriately!
>>>>>
>>>>> HTH,
>>>>>
>>>>> -JA
>>>>>
>>>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala
>>>>> <tijoriwala.rit...@gmail.com> wrote:
>>>>>
>>>>>> > Read repair will probably occur at that point (depending on your
>>>>>> > config), which would cause the newest value to propagate to more
>>>>>> > replicas.
>>>>>>
>>>>>> Is the newest value the "quorum" value, meaning it is the old value
>>>>>> that will be written back to the nodes holding the "newer
>>>>>> non-quorum" value, or is the newest value the real new value? :)
>>>>>> If the latter, then this seems kind of odd to me, and how would it
>>>>>> be useful to any application? A bug?
>>>>>>
>>>>>> Thanks,
>>>>>> Ritesh
>>>>>>
>>>>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell <d...@meebo-inc.com> wrote:
>>>>>>
>>>>>>> Ritesh,
>>>>>>>
>>>>>>> You have seen the problem. Clients may read the newly written
>>>>>>> value even though the client performing the write saw it as a
>>>>>>> failure. When the client reads, it will use the correct number of
>>>>>>> replicas for the chosen CL, then return the newest value seen at
>>>>>>> any replica. This "newest value" could be the result of a failed
>>>>>>> write.
>>>>>>>
>>>>>>> Read repair will probably occur at that point (depending on your
>>>>>>> config), which would cause the newest value to propagate to more
>>>>>>> replicas.
>>>>>>>
>>>>>>> R + W > N guarantees serial order of operations: any read at CL=R
>>>>>>> that occurs after a write at CL=W will observe the write. I don't
>>>>>>> think this property is relevant to your current question, though.
>>>>>>>
>>>>>>> Cassandra has no mechanism to "roll back" the partial write, other
>>>>>>> than to simply write again. This may also fail.
>>>>>>>
>>>>>>> Best,
>>>>>>> Dave
>>>>>>>
>>>>>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.rit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Dave,
>>>>>>>> Thanks for your input. In the steps you mention, what happens
>>>>>>>> when the client tries to read the value at step 6? Is it possible
>>>>>>>> that the client may see the new value? My understanding was that
>>>>>>>> if R + W > N, then the client will not see the new value, as
>>>>>>>> quorum nodes will not agree on the new value. If that is the
>>>>>>>> case, then it's alright to return failure to the client.
>>>>>>>> However, if not, then it is difficult to program, as after every
>>>>>>>> failure you as a client are not sure whether the failure is a
>>>>>>>> pseudo-failure with some side effects or a real failure.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ritesh
>>>>>>>>
>>>>>>>> <quote author='Dave Revell'>
>>>>>>>>
>>>>>>>> Ritesh,
>>>>>>>>
>>>>>>>> There is no commit protocol. Writes may be persisted on some
>>>>>>>> replicas even though the quorum fails. Here's a sequence of
>>>>>>>> events that shows the "problem":
>>>>>>>>
>>>>>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>>>>>> been detected.
>>>>>>>> 2. A client writes with consistency > 1.
>>>>>>>> 3. The write goes to all replicas; all replicas except R persist
>>>>>>>> the write to disk.
>>>>>>>> 4. Replica R never responds.
>>>>>>>> 5. Failure is returned to the client, but the new value is still
>>>>>>>> in the cluster, on all replicas except R.
>>>>>>>>
>>>>>>>> Something very similar could happen for CL QUORUM.
>>>>>>>>
>>>>>>>> This is a conscious design decision because a commit protocol
>>>>>>>> would constitute tight coupling between nodes, which goes against
>>>>>>>> the Cassandra philosophy. But unfortunately you do have to write
>>>>>>>> your app with this case in mind.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh
>>>>>>>> <tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> > Hi,
>>>>>>>> > I wanted to get details on how Cassandra does synchronous
>>>>>>>> > writes to W replicas (out of N). Does it do a 2PC? If not, how
>>>>>>>> > does it deal with failures of nodes before it gets to write to
>>>>>>>> > W replicas?
>>>>>>>> > If the orchestrating node cannot write to W nodes successfully,
>>>>>>>> > I guess it will fail the write operation, but what happens to
>>>>>>>> > the completed writes on X (W > X) nodes?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Ritesh
>>>>>>>> > --
>>>>>>>> > View this message in context:
>>>>>>>> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>>>>>> > Sent from the cassandra-u...@incubator.apache.org mailing list
>>>>>>>> > archive at Nabble.com.
>>>>>>>>
>>>>>>>> </quote>
>>>>>>>> Quoted from:
>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
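[Editor's note] Naren's three QUORUM scenarios (a/b/c above) can be sketched
as a toy simulation. This is purely illustrative and not Cassandra code: the
node names, the (value, timestamp) column tuples, and the resolve() and
quorum_read() helpers are all hypothetical.

```python
# Toy model of the 3-node, RF=3, CL=QUORUM case discussed in the thread.
# Hypothetical helpers -- not Cassandra internals.

def resolve(replies):
    """The coordinator returns the column with the highest timestamp."""
    return max(replies, key=lambda col: col[1])  # col = (value, timestamp)

# a. A QUORUM write "failed": only node1 persisted the new column.
nodes = {
    "node1": ("new", 200),   # the failed write still landed here
    "node2": ("old", 100),
    "node3": ("old", 100),
}

def quorum_read(nodes, chosen):
    """Read from the chosen replicas, then read-repair the losers."""
    winner = resolve([nodes[n] for n in chosen])
    for n in chosen:
        if nodes[n][1] < winner[1]:
            nodes[n] = winner  # push the winning column back
    return winner[0]

# b. Read hits node2 + node3: old data returned, no discrepancy visible.
print(quorum_read(nodes, ["node2", "node3"]))  # -> old

# c. Read hits node1 + node2: node1's column wins on timestamp, and read
#    repair copies it to node2 -- the "failed" write resurfaces.
print(quorum_read(nodes, ["node1", "node2"]))  # -> new
print(nodes["node2"])                          # -> ('new', 200)
```

Note how case (c) matches the objection at the top of the thread: the
highest timestamp wins regardless of whether the write that produced it was
reported as a success.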
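[Editor's note] Dave's sequence of events, a failed write that nevertheless
persists on the surviving replicas with no rollback, can be sketched the
same way, here adapted to CL QUORUM with two of three replicas down so the
quorum fails. The write() helper and replica dicts are hypothetical, not
Cassandra internals.

```python
# Toy sketch of a partial write: the coordinator reports failure when the
# quorum is not met, but there is no commit protocol and no rollback, so
# replicas that did persist the write keep the new value.

RF = 3
QUORUM = RF // 2 + 1  # 2 of 3 replicas must ack

def write(replicas, value, ts):
    """Send the write to every replica; succeed only if a quorum acks."""
    acks = 0
    for r in replicas:
        if r["up"]:
            r["data"] = (value, ts)  # persisted even if quorum later fails
            acks += 1
    return acks >= QUORUM

replicas = [
    {"up": True,  "data": ("old", 100)},
    {"up": False, "data": ("old", 100)},  # failed, not yet detected
    {"up": False, "data": ("old", 100)},  # failed, not yet detected
]

ok = write(replicas, "new", 200)
print(ok)                   # -> False: the client sees a failed write...
print(replicas[0]["data"])  # -> ('new', 200): ...yet the value persists
```

A later read that touches the first replica can therefore return the "new"
value, which is exactly the case the thread says the application must be
written to tolerate.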