> Remember the simple rule. The column with the highest timestamp is the one
> that will be considered correct EVENTUALLY. So consider the following case:
I am sorry, but that will return inconsistent results even at QUORUM.
Timestamps have nothing to do with this; a timestamp is just an
application-provided artifact and could be anything.

> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
> was written to node1 will be returned.

In this case, N1 will be identified as a discrepancy and the change will be
discarded via read repair.

On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma
<narendra.sha...@gmail.com> wrote:

> Remember the simple rule. The column with the highest timestamp is the one
> that will be considered correct EVENTUALLY. So consider the following case:
>
> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
> QUORUM.
> a. QUORUM in this case requires 2 nodes. The write failed, with a
> successful write to only 1 node, say node1.
> b. Read with CL = QUORUM. If the read hits node2 and node3, old data will
> be returned, with read repair triggered in the background. On the next
> read you will get the data that was written to node1.
> c. Read with CL = QUORUM. If the read hits node1 and node2/node3, new data
> that was written to node1 will be returned.
>
> HTH!
>
> Thanks,
> Naren
>
> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala
> <tijoriwala.rit...@gmail.com> wrote:
>
>> Hi Anthony,
>> I am not talking about the case of CL ANY. I am talking about the case
>> where your consistency level is R + W > N and you want to write to W
>> nodes but only succeed in writing to X (where X < W) nodes and hence
>> fail the write to the client.
>>
>> Thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John <chirayit...@gmail.com> wrote:
>>
>>> Ritesh,
>>>
>>> At CL ANY, if all endpoints are down, a hint (hinted handoff) is
>>> written, and it is a successful write, not a failed write.
>>>
>>> Now that does not guarantee a READ of the value just written, but that
>>> is a risk that you take when you use the ANY CL!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala
>>> <tijoriwala.rit...@gmail.com> wrote:
>>>
>>>> Hi Anthony,
>>>> While you state the facts correctly, I don't see how they relate to
>>>> the question I asked. Can you elaborate specifically on what happens
>>>> in the case I mentioned above to Dave?
>>>>
>>>> Thanks,
>>>> Ritesh
>>>>
>>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>
>>>>> It seems to me that the explanations are getting incredibly
>>>>> complicated, while I submit the real issue is not!
>>>>>
>>>>> Salient points here:
>>>>> 1. To be guaranteed data consistency, the writes and reads have to
>>>>> be at QUORUM CL or more.
>>>>> 2. Any W/R at a lesser CL means that the application has to handle
>>>>> the inconsistency, or has to be tolerant of it.
>>>>> 3. Writing at ANY CL (a special case) means that writes will always
>>>>> go through (as long as any node is up), even if the destination
>>>>> nodes are not up. This is done via hinted handoff. But this can
>>>>> result in inconsistent reads, and yes that is a problem, but refer
>>>>> to pt. 2 above.
>>>>> 4. At QUORUM CL R/W, after quorum is met, hinted handoffs are used
>>>>> to handle the case where a particular node is down and the write
>>>>> needs to be replicated to it. But this will not cause inconsistent
>>>>> reads, as the hinted handoff (in this case) only applies after
>>>>> quorum is met, so a quorum read is not dependent on the down node
>>>>> being up and having got the hint.
>>>>>
>>>>> Hope I state this appropriately!
>>>>>
>>>>> HTH,
>>>>>
>>>>> -JA
>>>>>
>>>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala
>>>>> <tijoriwala.rit...@gmail.com> wrote:
>>>>>
>>>>>> > Read repair will probably occur at that point (depending on your
>>>>>> > config), which would cause the newest value to propagate to more
>>>>>> > replicas.
>>>>>>
>>>>>> Is the newest value the "quorum" value, meaning it is the old value
>>>>>> that will be written back to the nodes holding the "newer
>>>>>> non-quorum" value, or is the newest value the real new value? :)
>>>>>> If the latter, then this seems kind of odd to me, and how would it
>>>>>> be useful to any application? A bug?
>>>>>>
>>>>>> Thanks,
>>>>>> Ritesh
>>>>>>
>>>>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell <d...@meebo-inc.com> wrote:
>>>>>>
>>>>>>> Ritesh,
>>>>>>>
>>>>>>> You have seen the problem. Clients may read the newly written
>>>>>>> value even though the client performing the write saw it as a
>>>>>>> failure. When the client reads, it will use the correct number of
>>>>>>> replicas for the chosen CL, then return the newest value seen at
>>>>>>> any replica. This "newest value" could be the result of a failed
>>>>>>> write.
>>>>>>>
>>>>>>> Read repair will probably occur at that point (depending on your
>>>>>>> config), which would cause the newest value to propagate to more
>>>>>>> replicas.
>>>>>>>
>>>>>>> R + W > N guarantees serial order of operations: any read at CL=R
>>>>>>> that occurs after a write at CL=W will observe the write. I don't
>>>>>>> think this property is relevant to your current question, though.
>>>>>>>
>>>>>>> Cassandra has no mechanism to "roll back" the partial write, other
>>>>>>> than to simply write again. This may also fail.
>>>>>>>
>>>>>>> Best,
>>>>>>> Dave
>>>>>>>
>>>>>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.rit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Dave,
>>>>>>>> Thanks for your input. In the steps you mention, what happens
>>>>>>>> when the client tries to read the value at step 6? Is it possible
>>>>>>>> that the client may see the new value? My understanding was that
>>>>>>>> if R + W > N, then the client will not see the new value, as
>>>>>>>> quorum nodes will not agree on the new value. If that is the
>>>>>>>> case, then it's alright to return failure to the client.
>>>>>>>> However, if not, then it is difficult to program, as after every
>>>>>>>> failure you as a client are not sure whether the failure is a
>>>>>>>> pseudo-failure with some side effects or a real failure.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ritesh
>>>>>>>>
>>>>>>>> <quote author='Dave Revell'>
>>>>>>>>
>>>>>>>> Ritesh,
>>>>>>>>
>>>>>>>> There is no commit protocol. Writes may be persisted on some
>>>>>>>> replicas even though the quorum fails. Here's a sequence of
>>>>>>>> events that shows the "problem":
>>>>>>>>
>>>>>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>>>>>> been detected.
>>>>>>>> 2. A client writes with consistency > 1.
>>>>>>>> 3. The write goes to all replicas; all replicas except R persist
>>>>>>>> the write to disk.
>>>>>>>> 4. Replica R never responds.
>>>>>>>> 5. Failure is returned to the client, but the new value is still
>>>>>>>> in the cluster, on all replicas except R.
>>>>>>>>
>>>>>>>> Something very similar could happen for CL QUORUM.
>>>>>>>>
>>>>>>>> This is a conscious design decision because a commit protocol
>>>>>>>> would constitute tight coupling between nodes, which goes against
>>>>>>>> the Cassandra philosophy. But unfortunately you do have to write
>>>>>>>> your app with this case in mind.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh
>>>>>>>> <tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> > Hi,
>>>>>>>> > I wanted to get details on how Cassandra does synchronous
>>>>>>>> > writes to W replicas (out of N). Does it do a 2PC? If not, how
>>>>>>>> > does it deal with failures of nodes before it gets to write to
>>>>>>>> > W replicas?
>>>>>>>> > If the orchestrating node cannot write to W nodes successfully,
>>>>>>>> > I guess it will fail the write operation, but what happens to
>>>>>>>> > the completed writes on X (W > X) nodes?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Ritesh
>>>>>>>> > --
>>>>>>>> > View this message in context:
>>>>>>>> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>>>>>> > Sent from the cassandra-u...@incubator.apache.org mailing list
>>>>>>>> > archive at Nabble.com.
>>>>>>>>
>>>>>>>> </quote>
>>>>>>>> Quoted from:
>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
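[Editor's note] Naren's three QUORUM scenarios (a/b/c above) can be sketched
as a toy simulation. This is purely illustrative and not Cassandra code: the
node names, the (value, timestamp) column tuples, and the resolve() and
quorum_read() helpers are all hypothetical.

```python
# Toy model of the 3-node, RF=3, CL=QUORUM case discussed in the thread.
# Hypothetical helpers -- not Cassandra internals.

def resolve(replies):
    """The coordinator returns the column with the highest timestamp."""
    return max(replies, key=lambda col: col[1])  # col = (value, timestamp)

# a. A QUORUM write "failed": only node1 persisted the new column.
nodes = {
    "node1": ("new", 200),   # the failed write still landed here
    "node2": ("old", 100),
    "node3": ("old", 100),
}

def quorum_read(nodes, chosen):
    """Read from the chosen replicas, then read-repair the losers."""
    winner = resolve([nodes[n] for n in chosen])
    for n in chosen:
        if nodes[n][1] < winner[1]:
            nodes[n] = winner  # push the winning column back
    return winner[0]

# b. Read hits node2 + node3: old data returned, no discrepancy visible.
print(quorum_read(nodes, ["node2", "node3"]))  # -> old

# c. Read hits node1 + node2: node1's column wins on timestamp, and read
#    repair copies it to node2 -- the "failed" write resurfaces.
print(quorum_read(nodes, ["node1", "node2"]))  # -> new
print(nodes["node2"])                          # -> ('new', 200)
```

Note how case (c) matches the objection at the top of the thread: the
highest timestamp wins regardless of whether the write that produced it was
reported as a success.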
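[Editor's note] Dave's sequence of events, a failed write that nevertheless
persists on the surviving replicas with no rollback, can be sketched the
same way, here adapted to CL QUORUM with two of three replicas down so the
quorum fails. The write() helper and replica dicts are hypothetical, not
Cassandra internals.

```python
# Toy sketch of a partial write: the coordinator reports failure when the
# quorum is not met, but there is no commit protocol and no rollback, so
# replicas that did persist the write keep the new value.

RF = 3
QUORUM = RF // 2 + 1  # 2 of 3 replicas must ack

def write(replicas, value, ts):
    """Send the write to every replica; succeed only if a quorum acks."""
    acks = 0
    for r in replicas:
        if r["up"]:
            r["data"] = (value, ts)  # persisted even if quorum later fails
            acks += 1
    return acks >= QUORUM

replicas = [
    {"up": True,  "data": ("old", 100)},
    {"up": False, "data": ("old", 100)},  # failed, not yet detected
    {"up": False, "data": ("old", 100)},  # failed, not yet detected
]

ok = write(replicas, "new", 200)
print(ok)                   # -> False: the client sees a failed write...
print(replicas[0]["data"])  # -> ('new', 200): ...yet the value persists
```

A later read that touches the first replica can therefore return the "new"
value, which is exactly the case the thread says the application must be
written to tolerate.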