>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.

This is correct. It's not that a quorum of nodes agree, it's that a quorum of nodes participate. If a quorum participates in both the write and the read, you are guaranteed that at least one node was involved in both. The Wikipedia definition helps here: "A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group" http://en.wikipedia.org/wiki/Quorum
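As a rough illustration of why W + R > N guarantees that overlap, here is a small Python sketch of the counting argument (not Cassandra code; the names N, W, R and the replica numbering are made up for this example). Every possible write set of size W and read set of size R drawn from N replicas share at least one node:

    from itertools import combinations

    N = 3  # replicas
    W = 2  # write quorum size
    R = 2  # read quorum size

    replicas = range(N)
    overlap_always = all(
        set(w) & set(r)  # truthy only if the two sets share a replica
        for w in combinations(replicas, W)
        for r in combinations(replicas, R)
    )
    print(overlap_always)  # True whenever W + R > N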
It's a two step process: first, do we have enough people to make a decision? Second, following the rules, what was the decision?

In C* the rule is to use the value with the highest timestamp, not the value with the highest number of "votes". The red boxes on this slide are the winning values http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67 (thinking one of my slides in that deck may have been misleading in the past). In Riak the rule is to use Vector Clocks.

So

> I agree that returning val4 is the right thing to do if quorum (two) nodes
> among (node1,node2,node3) have the val4

is incorrect. We return the value with the highest timestamp from the nodes involved in the read. Only one of them needs to have val4. (There is a small sketch of this resolution rule at the end of this mail.)

> The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint.

and

> My intuition behind saying this is because we
> would respond to the client without the replicas having confirmed their
> meeting the consistency requirement.

It is not necessary for the coordinator to wait. Consider an example:

The app has stopped writing to the cluster. For a certain column, nodes 1, 2 and 3 have value:timestamp of bar:2, bar:2 and foo:1 respectively. The last write was a successful CL QUORUM write of bar with timestamp 2, however node 3 did not acknowledge this write for some reason. To make it interesting, say the commit log volume on node 3 is full: mutations are blocking in the commit log queue, so any write on node 3 will time out and fail, but reads are still working. We could imagine this is why node 3 did not commit bar:2.

Some read examples, RR is not active:

1) Client reads from node 4 (a non replica) with CL QUORUM, request goes to nodes 1 and 2. Both agree on bar as the value.

2) Client reads from node 3 with CL QUORUM, request is processed locally and on node 2.
   * There is a digest mismatch.
   * A Row Repair read runs for nodes 2 and 3.
   * The super set resolves to bar:2.
   * Node 3 (the coordinator) queues a delta write locally to write bar:2. No other delta writes are sent.
   * Node 3 returns bar:2 to the client.

3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and bar:2 is returned.

4) Client reads from node 2 at CL QUORUM, read goes to 2 and 3. Roughly the same thing as (2) happens and bar:2 is returned.

5) Client reads from node 1 at CL ONE. Read happens locally only and returns bar:2.

6) Client reads from node 3 at CL ONE. Read happens locally only and returns foo:1.

So:

* A read at CL QUORUM will always return bar:2, even if node 3 only has foo:1 on disk.
* A read at CL ONE will return no value or any previous write.

The delta write from the Row Repair goes to a single node, so R + W > N cannot be applied to it. It can almost be thought of as an internal implementation detail. The delta write from a Digest Mismatch, HH writes, full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE
* Eventually reach a state where reads at any CL return the last write.

They are not used to ensure strong consistency when R + W > N. You could turn those things off and R + W > N would still work.

Hope that helps.
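To make the "highest timestamp wins" rule concrete, here is an illustrative Python sketch of the resolution step for the example above (the replica_data map and resolve() helper are invented for this mail, they are not Cassandra internals):

    # node -> (value, timestamp); matches the example: nodes 1 and 2 hold bar:2, node 3 holds foo:1
    replica_data = {1: ("bar", 2), 2: ("bar", 2), 3: ("foo", 1)}

    def resolve(nodes):
        """Return the (value, timestamp) with the highest timestamp among the nodes read."""
        responses = [replica_data[n] for n in nodes]
        return max(responses, key=lambda vt: vt[1])

    print(resolve([1, 2]))  # example 1: CL QUORUM via nodes 1 and 2 -> ('bar', 2)
    print(resolve([2, 3]))  # examples 2-4: CL QUORUM via nodes 2 and 3 -> ('bar', 2)
    print(resolve([3]))     # example 6: CL ONE on node 3 alone -> ('foo', 1)

Only one of the participating nodes needs to hold the newest value for it to win; the number of "votes" never enters into it.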
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn <shankarp...@gmail.com> wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
>
> I beg to differ here. Any read/write, by definition of quorum, should have
> at least n/2 + 1 replicas that agree on that read/write value. Responding to
> the user with a newer value, even if the write creating the new value hasn't
> completed, cannot guarantee any read consistency > 1.
>
> Hiller, Dean wrote
>>> Kind of an interesting question.
>>>
>>> I think you are saying if a client read resolved only the two nodes as
>>> said in Aaron's email back to the client, and read repair was kicked off
>>> because of the inconsistent values, and the write did not complete yet,
>>> then I guess you would have two nodes go down to lose the value right
>>> after the read, and before the write was finished, such that the client
>>> read a value that was never stored in the database. The odds of two nodes
>>> going out are pretty slim though.
>>> Thanks,
>>> Dean
>
> Bingo! I do understand that the odds of a quorum of nodes going down are low
> and that any subsequent read would achieve a quorum. However, I'm wondering
> what would be the right thing to do here, given that the client has
> specifically asked for a certain consistency on the read and Cassandra
> returns a value that doesn't have that consistency. The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value, without receiving acknowledgement of the
> success of the repair from the replicas for a given consistency constraint.
>
> In order to adhere to the given consistency specification, the row repair
> (due to consistent reads) should repeat the read after issuing a
> "consistency repair" to ensure that the consistency is met. Like Manu
> mentioned, this could of course lead to a number of repeat reads if the
> writes arrive quickly - until the read gets timed out. However, note that we
> would still be honoring the consistency constraint for that read.