>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
> 
This is correct.
It's not that a quorum of nodes agree; it's that a quorum of nodes participate. 
If a quorum participates in both the write and the read, you are guaranteed that one 
node was involved in both. The Wikipedia definition helps here: "A quorum is the 
minimum number of members of a deliberative assembly necessary to conduct the 
business of that group" http://en.wikipedia.org/wiki/Quorum  

It's a two-step process: first, do we have enough people to make a decision? 
Second, following the rules, what was the decision?
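
A quick Python sketch, purely for illustration (node names are made up), of why 
W + R > N guarantees the read participants and write participants overlap:

    # With N replicas, any write set of size W and any read set of size R must
    # share at least one node when W + R > N (pigeonhole principle).
    from itertools import combinations

    replicas = {"node1", "node2", "node3"}   # N = 3
    W = R = 2                                # QUORUM for N = 3

    for write_set in combinations(replicas, W):
        for read_set in combinations(replicas, R):
            # At least one node participated in both the write and the read.
            assert set(write_set) & set(read_set)

    print("every QUORUM read overlaps every QUORUM write")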

In C* the rule is to use the value with the highest timestamp, not the value 
with the highest number of "votes". The red boxes on this slide are the 
winning values: 
http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67  
(I think one of my slides in that deck may have been misleading in the past). 
In Riak the rule is to use Vector Clocks. 
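
A rough sketch of the C* rule (illustration only, not the actual read path code):

    # Among the (value, timestamp) pairs returned by the nodes involved in the
    # read, the value with the highest timestamp wins.
    def resolve(responses):
        """responses: list of (value, timestamp) tuples, one per responding node."""
        return max(responses, key=lambda vt: vt[1])

    # Two nodes return bar:2 and one returns foo:1 -- bar wins on timestamp,
    # not because it has more "votes".
    print(resolve([("bar", 2), ("bar", 2), ("foo", 1)]))   # ('bar', 2)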

So 
> I agree that returning val4 is the right thing to do if quorum (two) nodes
> among (node1,node2,node3) have the val4
is incorrect.
We return the value with the highest timestamp returned from the nodes 
involved in the read. Only one of them needs to have val4. 
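
To make that concrete (the older value is called val3 here purely for 
illustration; it is not from the original example):

    # Only one of the nodes involved in the read needs to return val4; it wins
    # because its timestamp is the highest, not because a quorum of nodes hold it.
    responses = [("val3", 1), ("val3", 1), ("val4", 2)]   # node1, node2, node3
    print(max(responses, key=lambda vt: vt[1]))           # ('val4', 2)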

> The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint. 
and
> My intuition behind saying this is because we
> would respond to the client without the replicas having confirmed their
> meeting the consistency requirement.

It is not necessary for the coordinator to wait. 

Consider an example: the app has stopped writing to the cluster, and for a certain 
column nodes 1, 2 and 3 have value:timestamp bar:2, bar:2 and foo:1 
respectively. The last write was a successful CL QUORUM write of bar with 
timestamp 2. However, node 3 did not acknowledge this write for some reason. 

To make it interesting, the commit log volume on node 3 is full. Mutations are 
blocking in the commit log queue, so any write on node 3 will time out and fail, 
but reads are still working. We could imagine this is why node 3 did not commit 
bar:2. 
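
For concreteness, the starting state as plain data (a hypothetical sketch, not 
how C* actually stores it):

    # value:timestamp held by each replica after the last successful
    # CL QUORUM write of bar at timestamp 2.
    state = {
        "node1": ("bar", 2),
        "node2": ("bar", 2),
        "node3": ("foo", 1),   # never applied bar:2, its commit log is blocked
    }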

Some read examples, RR is not active (a toy simulation sketch follows the 
summary below):

1) Client reads from node 4 (a non-replica) at CL QUORUM, request goes to 
nodes 1 and 2. Both agree on bar:2 as the value. 
2) Client reads from node 3 with CL QUORUM, request is processed locally and on 
node 2.
        * There is a digest mismatch
        * A Row Repair read runs against nodes 2 and 3.
        * The super set resolves to bar:2
        * Node 3 (the coordinator) queues a delta write locally to write bar:2. 
No other delta writes are sent.
        * Node 3 returns bar:2 to the client
3) Client reads from node 3 again at CL QUORUM. The same thing as (2) happens 
(the queued delta write still has not committed on node 3) and bar:2 is returned. 
4) Client reads from node 2 at CL QUORUM, read goes to nodes 2 and 3. Roughly the 
same thing as (2) happens and bar:2 is returned. 
5) Client reads from node 1 at CL ONE. Read happens locally only and returns 
bar:2.
6) Client reads from node 3 at CL ONE. Read happens locally only and returns 
foo:1.

So:
* A read at CL QUORUM will always return bar:2, even if node 3 only has foo:1 on 
disk. 
* A read at CL ONE may return no value or any previous write, depending on the 
replica that is read.
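
Here is a toy simulation of the reads above (again just a sketch, not how the 
read path is implemented; the delta write is modelled as a best-effort write 
that node 3 cannot apply):

    # Starting state from the example; node 3 cannot commit writes.
    state = {
        "node1": ("bar", 2),
        "node2": ("bar", 2),
        "node3": ("foo", 1),
    }
    commitlog_blocked = {"node3"}

    def resolve(responses):
        # C* rule: the highest timestamp wins.
        return max(responses.values(), key=lambda vt: vt[1])

    def quorum_read(nodes):
        """Read from 2 of the 3 replicas (QUORUM for N=3) and resolve."""
        responses = {n: state[n] for n in nodes}
        winner = resolve(responses)
        # Delta write from the Row Repair: sent only to the out-of-date
        # replica(s) in the read, does not block the response, and may never
        # be applied (node 3 here, because its commit log is blocked).
        for n, vt in responses.items():
            if vt != winner and n not in commitlog_blocked:
                state[n] = winner
        return winner

    print(quorum_read(["node1", "node2"]))   # ('bar', 2) -- example (1)
    print(quorum_read(["node2", "node3"]))   # ('bar', 2) -- examples (2) to (4)
    print(state["node3"])                    # ('foo', 1) -- CL ONE on node 3, example (6)

Note that the simulation never manages to repair node 3, yet every QUORUM read 
still returns bar:2, which is the point below about R + W > N not depending on 
the repair writes.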

The delta write from the Row Repair goes to a single node, so R + W > N cannot 
be applied to it. It can almost be thought of as an internal implementation 
detail. The delta writes from a Digest Mismatch, HH (Hinted Handoff) writes, 
full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE
* Eventually reach a state where reads at any CL return the last write. 

They are not used to ensure strong consistency when R + W > N. You could turn 
those things off and R + W > N would still work. 
 
Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn <shankarp...@gmail.com> wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
> 
> I beg to differ here. Any read/write, by definition of quorum, should have
> at least n/2 + 1 replicas that agree on that read/write value. Responding to
> the user with a newer value, even if the write creating the new value hasn't
> completed cannot guarantee any read consistency > 1. 
> 
> 
> Hiller, Dean wrote
>>> Kind of an interesting question
>>> 
>>> I think you are saying if a client read resolved only the two nodes as
>>> said in Aaron's email back to the client and read -repair was kicked off
>>> because of the inconsistent values and the write did not complete yet and
>>> I guess you would have two nodes go down to lose the value right after
>>> the
>>> read, and before write was finished such that the client read a value
>>> that
>>> was never stored in the database.  The odds of two nodes going out are
>>> pretty slim though.
>>> Thanks,
>>> Dean
> 
> Bingo! I do understand that the odds of a quorum nodes going down are low
> and that any subsequent read would achieve a quorum. However, I'm wondering
> what would be the right thing to do here, given that the client has
> particularly asked for a certain consistency on the read and cassandra
> returns a value that doesn't have the consistency. The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment it issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint. 
> 
> In order to adhere to the given consistency specification, the row repair
> (due to consistent reads) should repeat the read after issuing a
> "consistency repair" to ensure if the consistency is met. Like Manu
> mentioned, this could of course lead to a number of repeat reads if the
> writes arrive quickly - until the read gets timed out. However, note that we
> would still be honoring the consistency constraint for that read. 
> 
> 
> 
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
