>> this is actually what is happening, how is it possible to ever have a
>> node-failure resilient Cassandra cluster?
Background http://thelastpickle.com/2011/06/13/Down-For-Me/

> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.
Take a look at the nodetool getendpoints command. It will tell you which nodes 
a key is stored on. Though with RF 3 and N=3 it's all of them :)
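For example (the keyspace, column family and row key here are just placeholders
for your own):
nodetool -h localhost getendpoints <keyspace> <column_family> <row_key>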

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/12/2012, at 12:54 AM, Tristan Seligmann <mithra...@mithrandi.net> wrote:

> On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos
> <vasileiosvlac...@gmail.com> wrote:
>> Initially we were thinking the same thing, that an explanation would
>> be that the "wrong" node could be down, but then isn't this something
>> that hinted handoff sorts out?
> 
> If a node is partitioned from the rest of the cluster (ie. the node
> goes down, but later comes back with the same data it had), it will
> obviously be out of date with regard to any writes that happened while
> it was down. Anti-entropy (nodetool repair) and read repair will
> repair this inconsistency over time, but not right away; hinted
> handoff is an optimization that will allow the node to become mostly
> consistent right away on rejoining the cluster, as the nodes will have
> stored hints for it while it was down, and will send them to it once the
> node is back up.
> 
> However, the important thing to note is that this is an
> /optimization/. If a replica is down, then it will not be able to
> satisfy any consistency level requirements, except for the special
> case of CL=ANY. If you use another CL like TWO, then two actual
> replica nodes must be up for the ranges you are writing to; a node
> that is not a replica but merely stores a hint does not count.
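> 
> As an illustration of setting the consistency level per request: with the
> DataStax Python driver (the keyspace 'ks', table 'cf' and values below are
> placeholders, not from your schema) a write at TWO would look roughly like:
> 
>     from cassandra import ConsistencyLevel
>     from cassandra.cluster import Cluster
>     from cassandra.query import SimpleStatement
> 
>     cluster = Cluster(['127.0.0.1'])
>     session = cluster.connect('ks')  # placeholder keyspace
> 
>     # CL=TWO: the coordinator needs acks from two live replica nodes;
>     # a hint stored on a non-replica does not count towards this.
>     insert = SimpleStatement(
>         "INSERT INTO cf (key, value) VALUES (%s, %s)",
>         consistency_level=ConsistencyLevel.TWO)
>     session.execute(insert, ('some_key', 'some_value'))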
> 
>> Test 2 (2/3 Nodes UP):
>> CL  :    ANY    ONE    TWO    THREE    QUORUM    ALL
>> RF 2:    OK     OK     x      x        OK        x
> 
> For this test, QUORUM = RF/2+1 = 2/2+1 = 2. A write at QUORUM should
> have succeeded if both of the replicas for the range were up, but if
> one of the replicas for the range was the downed node, then it would
> have failed. I think you can use the 'nodetool getendpoints' command
> to list the nodes that are replicas for the given row key.
> 
> I am unable to explain how a write at QUORUM could succeed if a write
> at TWO for the same key failed.
> 
>> Test 3 (2/3 Nodes UP):
>> CL  :    ANY    ONE    TWO    THREE    QUORUM    ALL
>> RF 3:    OK     OK     x      x        OK        OK
> 
> For this test, QUORUM = RF/2+1 = 3/2+1 = 2. Again, I am unable to
> explain why a write at QUORUM would succeed if a write at TWO failed,
> and I am also unable to explain how a write at ALL could succeed, for
> any key, if one of the nodes is down.
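> 
> The quorum arithmetic itself is easy to sanity-check (a quick sketch in
> plain Python; note the integer division):
> 
>     def quorum(rf):
>         # QUORUM = floor(RF / 2) + 1
>         return rf // 2 + 1
> 
>     print(quorum(2))  # 2: with RF=2, QUORUM needs both replicas up
>     print(quorum(3))  # 2: with RF=3, QUORUM tolerates one replica down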
> 
> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.
> 
>> Furthermore, with regards to being "unlucky" with the "wrong node": if
>> this is actually what is happening, how is it possible to ever have a
>> node-failure resilient Cassandra cluster? My understanding of this
>> implies that even with 100 nodes, roughly 1 in 100 writes would fail
>> until the node is replaced/repaired.
> 
> RF is the important number when considering fault-tolerance in your
> cluster, not the number of nodes. If RF=3, and you read and write at
> quorum, then you can tolerate one node being down in the range you are
> operating on. If you need to be able to tolerate two nodes being down,
> RF=5 and QUORUM would work. In other words, if you need better fault
> tolerance, RF is what you need to increase; if you need better
> performance, or you need to store more data, then N (number of nodes
> in cluster) is what you need to increase. Of course, N must be at
> least as big as RF...
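> 
> Put the other way around (same arithmetic as above, assuming QUORUM reads
> and writes):
> 
>     def replicas_down_tolerated(rf):
>         # replicas that may be down while QUORUM still succeeds
>         return rf - (rf // 2 + 1)
> 
>     print(replicas_down_tolerated(3))  # 1
>     print(replicas_down_tolerated(5))  # 2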
> --
> mithrandi, i Ainil en-Balandor, a faer Ambar
