>> this is actually what is happening, how is it possible to ever have
>> a node-failure resilient cassandra cluster?

Background: http://thelastpickle.com/2011/06/13/Down-For-Me/
> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.

Take a look at the nodetool getendpoints command. It will tell you
which nodes a key is stored on. Though for RF 3 and N 3 it's all of
them :)
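For example (MyKeyspace, MyCF and mykey below are placeholder names,
and the addresses in the output are made up):

    $ nodetool -h localhost getendpoints MyKeyspace MyCF mykey
    10.0.0.1
    10.0.0.2
    10.0.0.3

If the node you took down shows up in that list for your test key, that
would account for failed writes at any consistency level that needs
every replica in the list.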
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/12/2012, at 12:54 AM, Tristan Seligmann <mithra...@mithrandi.net> wrote:

> On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos
> <vasileiosvlac...@gmail.com> wrote:
>> Initially we were thinking the same thing, that an explanation would
>> be that the "wrong" node could be down, but then isn't this something
>> that hinted handoff sorts out?
>
> If a node is partitioned from the rest of the cluster (i.e. the node
> goes down, but later comes back with the same data it had), it will
> obviously be out of date with regard to any writes that happened while
> it was down. Anti-entropy (nodetool repair) and read repair will
> repair this inconsistency over time, but not right away; hinted
> handoff is an optimization that allows the node to become mostly
> consistent right away on rejoining the cluster, as the other nodes
> will have stored hints for it while it was down and will send them to
> it once it is back up.
>
> However, the important thing to note is that this is an
> /optimization/. If a replica is down, it cannot help satisfy any
> consistency level requirement, except for the special case of CL=ANY.
> If you use another CL like TWO, then two actual replica nodes must be
> up for the ranges you are writing to; a node that is not a replica but
> will write a hint does not count.
>
>> Test 2 (2/3 Nodes UP):
>> CL  : ANY  ONE  TWO  THREE  QUORUM  ALL
>> RF 2: OK   OK   x    x      OK      x
>
> For this test, QUORUM = RF/2 + 1 = 2/2 + 1 = 2. A write at QUORUM
> should have succeeded if both of the replicas for the range were up,
> but if one of the replicas for the range was the downed node, then it
> would have failed. I think you can use the 'nodetool getendpoints'
> command to list the nodes that are replicas for the given row key.
>
> I am unable to explain how a write at QUORUM could succeed if a write
> at TWO for the same key failed.
>
>> Test 3 (2/3 Nodes UP):
>> CL  : ANY  ONE  TWO  THREE  QUORUM  ALL
>> RF 3: OK   OK   x    x      OK      OK
>
> For this test, QUORUM = RF/2 + 1 = 3/2 + 1 = 2 (rounding down). Again,
> I am unable to explain why a write at QUORUM would succeed if a write
> at TWO failed, and I am also unable to explain how a write at ALL
> could succeed, for any key, if one of the nodes is down.
>
> I would suggest double-checking your test setup; also, make sure you
> use the same row keys every time (if this is not already the case) so
> that you have repeatable results.
>
>> Furthermore, with regards to being "unlucky" with the "wrong node":
>> if this is actually what is happening, how is it possible to ever
>> have a node-failure resilient cassandra cluster? My understanding of
>> this implies that even with 100 nodes, 1 in every 100 writes would
>> fail until the node is replaced/repaired.
>
> RF is the important number when considering fault-tolerance in your
> cluster, not the number of nodes. If RF=3, and you read and write at
> QUORUM, then you can tolerate one node being down in the range you
> are operating on. If you need to be able to tolerate two nodes being
> down, RF=5 and QUORUM would work.
>
> In other words, if you need better fault tolerance, RF is what you
> need to increase; if you need better performance, or you need to
> store more data, then N (the number of nodes in the cluster) is what
> you need to increase. Of course, N must be at least as big as RF...
> --
> mithrandi, i Ainil en-Balandor, a faer Ambar
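To make the quorum arithmetic above concrete, here is a minimal Python
sketch, assuming only the QUORUM = RF/2 + 1 (rounded down) rule
described in the thread:

    def quorum(rf):
        # A QUORUM read or write needs floor(RF / 2) + 1 live replicas.
        return rf // 2 + 1

    def tolerable_failures(rf):
        # Replicas of a given range that can be down while reads and
        # writes at QUORUM still succeed.
        return rf - quorum(rf)

    for rf in (2, 3, 5):
        print("RF=%d: QUORUM=%d, tolerates %d replica(s) down"
              % (rf, quorum(rf), tolerable_failures(rf)))

    # RF=2: QUORUM=2, tolerates 0 replica(s) down
    # RF=3: QUORUM=2, tolerates 1 replica(s) down
    # RF=5: QUORUM=3, tolerates 2 replica(s) down

This is why increasing RF (together with reading and writing at QUORUM)
is what buys fault tolerance, while increasing N buys capacity and
performance.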