On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos
<vasileiosvlac...@gmail.com> wrote:
> Initially we were thinking the same thing, that an explanation would
> be that the "wrong" node could be down, but then isn't this something
> that hinted handoff sorts out?
If a node is partitioned from the rest of the cluster (i.e. the node
goes down, but later comes back with the same data it had), it will
obviously be out of date with regard to any writes that happened while
it was down. Anti-entropy (nodetool repair) and read repair will repair
this inconsistency over time, but not right away; hinted handoff is an
optimization that allows the node to become mostly consistent right
away on rejoining the cluster, as the other nodes will have stored
hints for it while it was down, and will send them to it once it is
back up.

However, the important thing to note is that this is an /optimization/.
If a replica is down, then it will not be able to satisfy any
consistency level requirements, except for the special case of CL=ANY.
If you use another CL like TWO, then two actual replica nodes must be
up for the ranges you are writing to; a node that is not a replica but
will write a hint does not count.

> Test 2 (2/3 Nodes UP):
> CL  :  ANY  ONE  TWO  THREE  QUORUM  ALL
> RF 2:  OK   OK   x    x      OK      x

For this test, QUORUM = RF/2+1 = 2/2+1 = 2 (integer division). A write
at QUORUM should have succeeded if both of the replicas for the range
were up, but if one of the replicas for the range was the downed node,
then it would have failed. I think you can use the 'nodetool
getendpoints' command to list the nodes that are replicas for a given
row key. I am unable to explain how a write at QUORUM could succeed if
a write at TWO for the same key failed.

> Test 3 (2/3 Nodes UP):
> CL  :  ANY  ONE  TWO  THREE  QUORUM  ALL
> RF 3:  OK   OK   x    x      OK      OK

For this test, QUORUM = RF/2+1 = 3/2+1 = 2. Again, I am unable to
explain why a write at QUORUM would succeed if a write at TWO failed,
and I am also unable to explain how a write at ALL could succeed, for
any key, if one of the nodes is down. I would suggest double-checking
your test setup; also, make sure you use the same row keys every time
(if this is not already the case) so that you have repeatable results.
There is a quick sketch of such a test at the end of this mail.

> Furthermore, with regards to being "unlucky" with the "wrong node", if
> this is actually what is happening, how is it possible to ever have a
> node-failure resilient cassandra cluster? My understanding of this
> implies that even with 100 nodes, every 1/100 writes would fail until
> the node is replaced/repaired.

RF is the important number when considering fault tolerance in your
cluster, not the number of nodes. If RF=3, and you read and write at
QUORUM, then you can tolerate one node being down in the range you are
operating on. If you need to be able to tolerate two nodes being down,
RF=5 and QUORUM would work. In other words, if you need better fault
tolerance, RF is what you need to increase; if you need better
performance, or you need to store more data, then N (the number of
nodes in the cluster) is what you need to increase. Of course, N must
be at least as big as RF... (the arithmetic is spelled out in the
second sketch at the end of this mail).

--
mithrandi, i Ainil en-Balandor, a faer Ambar
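P.S. Since I suggested re-running the tests, here is a minimal sketch
of a repeatable version. It assumes the DataStax Python driver (pip
install cassandra-driver), which may be newer than what you are
running; the contact point, keyspace and table names are placeholders
for your own:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.0.1'])        # any live node in the cluster
    session = cluster.connect('test_ks')   # hypothetical keyspace

    LEVELS = [ConsistencyLevel.ANY, ConsistencyLevel.ONE,
              ConsistencyLevel.TWO, ConsistencyLevel.THREE,
              ConsistencyLevel.QUORUM, ConsistencyLevel.ALL]

    for cl in LEVELS:
        stmt = SimpleStatement(
            "INSERT INTO test_cf (key, value) VALUES (%s, %s)",
            consistency_level=cl)
        try:
            # Use the same row key on every run, so each run hits the
            # same set of replicas.
            session.execute(stmt, ('fixed-key', 'v'))
            print(ConsistencyLevel.value_to_name[cl], 'OK')
        except Exception as exc:    # e.g. Unavailable, WriteTimeout
            print(ConsistencyLevel.value_to_name[cl], 'x',
                  type(exc).__name__)

Running this once with all nodes up and once with a node down gives you
the whole OK/x table from a single script, and 'nodetool getendpoints'
on the same key tells you whether the downed node was one of its
replicas.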
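P.P.S. The QUORUM arithmetic above, spelled out in plain Python
(integer division, as in RF/2+1):

    def quorum(rf):
        # QUORUM = RF/2 + 1, using integer division
        return rf // 2 + 1

    for rf in (2, 3, 5):
        q = quorum(rf)
        # A QUORUM write survives RF - quorum(RF) replicas being down.
        print('RF=%d: QUORUM=%d, tolerates %d replica(s) down'
              % (rf, q, rf - q))

This prints 0 tolerated failures for RF=2, 1 for RF=3 and 2 for RF=5,
which is why increasing RF, rather than N, is what buys you fault
tolerance.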