We've used 32 tokens pre-3.0. The results have been mixed because of the 
randomness: there's going to be some imbalance, and how much depends on luck, 
unfortunately.
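
For anyone curious what "depends on luck" looks like, here's a rough sketch 
(plain Python, not anything from Cassandra itself) that assigns random tokens 
the way pre-3.0 allocation does and reports the resulting primary-ownership 
spread:

    import random
    import statistics

    # Toy model of random token allocation (pre-3.0 behaviour). Replication
    # and racks are ignored; this only illustrates the imbalance "luck" factor.
    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3Partitioner token range
    RING = 2**64

    def ownership(num_nodes=50, vnodes=32, seed=None):
        rnd = random.Random(seed)
        tokens = sorted(
            (rnd.randint(MIN_TOKEN, MAX_TOKEN), node)
            for node in range(num_nodes)
            for _ in range(vnodes)
        )
        owned = [0.0] * num_nodes
        for i, (token, node) in enumerate(tokens):
            prev = tokens[i - 1][0] if i else tokens[-1][0] - RING
            owned[node] += (token - prev) / RING   # size of this primary range
        return owned

    for vnodes in (4, 32, 256):
        owned = ownership(vnodes=vnodes, seed=42)
        print(f"vnodes={vnodes:3d}  min={min(owned):.3%}  "
              f"max={max(owned):.3%}  stdev={statistics.stdev(owned):.3%}")

With 50 nodes the ideal share is 2% each; more tokens per node narrows the 
spread, fewer widens it, and the exact numbers change with the seed.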

I'm interested to hear your results using 4 tokens; would you mind letting the 
ML know your experience once you've done it?

Jon

> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
> 
> Agree with you, Jon.
> Actually, this cluster was configured by my 'predecessor' and [fortunately 
> for him] we've never met :)
> We're using version 2.1.15 and can't upgrade because of the legacy Netflix 
> Astyanax client we still use.
> 
> Earlier in the thread Alex mentioned that setting vnodes to a value lower 
> than 256 is recommended only for C* 3.0+ (the token allocation algorithm was 
> improved in C* 3.0).
> 
> Jon,
> Do you have positive experience setting up clusters with vnodes < 256 on C* 
> 2.1?
> 
> vnodes=32 also seems too high to me (we'd need many more than 32 servers per 
> AZ to get a 'reliable' cluster).
> vnodes=4 looks like a better trade-off between HA and balancing.
> 
> Thanks, 
> Kyrill
> From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad 
> <j...@jonhaddad.com>
> Sent: Tuesday, January 16, 2018 6:44:53 PM
> To: user
> Subject: Re: vnodes: high availability
>  
> While all the token math is helpful, I have to also call out the elephant in 
> the room:
> 
> You have not correctly configured Cassandra for production.
> 
> If you had used the correct endpoint snitch & network topology strategy, you 
> would be able to withstand the complete failure of an entire availability 
> zone at QUORUM, or two if you queried at CL=ONE. 
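> 
> To illustrate (toy model only, not Cassandra code; it assumes 
> NetworkTopologyStrategy puts one replica per rack when you have at least RF 
> racks, and that rack-unaware placement just takes the next nodes in ring 
> order):
> 
>     from itertools import islice, cycle
> 
>     # 9 nodes in 3 racks (think AZs), RF=3, QUORUM needs 2 live replicas.
>     racks = {f"n{i}": f"rack{i // 3}" for i in range(9)}
>     ring = sorted(racks)                       # pretend this is ring order
> 
>     def replicas_simple(idx, rf=3):
>         # rack-unaware: just the next rf nodes in ring order
>         return list(islice(cycle(ring), idx, idx + rf))
> 
>     def replicas_rack_aware(idx, rf=3):
>         # skip nodes whose rack already holds a replica
>         out, seen = [], set()
>         for node in islice(cycle(ring), idx, idx + len(ring)):
>             if racks[node] not in seen:
>                 out.append(node)
>                 seen.add(racks[node])
>             if len(out) == rf:
>                 break
>         return out
> 
>     def survives_any_rack_loss(placement):
>         for dead_rack in ("rack0", "rack1", "rack2"):
>             for idx in range(len(ring)):
>                 live = [n for n in placement(idx) if racks[n] != dead_rack]
>                 if len(live) < 2:              # QUORUM of RF=3 lost
>                     return False
>         return True
> 
>     print("rack-unaware survives losing an AZ:", survives_any_rack_loss(replicas_simple))
>     print("rack-aware survives losing an AZ:  ", survives_any_rack_loss(replicas_rack_aware))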
> 
> You are correct about 256 tokens causing issues, it’s one of the reasons why 
> we recommend 32.  I’m curious how things behave going as low as 4, 
> personally, but I haven’t done the math / tested it yet.
> 
> 
> 
>> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>> 
>> ...to me it sounds like C* isn't as highly available by design as it's 
>> claimed to be.
>> More nodes in a cluster mean a higher probability of simultaneous node 
>> failures.
>> And from a high-availability standpoint, it looks like the situation is made 
>> even worse by the recommended setting vnodes=256.
>> 
>> I need to do some math to get numbers/formulas, but right now the situation 
>> doesn't look promising.
>> If somebody from the C* developers/architects is reading this message, I'd 
>> be grateful for links to the reliability calculations these decisions were 
>> based on.
>> 
>> Regards, 
>> Kyrill
>> From: kurt greaves <k...@instaclustr.com>
>> Sent: Tuesday, January 16, 2018 2:16:34 AM
>> To: User
>> Subject: Re: vnodes: high availability
>>  
>> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
>> intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>> 
>> If node A goes down, all 256 of its ranges go down, and considering there 
>> are only 49 other nodes, each with 256 vnodes, it's very likely that every 
>> node is responsible for some range A was also responsible for. I'm not sure 
>> what the exact math is, but think of it this way: if any of a node's 256 
>> token ranges falls within the next RF-1 or previous RF-1 token ranges of one 
>> of node A's ranges on the ring, those token ranges will be down at QUORUM.
>> 
>> Because token range assignment just uses rand() under the hood, I'm sure you 
>> could prove that it's always going to be the case that any 2 nodes going 
>> down result in a loss of QUORUM for some token range.
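>> 
>> A quick-and-dirty simulation backs that up (my own rough sketch: random 
>> tokens, SimpleStrategy-style placement on the next distinct nodes in ring 
>> order, no racks; two down nodes break QUORUM for some range exactly when 
>> they appear together in at least one RF=3 replica set):
>> 
>>     import random
>>     from itertools import combinations
>> 
>>     def quorum_safe_pairs(num_nodes=50, vnodes=256, rf=3, seed=0):
>>         rnd = random.Random(seed)
>>         ring = sorted((rnd.random(), n)
>>                       for n in range(num_nodes) for _ in range(vnodes))
>>         owners = [n for _, n in ring]
>>         coreplica = {n: set() for n in range(num_nodes)}
>>         for i, primary in enumerate(owners):
>>             # walk the ring until we have rf distinct replica nodes
>>             replicas, j = [primary], i
>>             while len(replicas) < rf:
>>                 j += 1
>>                 nxt = owners[j % len(owners)]
>>                 if nxt not in replicas:
>>                     replicas.append(nxt)
>>             for a, b in combinations(replicas, 2):
>>                 coreplica[a].add(b)
>>                 coreplica[b].add(a)
>>         # node pairs that never share a replica set: safe to lose together
>>         return [(a, b) for a, b in combinations(range(num_nodes), 2)
>>                 if b not in coreplica[a]]
>> 
>>     print("pairs safe to lose at QUORUM:", len(quorum_safe_pairs()))
>> 
>> With 50 nodes and 256 vnodes I'd expect this to print 0 essentially every 
>> time; with far fewer vnodes some safe pairs should start to appear.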
>> 
>> On 15 January 2018 at 19:59, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>> Thanks Alexander!
>> 
>> I don't have an MS in math either, unfortunately :)
>> 
>> Not sure, but it seems to me that the probability of 2/49 in your 
>> explanation doesn't take into account that vnode endpoints are almost evenly 
>> distributed across all nodes (at least that's what I can see from "nodetool 
>> ring" output).
>> 
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>> Of course this vnodes illustration is a theoretical one, but there are no 2 
>> nodes on that diagram that can be switched off without losing a key range 
>> (at CL=QUORUM).
>> 
>> That's because vnodes_per_node=8 > Nnodes=6.
>> As far as I understand, the situation gets worse as the 
>> vnodes_per_node/Nnodes ratio increases.
>> Please correct me if I'm wrong.
>> 
>> How would the situation differ from this DataStax example if we had a 
>> real-life 6-node cluster with 8 vnodes on each node?
>> 
>> Regards, 
>> Kyrill
>> 
>> From: Alexander Dejanovski <a...@thelastpickle.com>
>> Sent: Monday, January 15, 2018 8:14:21 PM
>> 
>> To: user@cassandra.apache.org
>> Subject: Re: vnodes: high availability
>>  
>> I was corrected off list that the odds of losing data when 2 nodes are down 
>> isn't dependent on the number of vnodes, but only on the number of nodes.
>> The more vnodes, the smaller the chunks of data you may lose, and vice versa.
>> I officially suck at statistics, as expected :)
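>> 
>> For what it's worth, redoing my earlier arithmetic with the complement rule 
>> instead of a straight sum (still assuming the 256 placements are 
>> independent, which isn't exact):
>> 
>>     # back-of-the-envelope only; independence between vnodes is assumed
>>     p_single = 2 / 49                    # one vnode of A replicated on B
>>     p_none = (1 - p_single) ** 256       # none of A's 256 vnodes land on B
>>     print(f"naive sum      : {256 * p_single:.2f}")   # > 1, not a probability
>>     print(f"P(at least one): {1 - p_none:.5f}")       # ~0.99998
>> 
>> So with 256 vnodes, any two nodes almost certainly share at least one 
>> replica set.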
>> 
>> On Mon, Jan 15, 2018 at 5:55 PM, Alexander Dejanovski <a...@thelastpickle.com> 
>> wrote:
>> Hi Kyrylo,
>> 
>> the situation is a bit more nuanced than shown by the Datastax diagram, 
>> which is fairly theoretical.
>> If you're using SimpleStrategy, there is no rack awareness. Since vnode 
>> distribution is purely random, and the replica for a vnode will be placed on 
>> the node that owns the next vnode in token order (yeah, that's not easy to 
>> formulate), you end up with statistics only.
>> 
>> I kinda suck at maths but I'm going to risk making a fool of myself :)
>> 
>> The odds for one vnode to be replicated on another node are, in your case, 
>> 2/49 (out of the 49 remaining nodes, 2 replicas need to be placed).
>> Given you have 256 vnodes, the odds for at least one vnode of a single node 
>> to exist on another one are 256*(2/49) = 10.4%.
>> Since the relationship is bi-directional (the odds for node B to have a 
>> vnode replicated on node A are the same as the opposite), that doubles the 
>> odds of 2 nodes both being replicas for at least one vnode: 20.8%.
>> 
>> Having a smaller number of vnodes will decrease the odds, just as having 
>> more nodes in the cluster.
>> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in 
>> that area...)
>> 
>> How many queries that will affect is a different question, as it depends on 
>> which partitions currently exist and are queried in the unavailable token 
>> ranges.
>> 
>> Then you have the rack awareness that comes with NetworkTopologyStrategy: 
>> if the number of replicas (3 in your case) is proportional to the number of 
>> racks, Cassandra will spread the replicas across different racks.
>> In that situation, you can theoretically lose as many nodes as you want in a 
>> single rack and still have two other replicas available to satisfy quorum in 
>> the remaining racks.
>> If you start losing nodes in different racks, we're back to doing maths (but 
>> the odds will be slightly different).
>> 
>> That makes maintenance predictable because you can shut down as many nodes 
>> as you want in a single rack without losing QUORUM.
>> 
>> Feel free to correct my numbers if I'm wrong.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> 
>> On Mon, Jan 15, 2018 at 5:27 PM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> 
>> wrote:
>> Thanks, Rahul.
>> But in your example, losing Node3 and Node6 at the same time leads to loss 
>> of ranges N, C, J at consistency level QUORUM.
>> 
>> As far as I understand, in case vnodes > N_nodes_in_cluster and 
>> endpoint_snitch=SimpleSnitch, since:
>> 
>> 1) "secondary" replicas are placed on the two nodes 'next' to the node 
>> responsible for a range (in case of RF=3),
>> 2) there are a lot of vnodes on each node,
>> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>> 
>> we end up with all physical nodes (servers) having mutually adjacent token 
>> ranges.
>> Is it correct?
>> 
>> At least in the case of my real-world ~50-node cluster with vnodes=256 and 
>> RF=3, this command:
>> 
>> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq \
>>   | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' \
>>   | sort | uniq | wc -l
>> 
>> returned a number equal to Nnodes - 1, which means I can't switch off any 2 
>> nodes at the same time without losing some key range at CL=QUORUM.
>> 
>> Thanks,
>> Kyrill
>> From: Rahul Neelakantan <ra...@rahul.be>
>> Sent: Monday, January 15, 2018 5:20:20 PM
>> To: user@cassandra.apache.org
>> Subject: Re: vnodes: high availability
>>  
>> Not necessarily. It depends on how the token ranges for the vNodes are 
>> assigned to them. For example, take a look at this diagram:
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>> 
>> In the vNode part of the diagram, you will see that the loss of Node 3 and 
>> Node 6 will still not have any effect on Token Range A. But yes, if you lose 
>> two nodes that both have Token Range A assigned to them (say Node 1 and 
>> Node 2), you will have unavailability with your specified configuration.
>> 
>> You can sort of circumvent this by using the DataStax Java Driver and having 
>> the client recognize a degraded cluster and temporarily operate in a 
>> downgraded consistency mode:
>> 
>> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>> 
>> - Rahul
>> 
>> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> 
>> wrote:
>> Hi, 
>> 
>> Let's say we have a C* cluster with following parameters:
>>  - 50 nodes in the cluster
>>  - RF=3 
>>  - vnodes=256 per node
>>  - CL for some queries = QUORUM
>>  - endpoint_snitch = SimpleSnitch
>> 
>> Is it correct that any 2 nodes being down will cause unavailability of some 
>> key range at CL=QUORUM?
>> 
>> Regards, 
>> Kyrill
>> 
>> 
>> 
>> -- 
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>> 
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
