If this is a universal recommendation, then should that actually be the default in Cassandra?
Hannu

> On 18 Jan 2018, at 00:49, Jon Haddad <j...@jonhaddad.com> wrote:
>
> I *strongly* recommend disabling dynamic snitch. I’ve seen it make latency jump 10x.
>
> dynamic_snitch: false is your friend.
>
>> On Jan 17, 2018, at 2:00 PM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>
>> Avi,
>> If we prefer to have better balancing [like absence of hotspots during a node-down event etc.], a large number of vnodes is a good solution. Personally, I wouldn't prefer better balancing over overall resiliency (and in case of a non-optimal setup, a larger number of nodes in a cluster decreases overall resiliency, as far as I understand).
>>
>> Talking about hotspots, there are a number of features helping to mitigate the issue, for example:
>> - dynamic snitch [if a node is overloaded it won't be queried]
>> - throttling of streaming operations
>>
>> Thanks,
>> Kyrill
>>
>> From: Avi Kivity <a...@scylladb.com>
>> Sent: Wednesday, January 17, 2018 2:50 PM
>> To: user@cassandra.apache.org; kurt greaves
>> Subject: Re: vnodes: high availability
>>
>> On the flip side, a large number of vnodes is also beneficial. For example, if you add a node to a 20-node cluster with many vnodes, each existing node will contribute 5% of the data towards the new node, and all nodes will participate in streaming (meaning the impact on any single node will be limited, and completion time will be faster).
>>
>> With a low number of vnodes, only a few nodes participate in streaming, which means that the cluster is left unbalanced and the impact on each streaming node is greater (or that completion time is slower).
>>
>> Similarly, with a high number of vnodes, if a node is down its work is distributed equally among all nodes. With a low number of vnodes the cluster becomes unbalanced.
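Avi's streaming arithmetic above can be sketched in a few lines (an illustrative editorial sketch, not code from the thread; the participant counts are assumptions for the example):

```python
# Sketch of the streaming arithmetic above (not Cassandra code): when a new
# node joins, the nodes owning ranges adjacent to its tokens stream data to
# it. With many vnodes, effectively every existing node participates; with
# few tokens, only a handful of range-adjacent nodes do, so each bears more.

def per_node_streaming_share(participating_nodes: int) -> float:
    """Fraction of the new node's data each participating node streams,
    assuming the load is split evenly among participants."""
    return 1.0 / participating_nodes

# Many vnodes in a 20-node cluster: all 20 existing nodes stream ~5% each.
print(per_node_streaming_share(20))  # 0.05
# Very few tokens: perhaps only 4 range-adjacent nodes stream, 25% each.
print(per_node_streaming_share(4))   # 0.25
```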
>>
>> Overall I recommend a high vnode count, and limiting the impact of failures in other ways (a smaller number of large nodes vs. a larger number of small nodes).
>>
>> btw, rack-aware topology improves the multi-failure problem, but at the cost of causing imbalance during maintenance operations. I recommend using rack-aware topology only if you really have racks with single points of failure, not for other reasons.
>>
>> On 01/17/2018 05:43 AM, kurt greaves wrote:
>>> Even with a low number of vnodes you're asking for a bad time. Even if you managed to get down to 2 vnodes per node, you're still likely to include double the number of nodes in any streaming/repair operation, which will likely be very problematic for incremental repairs, and you still won't be able to easily reason about which nodes are responsible for which token ranges. It's still quite likely that a loss of 2 nodes would mean some portion of the ring is down (at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens if you can; a lot of work still needs to be done to ensure smooth operation of C* while using vnodes, and they are much more difficult to reason about (which is probably the reason no one has bothered to do the math). If you're really keen on the math, your best bet is to do it yourself, because it's not a point of interest for many C* devs, plus probably a lot of us wouldn't remember enough math to know how to approach it.
>>>
>>> If you want to get out of this situation you'll need to do a DC migration to a new DC with a better configuration of snitch/replication strategy/racks/tokens.
>>>
>>> On 16 January 2018 at 21:54, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>> Thank you for this valuable info, Jon.
>>> I guess both you and Alex are referring to the improved vnode allocation method https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 3.0.
>>> Based on your info and the comments in the ticket, it's really a bad idea to have a small number of vnodes for the versions using the old allocation method because of hot-spots, so it's not an option for my particular case (v2.1) :(
>>>
>>> [As far as I can see from the source code, this new method wasn't backported to 2.1.]
>>>
>>> Regards,
>>> Kyrill
>>>
>>> From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad <j...@jonhaddad.com>
>>> Sent: Tuesday, January 16, 2018 8:21:33 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: vnodes: high availability
>>>
>>> We’ve used 32 tokens pre 3.0. It’s been a mixed result due to the randomness. There’s going to be some imbalance; the amount of imbalance depends on luck, unfortunately.
>>>
>>> I’m interested to hear your results using 4 tokens, would you mind letting the ML know your experience when you’ve done it?
>>>
>>> Jon
>>>
>>>> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>
>>>> Agree with you, Jon.
>>>> Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
>>>> We're using version 2.1.15 and can't upgrade because of the legacy Netflix Astyanax client used.
>>>>
>>>> Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* versions > 3.0 (the token allocation algorithm was improved in C* 3.0).
>>>>
>>>> Jon,
>>>> Do you have positive experience setting up clusters with vnodes < 256 for C* 2.1?
>>>>
>>>> vnodes=32 is also too high, as for me (we'd need to have much more than 32 servers per AZ in order to get a 'reliable' cluster).
>>>> vnodes=4 seems to be better from the HA + balancing trade-off.
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad <j...@jonhaddad.com>
>>>> Sent: Tuesday, January 16, 2018 6:44:53 PM
>>>> To: user
>>>> Subject: Re: vnodes: high availability
>>>>
>>>> While all the token math is helpful, I have to also call out the elephant in the room:
>>>>
>>>> You have not correctly configured Cassandra for production.
>>>>
>>>> If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE.
>>>>
>>>> You are correct about 256 tokens causing issues; it’s one of the reasons why we recommend 32. I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.
>>>>
>>>>> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>>
>>>>> ...to me it sounds like 'C* isn't as highly available by design as it's declared'.
>>>>> More nodes in a cluster means a higher probability of simultaneous node failures.
>>>>> And from a high-availability standpoint, it looks like the situation is made even worse by the recommended setting vnodes=256.
>>>>>
>>>>> Need to do some math to get numbers/formulas, but right now the situation doesn't seem to be promising.
>>>>> In case somebody from the C* developers/architects is reading this message, I'd be grateful to get some links to the calculations of C* reliability on which these decisions were based.
>>>>>
>>>>> Regards,
>>>>> Kyrill
>>>>>
>>>>> From: kurt greaves <k...@instaclustr.com>
>>>>> Sent: Tuesday, January 16, 2018 2:16:34 AM
>>>>> To: User
>>>>> Subject: Re: vnodes: high availability
>>>>>
>>>>> Yeah, it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>>>>>
>>>>> If node A goes down, all 256 ranges will go down, and considering there are only 49 other nodes, all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: if, on each node, any of its 256 token ranges overlaps (is within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A, those token ranges will be down at QUORUM.
>>>>>
>>>>> Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.
>>>>>
>>>>> On 15 January 2018 at 19:59, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>> Thanks Alexander!
>>>>>
>>>>> I'm not an MS in math either :) Unfortunately.
>>>>>
>>>>> Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into account that vnode endpoints are almost evenly distributed across all nodes (at least that's what I can see from "nodetool ring" output).
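kurt's claim above lends itself to a quick empirical check. The sketch below (an editorial illustration, not from the thread) randomly places 256 tokens per node for 50 nodes, derives SimpleStrategy-style replica sets (the token owner plus the next RF-1 distinct nodes in ring order), and tests random pairs of down nodes for QUORUM loss:

```python
import random

def build_ring(n_nodes, vnodes_per_node):
    """Random token assignment: (token, owning_node) pairs in ring order."""
    tokens = [(random.random(), node)
              for node in range(n_nodes)
              for _ in range(vnodes_per_node)]
    tokens.sort()
    return tokens

def replica_sets(ring, rf=3):
    """SimpleStrategy-style placement: owner plus next rf-1 distinct nodes."""
    n = len(ring)
    sets = []
    for i in range(n):
        reps, j = [], i
        while len(reps) < rf:
            node = ring[j % n][1]
            if node not in reps:
                reps.append(node)
            j += 1
        sets.append(reps)
    return sets

def quorum_lost(sets, down, rf=3):
    """True if some range has fewer live replicas than a quorum (2 of 3)."""
    quorum = rf // 2 + 1
    return any(sum(r in down for r in reps) > rf - quorum for reps in sets)

random.seed(1)
sets = replica_sets(build_ring(50, 256))
# With 256 vnodes per node, any two down nodes virtually always share some
# replica set, so every sampled pair loses QUORUM somewhere on the ring.
pairs = [random.sample(range(50), 2) for _ in range(100)]
print(all(quorum_lost(sets, set(p)) for p in pairs))  # True
```

The expected number of ranges co-replicated on a given pair is roughly 12800 * 3 / C(50,2) ≈ 31, so finding a pair with zero shared ranges is astronomically unlikely, matching kurt's intuition.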
>>>>>
>>>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>>>
>>>>> Of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).
>>>>>
>>>>> That's because vnodes_per_node=8 > Nnodes=6.
>>>>> As far as I understand, the situation gets worse as the vnodes_per_node/Nnodes ratio increases.
>>>>> Please correct me if I'm wrong.
>>>>>
>>>>> How would the situation differ from this example by DataStax if we had a real-life 6-node cluster with 8 vnodes on each node?
>>>>>
>>>>> Regards,
>>>>> Kyrill
>>>>>
>>>>> From: Alexander Dejanovski <a...@thelastpickle.com>
>>>>> Sent: Monday, January 15, 2018 8:14:21 PM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Re: vnodes: high availability
>>>>>
>>>>> I was corrected off-list that the odds of losing data when 2 nodes are down aren't dependent on the number of vnodes, but only on the number of nodes. The more vnodes, the smaller the chunks of data you may lose, and vice versa. I officially suck at statistics, as expected :)
>>>>>
>>>>> On Mon, Jan 15, 2018 at 5:55 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>>>> Hi Kyrylo,
>>>>>
>>>>> The situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical. If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.
>>>>>
>>>>> I kinda suck at maths but I'm going to risk making a fool of myself :)
>>>>>
>>>>> The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of the 49 remaining nodes, 2 replicas need to be placed). Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one are 256*(2/49) = 10.4%.
>>>>> Since the relationship is bi-directional (the odds for node B to have a vnode replicated on node A are the same as the opposite), that doubles the odds of 2 nodes both being replicas for at least one vnode: 20.8%.
>>>>>
>>>>> Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster will.
>>>>> (Now, once again, I hope my maths aren't fully wrong; I'm pretty rusty in that area...)
>>>>>
>>>>> How many queries that will affect is a different question, as it depends on which partitions currently exist and are queried in the unavailable token ranges.
>>>>>
>>>>> Then you have the rack awareness that comes with NetworkTopologyStrategy: if the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas across different ones. In that situation, you can theoretically lose as many nodes as you want in a single rack; you will still have two other replicas available to satisfy quorum in the remaining racks. If you start losing nodes in different racks, we're back to doing maths (but the odds will be slightly different).
>>>>>
>>>>> That makes maintenance predictable, because you can shut down as many nodes as you want in a single rack without losing QUORUM.
>>>>>
>>>>> Feel free to correct my numbers if I'm wrong.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>> Thanks, Rahul.
>>>>> But in your example, the simultaneous loss of Node3 and Node6 leads to the loss of ranges N, C, J at consistency level QUORUM.
>>>>>
>>>>> As far as I understand, in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:
>>>>>
>>>>> 1) "secondary" replicas are placed on the two nodes 'next' to the node responsible for a range (in case of RF=3)
>>>>> 2) there are a lot of vnodes on each node
>>>>> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>>>>>
>>>>> we get all physical nodes (servers) having mutually adjacent token ranges. Is this correct?
>>>>>
>>>>> At least in the case of my real-world ~50-node cluster with vnodes=256 and RF=3, this command:
>>>>>
>>>>> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l
>>>>>
>>>>> returned a number equal to Nnodes - 1, which means that I can't switch off 2 nodes at the same time without losing some keyrange at CL=QUORUM.
>>>>>
>>>>> Thanks,
>>>>> Kyrill
>>>>>
>>>>> From: Rahul Neelakantan <ra...@rahul.be>
>>>>> Sent: Monday, January 15, 2018 5:20:20 PM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Re: vnodes: high availability
>>>>>
>>>>> Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example, take a look at this diagram:
>>>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>>>
>>>>> In the vNode part of the diagram, you will see that the loss of Node 3 and Node 6 will still not have any effect on Token Range A.
>>>>> But yes, if you lose two nodes that both have Token Range A assigned to them (say Node 1 and Node 2), you will have unavailability with your specified configuration.
>>>>>
>>>>> You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode:
>>>>> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>>>>>
>>>>> - Rahul
>>>>>
>>>>> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Let's say we have a C* cluster with the following parameters:
>>>>> - 50 nodes in the cluster
>>>>> - RF=3
>>>>> - vnodes=256 per node
>>>>> - CL for some queries = QUORUM
>>>>> - endpoint_snitch = SimpleSnitch
>>>>>
>>>>> Is it correct that any 2 nodes down will cause unavailability of a keyrange at CL=QUORUM?
>>>>>
>>>>> Regards,
>>>>> Kyrill
>>>>>
>>>>> --
>>>>> -----------------
>>>>> Alexander Dejanovski
>>>>> France
>>>>> @alexanderdeja
>>>>>
>>>>> Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
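As a footnote to the back-of-envelope odds discussed mid-thread: under the same simplifying independence assumption (each of a node's 256 vnode ranges independently places its 2 extra replicas on 2 of the other 49 nodes), the per-range odds should be combined via the complement rather than summed, since 256 * (2/49) already exceeds 1. A quick editorial sketch:

```python
# Probability that a specific other node B holds NO replica of any of node
# A's 256 vnode ranges, assuming each range independently places its 2 extra
# replicas on 2 of the 49 remaining nodes (a simplifying assumption, not the
# exact placement Cassandra performs).
p_no_overlap = (1 - 2 / 49) ** 256
p_overlap = 1 - p_no_overlap
print(round(p_overlap, 5))  # 0.99998 -- overlap is all but certain
```

In other words, for 50 nodes at 256 vnodes, any given pair of nodes almost surely shares at least one range, consistent with the off-list correction and with kurt's observation earlier in the thread.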