If this is a universal recommendation, then should that actually be the default in Cassandra?
Hannu

> On 18 Jan 2018, at 00:49, Jon Haddad <j...@jonhaddad.com> wrote:
>
> I *strongly* recommend disabling dynamic snitch. I’ve seen it make latency jump 10x.
>
> dynamic_snitch: false is your friend.
>
>> On Jan 17, 2018, at 2:00 PM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>
>> Avi,
>> If we prefer to have better balancing [like absence of hotspots during a node-down event etc.], a large number of vnodes is a good solution. Personally, I wouldn't prefer better balancing over overall resiliency (and in case of a non-optimal setup, a larger number of nodes in a cluster decreases overall resiliency, as far as I understand).
>>
>> Talking about hotspots, there are a number of features helping to mitigate the issue, for example:
>> - dynamic snitch [if a node is overloaded it won't be queried]
>> - throttling of streaming operations
>>
>> Thanks,
>> Kyrill
>>
>> From: Avi Kivity <a...@scylladb.com>
>> Sent: Wednesday, January 17, 2018 2:50 PM
>> To: user@cassandra.apache.org; kurt greaves
>> Subject: Re: vnodes: high availability
>>
>> On the flip side, a large number of vnodes is also beneficial. For example, if you add a node to a 20-node cluster with many vnodes, each existing node will contribute 5% of the data towards the new node, and all nodes will participate in streaming (meaning the impact on any single node will be limited, and completion time will be faster).
>>
>> With a low number of vnodes, only a few nodes participate in streaming, which means that the cluster is left unbalanced and the impact on each streaming node is greater (or that completion time is slower).
>>
>> Similarly, with a high number of vnodes, if a node is down its work is distributed equally among all nodes. With a low number of vnodes the cluster becomes unbalanced.
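Avi's streaming arithmetic above can be sketched in a few lines (an illustrative editorial sketch, not code from the thread; the participant counts are assumptions for the example):

```python
# Sketch of the streaming arithmetic above (not Cassandra code): when a new
# node joins, the nodes owning ranges adjacent to its tokens stream data to
# it. With many vnodes, effectively every existing node participates; with
# few tokens, only a handful of range-adjacent nodes do, so each bears more.

def per_node_streaming_share(participating_nodes: int) -> float:
    """Fraction of the new node's data each participating node streams,
    assuming the load is split evenly among participants."""
    return 1.0 / participating_nodes

# Many vnodes in a 20-node cluster: all 20 existing nodes stream ~5% each.
print(per_node_streaming_share(20))  # 0.05
# Very few tokens: perhaps only 4 range-adjacent nodes stream, 25% each.
print(per_node_streaming_share(4))   # 0.25
```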
>>
>> Overall I recommend a high vnode count, and limiting the impact of failures in other ways (a smaller number of large nodes vs. a larger number of small nodes).
>>
>> btw, rack-aware topology improves the multi-failure problem, but at the cost of causing imbalance during maintenance operations. I recommend using rack-aware topology only if you really have racks with single points of failure, not for other reasons.
>>
>> On 01/17/2018 05:43 AM, kurt greaves wrote:
>>> Even with a low number of vnodes you're asking for a bad time. Even if you managed to get down to 2 vnodes per node, you're still likely to include double the number of nodes in any streaming/repair operation, which will likely be very problematic for incremental repairs, and you still won't be able to easily reason about which nodes are responsible for which token ranges. It's still quite likely that a loss of 2 nodes would mean some portion of the ring is down (at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens if you can; a lot of work still needs to be done to ensure smooth operation of C* while using vnodes, and they are much more difficult to reason about (which is probably the reason no one has bothered to do the math). If you're really keen on the math, your best bet is to do it yourself, because it's not a point of interest for many C* devs, plus probably a lot of us wouldn't remember enough math to know how to approach it.
>>>
>>> If you want to get out of this situation you'll need to do a DC migration to a new DC with a better configuration of snitch/replication strategy/racks/tokens.
>>>
>>> On 16 January 2018 at 21:54, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>> Thank you for this valuable info, Jon.
>>> I guess both you and Alex are referring to the improved vnode allocation method https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 3.0.
>>> Based on your info and the comments in the ticket, it's really a bad idea to have a small number of vnodes for the versions using the old allocation method because of hot-spots, so it's not an option for my particular case (v2.1) :(
>>>
>>> [As far as I can see from the source code, this new method wasn't backported to 2.1.]
>>>
>>> Regards,
>>> Kyrill
>>>
>>> From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad <j...@jonhaddad.com>
>>> Sent: Tuesday, January 16, 2018 8:21:33 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: vnodes: high availability
>>>
>>> We’ve used 32 tokens pre 3.0. It’s been a mixed result due to the randomness. There’s going to be some imbalance; the amount of imbalance depends on luck, unfortunately.
>>>
>>> I’m interested to hear your results using 4 tokens, would you mind letting the ML know your experience when you’ve done it?
>>>
>>> Jon
>>>
>>>> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>
>>>> Agree with you, Jon.
>>>> Actually, this cluster was configured by my 'predecessor' and [fortunately for him] we've never met :)
>>>> We're using version 2.1.15 and can't upgrade because of the legacy Netflix Astyanax client used.
>>>>
>>>> Below in the thread Alex mentioned that it's recommended to set vnodes to a value lower than 256 only for C* versions > 3.0 (the token allocation algorithm was improved in C* 3.0).
>>>>
>>>> Jon,
>>>> Do you have positive experience setting up clusters with vnodes < 256 for C* 2.1?
>>>>
>>>> vnodes=32 is also too high, as for me (we'd need to have much more than 32 servers per AZ in order to get a 'reliable' cluster).
>>>> vnodes=4 seems to be better from the HA + balancing trade-off.
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> From: Jon Haddad <jonathan.had...@gmail.com> on behalf of Jon Haddad <j...@jonhaddad.com>
>>>> Sent: Tuesday, January 16, 2018 6:44:53 PM
>>>> To: user
>>>> Subject: Re: vnodes: high availability
>>>>
>>>> While all the token math is helpful, I have to also call out the elephant in the room:
>>>>
>>>> You have not correctly configured Cassandra for production.
>>>>
>>>> If you had used the correct endpoint snitch & network topology strategy, you would be able to withstand the complete failure of an entire availability zone at QUORUM, or two if you queried at CL=ONE.
>>>>
>>>> You are correct about 256 tokens causing issues; it’s one of the reasons why we recommend 32. I’m curious how things behave going as low as 4, personally, but I haven’t done the math / tested it yet.
>>>>
>>>>> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>>
>>>>> ...to me it sounds like 'C* isn't as highly available by design as it's declared'.
>>>>> More nodes in a cluster means a higher probability of simultaneous node failures.
>>>>> And from a high-availability standpoint, it looks like the situation is made even worse by the recommended setting vnodes=256.
>>>>>
>>>>> Need to do some math to get numbers/formulas, but right now the situation doesn't seem to be promising.
>>>>> In case somebody from the C* developers/architects is reading this message, I'd be grateful to get some links to the calculations of C* reliability on which these decisions were based.
>>>>>
>>>>> Regards,
>>>>> Kyrill
>>>>>
>>>>> From: kurt greaves <k...@instaclustr.com>
>>>>> Sent: Tuesday, January 16, 2018 2:16:34 AM
>>>>> To: User
>>>>> Subject: Re: vnodes: high availability
>>>>>
>>>>> Yeah, it's very unlikely that you will have 2 nodes in the cluster with NO intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>>>>>
>>>>> If node A goes down, all 256 ranges will go down, and considering there are only 49 other nodes, all with 256 vnodes each, it's very likely that every node will be responsible for some range A was also responsible for. I'm not sure what the exact math is, but think of it this way: if, on each node, any of its 256 token ranges overlaps (is within the next RF-1 or previous RF-1 token ranges) on the ring with a token range on node A, those token ranges will be down at QUORUM.
>>>>>
>>>>> Because token range assignment just uses rand() under the hood, I'm sure you could prove that it's always going to be the case that any 2 nodes going down result in a loss of QUORUM for some token range.
>>>>>
>>>>> On 15 January 2018 at 19:59, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>> Thanks Alexander!
>>>>>
>>>>> I'm not an MS in math either :) Unfortunately.
>>>>>
>>>>> Not sure, but it seems to me that the probability of 2/49 in your explanation doesn't take into account that vnode endpoints are almost evenly distributed across all nodes (at least that's what I can see from "nodetool ring" output).
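kurt's claim above lends itself to a quick empirical check. The sketch below (an editorial illustration, not from the thread) randomly places 256 tokens per node for 50 nodes, derives SimpleStrategy-style replica sets (the token owner plus the next RF-1 distinct nodes in ring order), and tests random pairs of down nodes for QUORUM loss:

```python
import random

def build_ring(n_nodes, vnodes_per_node):
    """Random token assignment: (token, owning_node) pairs in ring order."""
    tokens = [(random.random(), node)
              for node in range(n_nodes)
              for _ in range(vnodes_per_node)]
    tokens.sort()
    return tokens

def replica_sets(ring, rf=3):
    """SimpleStrategy-style placement: owner plus next rf-1 distinct nodes."""
    n = len(ring)
    sets = []
    for i in range(n):
        reps, j = [], i
        while len(reps) < rf:
            node = ring[j % n][1]
            if node not in reps:
                reps.append(node)
            j += 1
        sets.append(reps)
    return sets

def quorum_lost(sets, down, rf=3):
    """True if some range has fewer live replicas than a quorum (2 of 3)."""
    quorum = rf // 2 + 1
    return any(sum(r in down for r in reps) > rf - quorum for reps in sets)

random.seed(1)
sets = replica_sets(build_ring(50, 256))
# With 256 vnodes per node, any two down nodes virtually always share some
# replica set, so every sampled pair loses QUORUM somewhere on the ring.
pairs = [random.sample(range(50), 2) for _ in range(100)]
print(all(quorum_lost(sets, set(p)) for p in pairs))  # True
```

The expected number of ranges co-replicated on a given pair is roughly 12800 * 3 / C(50,2) ≈ 31, so finding a pair with zero shared ranges is astronomically unlikely, matching kurt's intuition.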
>>>>>
>>>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>>>
>>>>> Of course this vnodes illustration is a theoretical one, but there are no 2 nodes on that diagram that can be switched off without losing a key range (at CL=QUORUM).
>>>>>
>>>>> That's because vnodes_per_node=8 > Nnodes=6.
>>>>> As far as I understand, the situation gets worse as the vnodes_per_node/Nnodes ratio increases.
>>>>> Please correct me if I'm wrong.
>>>>>
>>>>> How would the situation differ from this example by DataStax if we had a real-life 6-node cluster with 8 vnodes on each node?
>>>>>
>>>>> Regards,
>>>>> Kyrill
>>>>>
>>>>> From: Alexander Dejanovski <a...@thelastpickle.com>
>>>>> Sent: Monday, January 15, 2018 8:14:21 PM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Re: vnodes: high availability
>>>>>
>>>>> I was corrected off-list that the odds of losing data when 2 nodes are down aren't dependent on the number of vnodes, but only on the number of nodes. The more vnodes, the smaller the chunks of data you may lose, and vice versa. I officially suck at statistics, as expected :)
>>>>>
>>>>> On Mon, Jan 15, 2018 at 5:55 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>>>> Hi Kyrylo,
>>>>>
>>>>> The situation is a bit more nuanced than shown by the Datastax diagram, which is fairly theoretical. If you're using SimpleStrategy, there is no rack awareness. Since vnode distribution is purely random, and the replica for a vnode will be placed on the node that owns the next vnode in token order (yeah, that's not easy to formulate), you end up with statistics only.
>>>>>
>>>>> I kinda suck at maths but I'm going to risk making a fool of myself :)
>>>>>
>>>>> The odds for one vnode to be replicated on another node are, in your case, 2/49 (out of the 49 remaining nodes, 2 replicas need to be placed). Given you have 256 vnodes, the odds for at least one vnode of a single node to exist on another one are 256*(2/49) = 10.4%.
>>>>> Since the relationship is bi-directional (the odds for node B to have a vnode replicated on node A are the same as the opposite), that doubles the odds of 2 nodes both being replicas for at least one vnode: 20.8%.
>>>>>
>>>>> Having a smaller number of vnodes will decrease the odds, just as having more nodes in the cluster will.
>>>>> (Now, once again, I hope my maths aren't fully wrong; I'm pretty rusty in that area...)
>>>>>
>>>>> How many queries that will affect is a different question, as it depends on which partitions currently exist and are queried in the unavailable token ranges.
>>>>>
>>>>> Then you have the rack awareness that comes with NetworkTopologyStrategy: if the number of replicas (3 in your case) is proportional to the number of racks, Cassandra will spread replicas across different ones. In that situation, you can theoretically lose as many nodes as you want in a single rack; you will still have two other replicas available to satisfy quorum in the remaining racks. If you start losing nodes in different racks, we're back to doing maths (but the odds will be slightly different).
>>>>>
>>>>> That makes maintenance predictable, because you can shut down as many nodes as you want in a single rack without losing QUORUM.
>>>>>
>>>>> Feel free to correct my numbers if I'm wrong.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>> Thanks, Rahul.
>>>>> But in your example, the simultaneous loss of Node3 and Node6 leads to the loss of ranges N, C, J at consistency level QUORUM.
>>>>>
>>>>> As far as I understand, in case vnodes > N_nodes_in_cluster and endpoint_snitch=SimpleSnitch, since:
>>>>>
>>>>> 1) "secondary" replicas are placed on the two nodes 'next' to the node responsible for a range (in case of RF=3)
>>>>> 2) there are a lot of vnodes on each node
>>>>> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>>>>>
>>>>> we get all physical nodes (servers) having mutually adjacent token ranges. Is this correct?
>>>>>
>>>>> At least in the case of my real-world ~50-node cluster with vnodes=256 and RF=3, this command:
>>>>>
>>>>> nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc -l
>>>>>
>>>>> returned a number equal to Nnodes - 1, which means that I can't switch off 2 nodes at the same time without losing some keyrange at CL=QUORUM.
>>>>>
>>>>> Thanks,
>>>>> Kyrill
>>>>>
>>>>> From: Rahul Neelakantan <ra...@rahul.be>
>>>>> Sent: Monday, January 15, 2018 5:20:20 PM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Re: vnodes: high availability
>>>>>
>>>>> Not necessarily. It depends on how the token ranges for the vNodes are assigned to them. For example, take a look at this diagram:
>>>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>>>>
>>>>> In the vNode part of the diagram, you will see that the loss of Node 3 and Node 6 will still not have any effect on Token Range A.
>>>>> But yes, if you lose two nodes that both have Token Range A assigned to them (say Node 1 and Node 2), you will have unavailability with your specified configuration.
>>>>>
>>>>> You can sort of circumvent this by using the DataStax Java Driver and having the client recognize a degraded cluster and operate temporarily in downgraded consistency mode:
>>>>> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>>>>>
>>>>> - Rahul
>>>>>
>>>>> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Let's say we have a C* cluster with the following parameters:
>>>>> - 50 nodes in the cluster
>>>>> - RF=3
>>>>> - vnodes=256 per node
>>>>> - CL for some queries = QUORUM
>>>>> - endpoint_snitch = SimpleSnitch
>>>>>
>>>>> Is it correct that any 2 nodes down will cause unavailability of a keyrange at CL=QUORUM?
>>>>>
>>>>> Regards,
>>>>> Kyrill
>>>>>
>>>>> --
>>>>> -----------------
>>>>> Alexander Dejanovski
>>>>> France
>>>>> @alexanderdeja
>>>>>
>>>>> Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
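As a footnote to the back-of-envelope odds discussed mid-thread: under the same simplifying independence assumption (each of a node's 256 vnode ranges independently places its 2 extra replicas on 2 of the other 49 nodes), the per-range odds should be combined via the complement rather than summed, since 256 * (2/49) already exceeds 1. A quick editorial sketch:

```python
# Probability that a specific other node B holds NO replica of any of node
# A's 256 vnode ranges, assuming each range independently places its 2 extra
# replicas on 2 of the 49 remaining nodes (a simplifying assumption, not the
# exact placement Cassandra performs).
p_no_overlap = (1 - 2 / 49) ** 256
p_overlap = 1 - p_no_overlap
print(round(p_overlap, 5))  # 0.99998 -- overlap is all but certain
```

In other words, for 50 nodes at 256 vnodes, any given pair of nodes almost surely shares at least one range, consistent with the off-list correction and with kurt's observation earlier in the thread.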