Re: vnodes: high availability

Kyrylo Lebediev Wed, 17 Jan 2018 14:01:12 -0800

Avi,

If we prefer to have better balancing [like absence of hotspots during a node 
down event etc], large number of vnodes is a good solution.


Personally, I wouldn't prefer any balancing over overall resiliency  (and in 
case of non-optimal setup, larger number of nodes in a cluster decreases 
overall resiliency, as far as I understand.)


Talking about hotspots, there is a number of features helping to mitigate the 
issue, for example:

  - dynamic snitch [if a node overloaded it won't be queried]

  - throttling of streaming operations

Thanks,
Kyrill

________________________________
From: Avi Kivity <[email protected]>
Sent: Wednesday, January 17, 2018 2:50 PM
To: [email protected]; kurt greaves
Subject: Re: vnodes: high availability


On the flip side, a large number of vnodes is also beneficial. For example, if 
you add a node to a 20-node cluster with many vnodes, each existing node will 
contribute 5% of the data towards the new node, and all nodes will participate 
in streaming (meaning the impact on any single node will be limited, and 
completion time will be faster).


With a low number of vnodes, only a few nodes participate in streaming, which 
means that the cluster is left unbalanced and the impact on each streaming node 
is greater (or that completion time is slower).


Similarly, with a high number of vnodes, if a node is down its work is 
distributed equally among all nodes. With a low number of vnodes the cluster 
becomes unbalanced.


Overall I recommend high vnode count, and to limit the impact of failures in 
other ways (smaller number of large nodes vs. larger number of small nodes).


btw, rack-aware topology improves the multi-failure problem but at the cost of 
causing imbalance during maintenance operations. I recommend using rack-aware 
topology only if you really have racks with single-points-of-failure, not for 
other reasons.

On 01/17/2018 05:43 AM, kurt greaves wrote:
Even with a low amount of vnodes you're asking for a bad time. Even if you 
managed to get down to 2 vnodes per node, you're still likely to include double 
the amount of nodes in any streaming/repair operation which will likely be very 
problematic for incremental repairs, and you still won't be able to easily 
reason about which nodes are responsible for which token ranges. It's still 
quite likely that a loss of 2 nodes would mean some portion of the ring is down 
(at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens 
if you can; a lot of work still needs to be done to ensure smooth operation of 
C* while using vnodes, and they are much more difficult to reason about (which 
is probably the reason no one has bothered to do the math). If you're really 
keen on the math your best bet is to do it yourself, because it's not a point 
of interest for many C* devs plus probably a lot of us wouldn't remember enough 
math to know how to approach it.

If you want to get out of this situation you'll need to do a DC migration to a 
new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev 
<[email protected]<mailto:[email protected]>> wrote:

Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  
https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 
3.0.

Based on your info and comments in the ticket it's really a bad idea to have 
small number of vnodes for the versions using old allocation method because of 
hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 
2.1.]



Regards,

Kyrill

[CASSANDRA-7032] Improve vnode allocation - ASF 
JIRA<https://issues.apache.org/jira/browse/CASSANDRA-7032>
issues.apache.org<http://issues.apache.org>
It's been known for a little while that random vnode allocation causes hotspots 
of ownership. It should be possible to improve dramatically on this with 
deterministic ...


________________________________
From: Jon Haddad <[email protected]<mailto:[email protected]>> 
on behalf of Jon Haddad <[email protected]<mailto:[email protected]>>
Sent: Tuesday, January 16, 2018 8:21:33 PM

To: [email protected]<mailto:[email protected]>
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  
There’s going to be some imbalance, the amount of imbalance depends on luck, 
unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the 
ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev 
<[email protected]<mailto:[email protected]>> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for 
him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax 
client used.

Below in the thread Alex mentioned that it's recommended to set vnodes to a 
value lower than 256 only for C* version > 3.0 (token allocation algorithm was 
improved since C* 3.0) .

Jon,
Do you have positive experience setting up  cluster with vnodes < 256 for  C* 
2.1?

vnodes=32 also too high, as for me (we need to have much more than 32 servers 
per AZ in order to to get 'reliable' cluster)
vnodes=4 seems to be better from HA + balancing trade-off

Thanks,
Kyrill
________________________________
From: Jon Haddad <[email protected]<mailto:[email protected]>> 
on behalf of Jon Haddad <[email protected]<mailto:[email protected]>>
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in 
the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you 
would be able to withstand the complete failure of an entire availability zone 
at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we 
recommend 32.  I’m curious how things behave going as low as 4, personally, but 
I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev 
<[email protected]<mailto:[email protected]>> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.
In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.

Regards,
Kyrill
________________________________
From: kurt greaves <[email protected]<mailto:[email protected]>>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into account that vnodes endpoints are almost evenly distributed 
across all nodes (al least it's what I can see from "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there no 2 nodes 
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, situation is getting worse with increase of 
vnodes_per_node/Nnode ratio.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a 
real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill

________________________________
From: Alexander Dejanovski 
<[email protected]<mailto:[email protected]>>
Sent: Monday, January 15, 2018 8:14:21 PM

To: [email protected]<mailto:[email protected]>
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

Le lun. 15 janv. 2018 à 17:55, Alexander Dejanovski 
<[email protected]<mailto:[email protected]>> a écrit :
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode 
distribution is purely random, and the replica for a vnode will be placed on 
the node that owns the next vnode in token order (yeah, that's not easy to 
formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 
(out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to 
exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to 
have a vnode replicated on node A than the opposite), that doubles the odds of 
2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more 
nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that 
area...)

How many queries that will affect is a different question as it depends on 
which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of 
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a 
single rack, you will still have two other replicas available to satisfy quorum 
in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but 
the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as 
you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev 
<[email protected]<mailto:[email protected]>> wrote:
Thanks, Rahul.
But in your example, at the same time loss of Node3 and Node6 leads to loss of 
ranges N, C, J at consistency level QUORUM.

As far as I understand in case vnodes > N_nodes_in_cluster and 
endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible 
for a range (in case of RF=3)
2) there are a lot of vnodes on each node
3) ranges are evenly distribusted between vnodes in case of SimpleSnitch,

we get all physical nodes (servers) having mutually adjacent  token rages.
Is it correct?

At least in case of my real-world ~50-nodes cluster with nvodes=256, RF=3 for 
this command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 
'<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc 
-l

returned number which equals to Nnodes -1, what means that I can't switch off 2 
nodes at the same time w/o losing of some keyrange for CL=QUORUM.

Thanks,
Kyrill
________________________________
From: Rahul Neelakantan <[email protected]<mailto:[email protected]>>
Sent: Monday, January 15, 2018 5:20:20 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned 
to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that Loss of Node 3 and Node 6, 
will still not have any effect on Token Range A. But yes if you lose two nodes 
that both have Token Range A assigned to them (Say Node 1 and Node 2), you will 
have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having 
the client recognize a degraded cluster and operate temporarily in downgraded 
consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

Let's say we have a C* cluster with following parameters:
 - 50 nodes in the cluster
 - RF=3
 - vnodes=256 per node
 - CL for some queries = QUORUM
 - endpoint_snitch = SimpleSnitch

Is it correct that 2 any nodes down will cause unavailability of a keyrange at 
CL=QUORUM?

Regards,
Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com<http://www.thelastpickle.com/>

Re: vnodes: high availability

Reply via email to