Alex, thanks for the detailed explanation.

> The more nodes you have, the smaller will be the subset of data that cannot 
> achieve quorum (so your outage is not as bad as when you have a small number 
> of nodes)

Okay, let's say we lost 0.5% of keyrange [for specified CL]. Critically 
important data chunks may fall into this range.


>If you want more availability but don't want to sacrifice consistency, you can 
>raise your replication factor (if you can afford the extra disk space usage)


IMO, ~1.7x more disk space (5/3) isn't the most important drawback in this 
case. For example, increasing RF could affect latency for queries with 
CL=QUORUM (we would have to wait for responses from 3 nodes at RF=5 instead of 
2 nodes at RF=3). Plus, the overhead of most maintenance operations, such as 
anti-entropy repairs, should grow as RF increases.
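
The quorum arithmetic behind this comparison can be sketched in a few lines of 
Python. This is only an illustration, assuming the usual definition 
quorum = floor(RF / 2) + 1; the function name `quorum` is made up for the 
example.

```python
def quorum(rf: int) -> int:
    """Replicas that must respond at CL=QUORUM: floor(RF / 2) + 1."""
    return rf // 2 + 1


for rf in (3, 5):
    q = quorum(rf)
    # rf - q is how many replicas of a range can be down while QUORUM still works.
    print(f"RF={rf}: quorum={q}, tolerated replica failures={rf - q}, "
          f"disk usage relative to RF=3: {rf / 3:.2f}x")
```

So RF=5 tolerates two lost replicas per range instead of one, at the price of 
waiting for one more response on every QUORUM request.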


>Datacenter and rack awareness built in Cassandra can help with availability 
>guarantees : 1 full rack down out of 3 will still allow QUORUM at RF=3

At the same time, if any 2 nodes from different racks go down simultaneously 
(with a high number of vnodes), a keyrange becomes unavailable, as far as I 
understand. [I stick to CL=QUORUM just for brevity's sake. For CL=ONE the issue 
is similar.]


[In the worst-case scenario of any C* setup (with or without vnodes), if we 
lose two 'neighboring' nodes, we get an outage for a keyrange at CL=QUORUM: 
http://thelastpickle.com/blog/2011/06/13/Down-For-Me.html ]


--

I'm not trying to criticize the C* architecture, which gives us good features 
like linear scalability, but I feel that some math should be done in order to 
elaborate best practices on how to maximize availability of C* clusters 
(there are a number of best practices, including those mentioned by Alex, but 
personally I sometimes can't see what math is behind them).


If anybody on the DL has some math prepared, please share it with us. I guess 
I'm not the only one interested in getting these valuable formulas/graphs.


Thanks,

Kyrill




________________________________
From: Alexander Dejanovski <a...@thelastpickle.com>
Sent: Tuesday, January 16, 2018 12:50:13 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Hi Kyrylo,

high availability can be interpreted in many ways, and comes with some 
tradeoffs with consistency when things go wrong.
A few considerations here :

  *   The more nodes you have, the smaller will be the subset of data that 
cannot achieve quorum (so your outage is not as bad as when you have a small 
number of nodes)
  *   If you want more availability but don't want to sacrifice consistency, 
you can raise your replication factor (if you can afford the extra disk space 
usage)
  *   Datacenter and rack awareness built in Cassandra can help with 
availability guarantees : 1 full rack down out of 3 will still allow QUORUM at 
RF=3 and 2 racks down out of 5 at RF=5. Having one datacenter down (when using 
LOCAL_QUORUM) allows you to switch to another one and still have a working 
cluster.
  *   As mentioned in this thread, you can use downgrading retry policies to 
improve availability at the transient expense of consistency (check if your use 
case allows it)

Now about vnodes, the recommendation of using 256 is based on statistical 
analysis of data balance across clusters. Since the token allocation is fully 
random, it's been observed that 256 vnodes always gave a good balance.
If you're using a version of Cassandra >= 3.0, you can lower that to a value 
between 16 and 32 and use the new token allocation algorithm. It will 
perform several attempts in order to balance a specific keyspace during 
bootstrap.
Using smaller numbers of vnodes will also improve repair time.
I won't go into statistics again (yikes) and leave it to people that are better 
at doing maths on how the number of vnodes per node could affect availability.

That brings us to the fact that you can fully disable vnodes and use a single 
token per node. In that case, you can be sure which nodes are replicas of the 
same tokens as it follows the ring order : With RF=3, node A tokens are 
replicated on nodes B and C, and node B tokens are replicated on nodes C and D, 
and so on.
You get more predictability as to which nodes can be taken down at the same 
time without losing QUORUM.
But you must afford the operational burden of handling tokens manually, and 
accept that growing the cluster means doubling the size each time.
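
The ring-order rule described above can be checked with a small Python sketch. 
It assumes a single token per node, RF=3 and replicas placed on the next RF-1 
nodes in ring order; the six-node ring and the helper names `replicas` and 
`safe_pairs` are hypothetical and only serve the illustration.

```python
from itertools import combinations


def replicas(ring, i, rf):
    """Nodes holding the range owned by ring[i]: the owner plus the next rf-1 nodes."""
    n = len(ring)
    return {ring[(i + k) % n] for k in range(rf)}


def safe_pairs(ring, rf):
    """Pairs of nodes that can be down together while every range keeps a quorum."""
    quorum = rf // 2 + 1
    safe = []
    for a, b in combinations(ring, 2):
        down = {a, b}
        # A pair is safe only if every range still has `quorum` live replicas.
        if all(len(replicas(ring, i, rf) - down) >= quorum
               for i in range(len(ring))):
            safe.append((a, b))
    return safe


ring = list("ABCDEF")
print(safe_pairs(ring, rf=3))  # [('A', 'D'), ('B', 'E'), ('C', 'F')]
```

On this toy ring, only nodes that sit at least RF positions apart can be 
stopped together, which is the predictability a single-token layout offers.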

The thing to consider is how your apps/services will react in case of transient 
loss of QUORUM : can you afford eventual consistency ? Is it better to endure 
full downtime or just on a subset of your partitions ?
And can you design your cluster with racks/datacenters so that you can better 
predict how to run maintenance operations or if you may be losing QUORUM ?

The way Cassandra is designed also allows linear scalability, which 
master/slave based databases cannot handle (and master/slave architectures come 
with their set of challenges, especially during network partitions).

So, while the high availability isn't as transparent as one might think (and I 
understand why you may be disappointed), you have a lot of options on how to 
react to partial downtime, and that's something you must consider both when 
designing your cluster (how it is segmented, how operations are performed), and 
when designing your apps (how you will use the driver, how your apps will react 
to failure).

Cheers,


On Tue, Jan 16, 2018 at 11:03 AM Kyrylo Lebediev <kyrylo_lebed...@epam.com> 
wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.

More nodes in a cluster means a higher probability of simultaneous node 
failures.

And from a high-availability standpoint, it looks like the situation is made 
even worse by the recommended setting of vnodes=256.


I need to do some math to get numbers/formulas, but for now the situation 
doesn't seem promising.

In case somebody from the C* developers/architects is reading this message, 
I'd be grateful for some links to the C* reliability calculations on which 
these decisions were based.


Regards,

Kyrill

________________________________
From: kurt greaves <k...@instaclustr.com>
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah, it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down, all 256 of its ranges will go down, and considering there 
are only 49 other nodes, all with 256 vnodes each, it's very likely that every 
node will be responsible for some range A was also responsible for. I'm not 
sure what the exact math is, but think of it this way: if any of a node's 256 
token ranges overlaps on the ring with a token range on node A (i.e. it's 
within the next RF-1 or previous RF-1 token ranges), then losing that node as 
well takes those token ranges down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.
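
Short of a proof, the claim can be probed with a small simulation. The Python 
sketch below is a simplified model, not Cassandra's code: it assumes uniformly 
random tokens, SimpleStrategy-style placement (walk the ring clockwise until RF 
distinct nodes are found), RF=3 and RF <= number of nodes. All function names 
are hypothetical.

```python
import random
from itertools import combinations


def build_ring(n_nodes, vnodes, rng):
    """Give each node `vnodes` random tokens; return node ids sorted by token."""
    tokens = [(rng.random(), node)
              for node in range(n_nodes) for _ in range(vnodes)]
    tokens.sort()
    return [node for _, node in tokens]


def replica_sets(ring, rf):
    """For each token, walk clockwise collecting the first `rf` distinct nodes."""
    n = len(ring)
    sets = set()
    for i in range(n):
        found = []
        j = i
        while len(found) < rf:  # terminates because rf <= number of nodes
            node = ring[j % n]
            if node not in found:
                found.append(node)
            j += 1
        sets.add(frozenset(found))
    return sets


def fraction_of_unsafe_pairs(n_nodes, vnodes, rf=3, seed=0):
    """Share of node pairs that are both replicas of at least one range.

    With RF=3, losing such a pair leaves one replica for that range,
    which is below the quorum of 2.
    """
    rng = random.Random(seed)
    ring = build_ring(n_nodes, vnodes, rng)
    unsafe = set()
    for reps in replica_sets(ring, rf):
        unsafe.update(combinations(sorted(reps), 2))
    return len(unsafe) / (n_nodes * (n_nodes - 1) // 2)


for vnodes in (1, 16, 256):
    share = fraction_of_unsafe_pairs(50, vnodes)
    print(f"50 nodes, {vnodes} vnodes: {share:.1%} of node pairs share a range")
```

With one token per node, each node shares ranges with only its four ring 
neighbours, so 100 of the 1225 pairs (about 8.2%) are unsafe. With 256 vnodes 
the simulated share should come out at, or extremely close to, 100%.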

On 15 January 2018 at 19:59, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:

Thanks Alexander!


I don't have an MS in math either :) Unfortunately.


Not sure, but it seems to me that the probability of 2/49 in your explanation 
doesn't take into account that vnode endpoints are almost evenly distributed 
across all nodes (at least that's what I can see from the "nodetool ring" 
output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
Of course, this vnodes illustration is a theoretical one, but there are no 2 
nodes on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, the situation gets worse as the 
vnodes_per_node/Nnodes ratio increases.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax if we had a 
real-life 6-node cluster with 8 vnodes on each node?


Regards,

Kyrill


________________________________
From: Alexander Dejanovski <a...@thelastpickle.com>
Sent: Monday, January 15, 2018 8:14:21 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

On Mon, Jan 15, 2018 at 5:55 PM Alexander Dejanovski 
<a...@thelastpickle.com> wrote:

Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode 
distribution is purely random, and the replica for a vnode will be placed on 
the node that owns the next vnode in token order (yeah, that's not easy to 
formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 
(out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to 
exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to 
have a vnode replicated on node A than the opposite), that doubles the odds of 
2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more 
nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that 
area...)
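
As a rough cross-check of the arithmetic above, here is a minimal Python sketch 
under a simplified model that is an assumption, not something Cassandra 
guarantees: each vnode of node A independently places its RF-1 = 2 extra 
replicas on nodes drawn uniformly from the 49 others. Under that model, 
256*(2/49) is an expected count (about 10.4 of A's vnodes replicated on a given 
other node) rather than a percentage, and the chance that a given pair of nodes 
shares at least one range is very close to 1. The function name `p_share_range` 
is hypothetical.

```python
def p_share_range(n_nodes: int, vnodes: int, rf: int = 3) -> float:
    """P(a given other node holds a replica of at least one of A's vnodes).

    Assumes each vnode's rf-1 extra replicas land on nodes picked uniformly
    and independently among the n_nodes-1 others (a simplification).
    """
    p_one_vnode = (rf - 1) / (n_nodes - 1)      # 2/49 in the thread's example
    return 1 - (1 - p_one_vnode) ** vnodes       # complement of "never picked"


expected_shared = 256 * 2 / 49                   # an expected count, not a probability
print(f"expected shared vnodes: {expected_shared:.2f}")
for vnodes in (1, 16, 32, 256):
    print(f"{vnodes:>3} vnodes: P(share a range) = {p_share_range(50, vnodes):.5f}")
```

In this model, lowering the number of vnodes does reduce the probability, but 
it stays high for the commonly used values.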

How many queries that will affect is a different question, as it depends on 
which partitions currently exist and are queried in the unavailable token 
ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of 
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a 
single rack, you will still have two other replicas available to satisfy quorum 
in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths (but 
the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes as 
you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <kyrylo_lebed...@epam.com> 
wrote:

Thanks, Rahul.

But in your example, the simultaneous loss of Node3 and Node6 leads to the 
loss of ranges N, C, J at consistency level QUORUM.


As far as I understand in case vnodes > N_nodes_in_cluster and 
endpoint_snitch=SimpleSnitch, since:

1) "secondary" replicas are placed on two nodes 'next' to the node responsible 
for a range (in case of RF=3)

2) there are a lot of vnodes on each node
3) ranges are evenly distributed between vnodes in case of SimpleSnitch,


we get all physical nodes (servers) having mutually adjacent token ranges.
Is that correct?

At least on my real-world ~50-node cluster with vnodes=256 and RF=3, this 
command:

nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq | grep -B2 -A2 
'<ip_of_a_node>' | grep -v '<ip_of_a_node>' | grep -v '^--' | sort | uniq | wc 
-l

returned a number equal to Nnodes - 1, which means that I can't switch off 2 
nodes at the same time without losing some keyrange at CL=QUORUM.
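
For readers who prefer to see the logic of that pipeline spelled out, here is a 
rough Python equivalent. It is a sketch that assumes the input is the first 
column of the "nodetool ring" output (one address per token, in token order) 
already loaded into a list; the function name `ring_neighbours` and the sample 
addresses are made up.

```python
from itertools import groupby


def ring_neighbours(addresses, target, distance=2):
    """Distinct nodes within `distance` ring positions of any token of `target`.

    Mirrors the shell pipeline: `uniq` collapses consecutive duplicates,
    `grep -B2 -A2` takes two lines of context, and the target is excluded.
    Like grep, it does not wrap around the ends of the list.
    """
    collapsed = [addr for addr, _ in groupby(addresses)]  # same effect as `uniq`
    found = set()
    for i, addr in enumerate(collapsed):
        if addr == target:
            lo = max(0, i - distance)
            found.update(collapsed[lo:i + distance + 1])
    found.discard(target)
    return found


sample = ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3",
          "10.0.0.1", "10.0.0.4", "10.0.0.3", "10.0.0.5"]
print(sorted(ring_neighbours(sample, "10.0.0.1")))
# ['10.0.0.2', '10.0.0.3', '10.0.0.4']
```

If the size of the returned set equals the number of nodes minus one, every 
other node is a ring neighbour of the target for at least one range.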


Thanks,

Kyrill

________________________________
From: Rahul Neelakantan <ra...@rahul.be>
Sent: Monday, January 15, 2018 5:20:20 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Not necessarily. It depends on how the token ranges for the vNodes are assigned 
to them. For example take a look at this diagram
http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

In the vNode part of the diagram, you will see that the loss of Node 3 and 
Node 6 will still not have any effect on Token Range A. But yes, if you lose 
two nodes that both have Token Range A assigned to them (say Node 1 and 
Node 2), you will have unavailability with your specified configuration.

You can sort of circumvent this by using the DataStax Java Driver and having 
the client recognize a degraded cluster and operate temporarily in downgraded 
consistency mode

http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

- Rahul

On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> 
wrote:

Hi,


Let's say we have a C* cluster with the following parameters:

 - 50 nodes in the cluster

 - RF=3

 - vnodes=256 per node

 - CL for some queries = QUORUM

 - endpoint_snitch = SimpleSnitch


Is it correct that any 2 nodes being down will cause unavailability of a 
keyrange at CL=QUORUM?


Regards,

Kyrill



--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com