Just to add on to your response:
*num_tokens* defines the number of vnodes a node can have. The default is 256.
The *initial token* range is predefined (for Murmur3, -2**63 to 2**63-1).
So if you have a one-node cluster (which does not make much sense) with
num_tokens as 256, then you will have 256 vnodes. Scaling up will
increase the total number of vnodes linearly with respect to the number of
nodes in the Cassandra cluster. So for a 3-node cluster with num_tokens as
256, you will have 256*3 = 768 vnodes. All vnodes will take some chunk
of the initial token range.
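To make the arithmetic concrete, here is a tiny Python sketch (the numbers
are just the defaults discussed above, not output from a real cluster):

    # Each node contributes num_tokens vnodes, so the cluster-wide
    # total grows linearly with the number of nodes.
    NUM_TOKENS = 256  # the default num_tokens

    for node_count in (1, 3, 5):
        print(f"{node_count} node(s) -> {node_count * NUM_TOKENS} vnodes")
    # 1 node(s) -> 256 vnodes
    # 3 node(s) -> 768 vnodes
    # 5 node(s) -> 1280 vnodes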
Regards
Manish
On Tue, Nov 9, 2021 at 5:55 AM Tech Id <tech.login....@gmail.com> wrote:
I will try to give a made-up example to show what I understand.
Let us assume our hash function outputs a number between 1 and 10,000.
So hash(primary-key) is between 1 and 10,000.
Prior to vnodes, the above 1-to-10k range was split among the nodes.
With vnodes, this 10k range is now split into, say, 20 overall
vnodes (simplifying with smaller numbers here).
Hence, a vnode is just a logical way to hold 10k/20 = 500 values of
hash(primary-key).
So vnode1 may hold values 1 to 500 of hash(primary-key), vnode2
may hold values 501 to 1000 of hash(primary-key), and so on.
Now, each node can hold any subset of the vnodes.
So if we have 4 nodes in the above example (keeping the total at 20
vnodes for simplicity, rather than 256 per node), then each of them
will get an equal number of vnodes: 20/4 = 5 vnodes each.
The 5 vnodes a node gets are not contiguous.
So it could be:
Node 1 = vn1, vn3, vn10, vn15, vn20
Node 2 = vn2, vn4, vn5, vn18, vn19
etc
This way, each node now holds small ranges of data, and each of those
ranges is called a vnode.
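To make my example concrete, here is a small Python sketch (the vnode
boundaries and the node assignment below are invented purely for
illustration):

    # Split the toy hash range 1..10,000 into 20 vnodes of 500 values each.
    vnodes = {f"vn{i + 1}": (i * 500 + 1, (i + 1) * 500) for i in range(20)}
    print(vnodes["vn1"])  # (1, 500)
    print(vnodes["vn2"])  # (501, 1000)

    # Spread the 20 vnodes across 4 nodes, 5 each; the assignment is
    # deliberately non-contiguous, as described above.
    assignment = {
        "node1": ["vn1", "vn3", "vn10", "vn15", "vn20"],
        "node2": ["vn2", "vn4", "vn5", "vn18", "vn19"],
        # ... node3 and node4 hold the remaining 10 vnodes
    }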
That is what I understand from the docs.
On Mon, Nov 8, 2021 at 4:11 PM Jeff Jirsa <jji...@gmail.com> wrote:
I think your mental model here is trying to map a different db
concept (like Elasticsearch shards) onto a distributed hash
table that doesn't really map that way.
There's no such physical thing as a vnode. A vnode, as a concept, is
"a single node runs multiple tokens and owns multiple ranges".
Multiple ranges are the "vnodes". There's not a block of data
that is a vnode. There's just hosts and ranges they own.
On Mon, Nov 8, 2021 at 4:07 PM Tech Id
<tech.login....@gmail.com> wrote:
Thanks Jeff.
One follow-up question please: each node specifies num_tokens.
So if there are 4 nodes and each specifies 256 tokens,
then it means together they are responsible for 1024 vnodes.
Now, when a fifth node joins and has num_tokens set to 256
as well, does the system then have 1024+256 = 1280 vnodes?
Or does the number of vnodes remain constant in the system, with
the nodes just dividing it according to their num_tokens
weighting?
So in the above example, say the number of vnodes is constant
at 1000:
With 4 nodes each specifying 256 tokens, every node in
reality gets 1000/4 = 250 vnodes.
With 5 nodes each specifying 256 tokens, every node gets
1000/5 = 200 vnodes.
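Spelling out the two interpretations numerically (hypothetical numbers
from my question above), in Python:

    # Interpretation 1: the vnode count grows with the cluster.
    print(4 * 256)  # 1024 vnodes with 4 nodes
    print(5 * 256)  # 1280 vnodes once the fifth node joins

    # Interpretation 2: a fixed pool of vnodes (say 1000) divided evenly.
    print(1000 // 4)  # 250 vnodes per node with 4 nodes
    print(1000 // 5)  # 200 vnodes per node with 5 nodes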
On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa
<jji...@gmail.com> wrote:
When a machine starts for the first time, the joining
node basically chooses a number of tokens (num_tokens)
randomly within the range of the partitioner (for
Murmur3, -2**63 to 2**63-1), and then bootstraps to
claim them.
This is sort of a lie; in newer versions, we try to
make it a bit more deterministic (it tries to ensure
an even distribution), but if you just think of it as
random, it'll make more sense.
The only thing that enforces any meaningful order or
distribution here is a rack-aware snitch, which will
ensure that the RF copies of data land on as many
racks as possible (which is where it may skip some
tokens, if they're found to be on the same rack).
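As a rough Python sketch of the "random within the partitioner range"
idea (a simplification for intuition, not the actual bootstrap code):

    import random

    # Murmur3Partitioner token range.
    MIN_TOKEN = -2**63
    MAX_TOKEN = 2**63 - 1

    def choose_tokens(num_tokens=256):
        # A joining node picks num_tokens random tokens in the
        # partitioner's range and bootstraps to claim them.
        return sorted(random.randint(MIN_TOKEN, MAX_TOKEN)
                      for _ in range(num_tokens))

    print(len(choose_tokens()))  # 256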
On Mon, Nov 8, 2021 at 3:22 PM Tech Id
<tech.login....@gmail.com> wrote:
Thanks Jeff.
I think what you explained below is before and
after the introduction of vnodes.
The vnodes part is clear - how each node holds
small ranges of tokens and how each node holds a
discontiguous set of vnodes.
1. What is not clear is how each node decides
   which vnodes it will get. If they were
   contiguous, it would be easy to
   understand (like a token range).
2. Also the original question of this thread: if
   each node does not replicate all its vnodes to
   the same 2 nodes (assume RF=2), then how does
   it decide where each of its vnodes will be
   replicated to?
Maybe the answer to #2 is apparent from the answer to #1.
But I would really appreciate it if someone could help
me understand the above.
On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa
<jji...@gmail.com> wrote:
Vnodes are implemented by giving a single
process multiple tokens.
Tokens ultimately determine which data lives
on which node. When you hash a partition key,
it gives you a token (let's say 570). The 3
processes that own token 570 are the next 3
tokens in the ring ABOVE 570, so if you had
A = 0
B = 1000
C = 2000
D = 3000
E = 4000
The replicas for data for token=570 are B, C, D.
When you have vnodes and there are lots of
tokens (from the same small set of 5 hosts),
it'd look closer to:
A = 0
C = 100
A = 300
B = 700
D = 800
B = 1000
D = 1300
C = 1700
B = 1800
C = 2000
E = 2100
B = 2400
A = 2900
D = 3000
E = 4000
In this case, the replicas for token=570 are
B, D, and C (it would go B, D, B, D, but we
de-duplicate the B and D and look for the
next non-B/non-D host, which is C at 1700).
If you want to see a view of this in your own
cluster, use `nodetool ring` to see the full
token ring.
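To make the walk above concrete, here is a simplified Python sketch of
SimpleStrategy-style replica selection (ignoring racks and datacenters;
the ring is the vnode example from this email):

    from bisect import bisect_left

    # (token, host) pairs from the example above, sorted by token.
    ring = [(0, "A"), (100, "C"), (300, "A"), (700, "B"), (800, "D"),
            (1000, "B"), (1300, "D"), (1700, "C"), (1800, "B"),
            (2000, "C"), (2100, "E"), (2400, "B"), (2900, "A"),
            (3000, "D"), (4000, "E")]

    def replicas_for(token, rf=3):
        # Walk clockwise from the first ring position at or above the
        # data's token, collecting distinct hosts until we have rf.
        tokens = [t for t, _ in ring]
        start = bisect_left(tokens, token)
        hosts = []
        for i in range(len(ring)):
            host = ring[(start + i) % len(ring)][1]
            if host not in hosts:
                hosts.append(host)
            if len(hosts) == rf:
                break
        return hosts

    print(replicas_for(570))  # ['B', 'D', 'C']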
There's no desire to enforce a replication
mapping where all data on A is replicated to
the same set of replicas of A, because the
point of vnodes is to give A many distinct
replicas, so when you replace A, it can
replicate from "many" other sources (maybe a
dozen, maybe a hundred). This was super
important before 4.0, because each replication
stream was single-threaded by SENDER, so
vnodes let you use more than 2-3 cores to
re-replicate (in 4.0 it's still single-threaded,
but we avoid a lot of deserialization, so we
can saturate a NIC with only a few cores;
that was much harder to do before).
On Mon, Nov 8, 2021 at 1:44 PM Tech Id
<tech.login....@gmail.com> wrote:
Hello,
Going through
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html.
But it is not clear how a node decides
where each of its vnodes will be
replicated to.
As an example from the above page:
1. Why is vnode A present in nodes 1, 2, and 5,
2. BUT vnode B is present in nodes 1, 4, and 6?
I realize that the diagram is for
illustration purposes only, but the idea
being conveyed should nevertheless be the
same as I suggested above.
So how come node 1 decides to put A on itself,
2, and 5, but put B on itself, 4, and 6?
Shouldn't there be consistency here, such
that all vnodes present on A are
replicated to the same set of other nodes?
Any clarifications on that would be
appreciated.
I also understand that different vnodes
are replicated to different nodes for
performance.
But all I want to know is the algorithm
that it uses to put them on different nodes.
Thanks!