Just to add on to your response: *num_tokens* defines the number of vnodes a
node can have; the default is 256. The *initial token* range is predefined
(for Murmur3, -2**63 to 2**63-1). So if you have a one-node cluster (which
does not make much sense) with num_tokens set to 256, you will have 256
vnodes. Scaling up increases the total number of vnodes linearly with the
number of nodes in the Cassandra cluster, so a 3-node cluster with num_tokens
of 256 will have 256*3 = 768 vnodes. Each vnode takes some chunk of the
initial token range.
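A minimal Python sketch of that arithmetic (illustrative only, not
Cassandra's actual code; the helper name total_vnodes is made up):

MURMUR3_MIN = -2**63        # lower bound of the Murmur3 partitioner range
MURMUR3_MAX = 2**63 - 1     # upper bound

def total_vnodes(node_count, num_tokens=256):
    # Total vnodes grow linearly with the number of nodes.
    return node_count * num_tokens

print(total_vnodes(1))  # 256 -- single-node cluster
print(total_vnodes(3))  # 768 -- 256 * 3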
Regards,
Manish

On Tue, Nov 9, 2021 at 5:55 AM Tech Id <tech.login....@gmail.com> wrote:

> I will try to give a made-up example to show what I understand.
>
> Let us assume our hash function outputs a number between 1 and 10,000,
> so hash(primary-key) is between 1 and 10,000.
>
> Prior to vnodes, the above 1 to 10k range was split among the nodes.
> With vnodes, this 10k range is now split into, say, 20 overall vnodes
> (simplifying with smaller numbers here).
> Hence, a vnode is just a logical way to hold 10k/20 = 500 values of
> hash(primary-key).
> So vnode1 may hold values 1 to 500 of hash(primary-key), vnode2 may hold
> values 501 to 1000, and so on.
>
> Now, each node can hold any of the vnodes.
> So if we have 4 nodes in the above example, each declaring 256 num_tokens,
> then each of them will get an equal number of vnodes = 20/4 = 5 vnodes each.
> The 5 vnodes any node gets are not contiguous.
> So it could be:
> Node 1 = vn1, vn3, vn10, vn15, vn20
> Node 2 = vn2, vn4, vn5, vn18, vn19
> etc.
>
> This way, each node now holds small ranges of data, and each of those
> ranges is called a vnode.
> That is what I understand from the docs.
>
> On Mon, Nov 8, 2021 at 4:11 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> I think your mental model here is trying to map a different db concept
>> (like elasticsearch shards) onto a distributed hash table that doesn't
>> really map that way.
>>
>> There's no physical thing as a vnode. A vnode, as a concept, is "a single
>> node runs multiple tokens and owns multiple ranges". The multiple ranges
>> are the "vnodes". There's not a block of data that is a vnode. There are
>> just hosts and the ranges they own.
>>
>> On Mon, Nov 8, 2021 at 4:07 PM Tech Id <tech.login....@gmail.com> wrote:
>>
>>> Thanks Jeff.
>>>
>>> One follow-up question, please: each node specifies num_tokens.
>>> So if there are 4 nodes and each specifies 256 tokens, then it means
>>> together they are responsible for 1024 vnodes.
>>> Now, when a fifth node joins and has num_tokens set to 256 as well,
>>> does the system then have 1024+256 = 1280 vnodes?
>>>
>>> Or does the number of vnodes remain constant in the system, with the
>>> nodes just dividing it according to their num_tokens weightage?
>>> So in the above example, say the number of vnodes is constant at 1000:
>>> with 4 nodes each specifying 256 vnodes, every node in reality gets
>>> 1000/4 = 250 vnodes;
>>> with 5 nodes each specifying 256 vnodes, every node gets 1000/5 = 200
>>> vnodes.
>>>
>>> On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> When a machine starts for the first time, the joining node basically
>>>> chooses a number of tokens (num_tokens) randomly within the range of the
>>>> partitioner (for Murmur3, -2**63 to 2**63-1), and then bootstraps to
>>>> claim them.
>>>>
>>>> This is sort of a lie; in newer versions, we try to make it a bit more
>>>> deterministic (it tries to ensure an even distribution), but if you just
>>>> think of it as random, it'll make more sense.
>>>>
>>>> The only thing that enforces any meaningful order or distribution here
>>>> is a rack-aware snitch, which will ensure that the RF copies of data land
>>>> on as many racks as possible (which is where it may skip some tokens, if
>>>> they're found to be on the same rack).
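To make the random-selection point above concrete, here is a rough Python
sketch of picking tokens within the partitioner range. This is the
conceptual model only, not Cassandra's real allocator, and the helper name
choose_tokens is made up:

import random

MURMUR3_MIN, MURMUR3_MAX = -2**63, 2**63 - 1

def choose_tokens(num_tokens=256, taken=()):
    # Pick num_tokens distinct random tokens not already claimed elsewhere;
    # newer Cassandra versions bias this toward an even distribution.
    tokens = set()
    while len(tokens) < num_tokens:
        t = random.randint(MURMUR3_MIN, MURMUR3_MAX)
        if t not in taken:
            tokens.add(t)
    return sorted(tokens)

print(choose_tokens(num_tokens=4))  # e.g. four random tokens for a new node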
>>>> On Mon, Nov 8, 2021 at 3:22 PM Tech Id <tech.login....@gmail.com> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>> I think what you explained below is the behavior before and after the
>>>>> vnodes introduction.
>>>>> The vnodes part is clear - how each node holds a small range of tokens
>>>>> and how each node holds a discontiguous set of vnodes.
>>>>>
>>>>> 1. What is not clear is how each node decides what vnodes it will
>>>>> get. If it were contiguous, it would have been easy to understand (like
>>>>> a token range).
>>>>> 2. Also, the original question of this thread: if each node does
>>>>> not replicate all its vnodes to the same 2 nodes (assume RF=2), then how
>>>>> does it decide where each of its vnodes will be replicated to?
>>>>>
>>>>> Maybe the answer to #2 is apparent in the answer to #1.
>>>>> But I would really appreciate it if someone could help me understand
>>>>> the above.
>>>>>
>>>>> On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>> Vnodes are implemented by giving a single process multiple tokens.
>>>>>>
>>>>>> Tokens ultimately determine which data lives on which node. When you
>>>>>> hash a partition key, it gives you a token (let's say 570). The 3
>>>>>> processes that own token 570 are those holding the next 3 tokens in
>>>>>> the ring ABOVE 570, so if you had
>>>>>> A = 0
>>>>>> B = 1000
>>>>>> C = 2000
>>>>>> D = 3000
>>>>>> E = 4000
>>>>>>
>>>>>> the replicas for data for token=570 are B, C, D.
>>>>>>
>>>>>> When you have vnodes and there are lots of tokens (from the same small
>>>>>> set of 5 hosts), it'd look closer to:
>>>>>> A = 0
>>>>>> C = 100
>>>>>> A = 300
>>>>>> B = 700
>>>>>> D = 800
>>>>>> B = 1000
>>>>>> D = 1300
>>>>>> C = 1700
>>>>>> B = 1800
>>>>>> C = 2000
>>>>>> E = 2100
>>>>>> B = 2400
>>>>>> A = 2900
>>>>>> D = 3000
>>>>>> E = 4000
>>>>>>
>>>>>> In this case, the replicas for token=570 are B, D and C (it would go
>>>>>> B, D, B, D, but we would de-duplicate the B and D and look for the next
>>>>>> non-B/non-D host, which is C at 1700).
>>>>>>
>>>>>> If you want to see a view of this in your own cluster, use `nodetool
>>>>>> ring` to see the full token ring.
>>>>>>
>>>>>> There's no desire to enforce a replication mapping where all data on
>>>>>> A is replicated to the same set of replicas of A, because the point of
>>>>>> vnodes is to give A many distinct replicas, so when you replace A, it
>>>>>> can replicate from "many" other sources (maybe a dozen, maybe a
>>>>>> hundred). This was super important before 4.0, because each replication
>>>>>> stream was single threaded by SENDER, so vnodes let you use more than
>>>>>> 2-3 cores to re-replicate (in 4.0, it's still single threaded, but we
>>>>>> avoid a lot of deserialization, so we can saturate a NIC with only a
>>>>>> few cores; that was much harder to do before).
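A small Python sketch of the ring walk described above, using the vnode
example from that reply -- a SimpleStrategy-like placement with no rack
awareness, illustrative only and not Cassandra's actual code:

from bisect import bisect_right

# (token, node) pairs from the vnode example above, sorted by token
RING = [(0, 'A'), (100, 'C'), (300, 'A'), (700, 'B'), (800, 'D'),
        (1000, 'B'), (1300, 'D'), (1700, 'C'), (1800, 'B'), (2000, 'C'),
        (2100, 'E'), (2400, 'B'), (2900, 'A'), (3000, 'D'), (4000, 'E')]

def replicas(token, rf=3):
    # Walk clockwise from the first token strictly above `token`,
    # collecting distinct owners (de-duplicating) until we have rf of them.
    tokens = [t for t, _ in RING]
    start = bisect_right(tokens, token)
    owners = []
    for i in range(len(RING)):
        node = RING[(start + i) % len(RING)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners

print(replicas(570))  # ['B', 'D', 'C'] -- matches the walkthrough above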
>>>>>> On Mon, Nov 8, 2021 at 1:44 PM Tech Id <tech.login....@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am going through
>>>>>>> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html
>>>>>>> but it is not clear how a node decides where each of its vnodes will
>>>>>>> be replicated to.
>>>>>>>
>>>>>>> As an example from the above page:
>>>>>>>
>>>>>>> 1. Why is vnode A present on nodes 1, 2 and 5,
>>>>>>> 2. BUT vnode B is present on nodes 1, 4 and 6?
>>>>>>>
>>>>>>> I realize that the diagram is for illustration purposes only, but
>>>>>>> the idea being conveyed should nevertheless be the same as I suggested
>>>>>>> above.
>>>>>>>
>>>>>>> So how come node 1 decides to put A on itself, 2 and 5, but puts B on
>>>>>>> itself, 4 and 6?
>>>>>>> Shouldn't there be consistency here, such that all vnodes present on
>>>>>>> A are replicated to the same set of other nodes?
>>>>>>>
>>>>>>> Any clarifications on that would be appreciated.
>>>>>>>
>>>>>>> I also understand that different vnodes are replicated to different
>>>>>>> nodes for performance.
>>>>>>> But all I want to know is the algorithm it uses to put them on
>>>>>>> different nodes.
>>>>>>>
>>>>>>> Thanks!