I think your mental model here is trying to map a different DB concept (like Elasticsearch shards) onto a distributed hash table, and it doesn't really map that way.
There's no such physical thing as a vnode. A vnode, as a concept, means "a
single node runs multiple tokens and owns multiple ranges". Those multiple
ranges are the "vnodes". There's no block of data that is a vnode; there are
just hosts and the ranges they own.

On Mon, Nov 8, 2021 at 4:07 PM Tech Id <tech.login....@gmail.com> wrote:

> Thanks Jeff.
>
> One follow-up question, please: each node specifies num_tokens.
> So if there are 4 nodes and each specifies 256 tokens, then together they
> are responsible for 1024 vnodes.
> Now, when a fifth node joins and has num_tokens set to 256 as well, does
> the system have 1024 + 256 = 1280 vnodes?
>
> Or does the number of vnodes stay constant in the system, with the nodes
> dividing it according to their num_tokens weightage?
> So in the above example, say the number of vnodes is constant at 1000.
> With 4 nodes each specifying 256 vnodes, every node in reality gets
> 1000/4 = 250 vnodes.
> With 5 nodes each specifying 256 vnodes, every node gets 1000/5 = 200
> vnodes.
>
> On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> When a machine starts for the first time, the joining node basically
>> chooses a number of tokens (num_tokens) randomly within the range of the
>> partitioner (for murmur3, -2**63 to 2**63 - 1), and then bootstraps to
>> claim them.
>>
>> This is sort of a lie: in newer versions, we try to make it a bit more
>> deterministic (it tries to ensure an even distribution), but if you just
>> think of it as random, it'll make more sense.
>>
>> The only thing that enforces any meaningful order or distribution here
>> is a rack-aware snitch, which will ensure that the RF copies of data
>> land on as many racks as possible (which is where it may skip some
>> tokens, if they're found to be on the same rack).
>>
>> On Mon, Nov 8, 2021 at 3:22 PM Tech Id <tech.login....@gmail.com> wrote:
>>
>>> Thanks Jeff.
>>> I think what you explained below covers both before and after the
>>> introduction of vnodes.
>>> The vnodes part is clear - how each node holds many small token ranges
>>> and how each node holds a discontiguous set of vnodes.
>>>
>>> 1. What is not clear is how each node decides which vnodes it will
>>> get. If it were contiguous, it would be easy to understand (like a
>>> token range).
>>> 2. Also, the original question of this thread: if each node does not
>>> replicate all its vnodes to the same 2 nodes (assume RF=2), how does
>>> it decide where each of its vnodes will be replicated to?
>>>
>>> Maybe the answer to #2 is apparent in the answer to #1.
>>> But I would really appreciate it if someone could help me understand
>>> the above.
>>>
>>> On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Vnodes are implemented by giving a single process multiple tokens.
>>>>
>>>> Tokens ultimately determine which data lives on which node. When you
>>>> hash a partition key, it gives you a token (let's say 570). The 3
>>>> processes that own token 570 are the owners of the next 3 tokens in
>>>> the ring ABOVE 570, so if you had
>>>> A = 0
>>>> B = 1000
>>>> C = 2000
>>>> D = 3000
>>>> E = 4000
>>>>
>>>> the replicas for data with token=570 are B, C, D.
>>>>
>>>> When you have vnodes and there are lots of tokens (from the same
>>>> small set of 5 hosts), it'd look closer to:
>>>> A = 0
>>>> C = 100
>>>> A = 300
>>>> B = 700
>>>> D = 800
>>>> B = 1000
>>>> D = 1300
>>>> C = 1700
>>>> B = 1800
>>>> C = 2000
>>>> E = 2100
>>>> B = 2400
>>>> A = 2900
>>>> D = 3000
>>>> E = 4000
>>>>
>>>> In this case, the replicas for token=570 are B, D and C (the walk
>>>> goes B, D, B, D, but we de-duplicate the repeated B and D and look
>>>> for the next non-B/non-D host, which is C at 1700).
>>>>
>>>> If you want to see a view of this in your own cluster, use `nodetool
>>>> ring` to see the full token ring.
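The clockwise ring walk described above can be sketched in a few lines of Python. This is a simplified, SimpleStrategy-style selection that ignores racks and datacenters; `RING` and `replicas` are illustrative names for this sketch, not Cassandra's actual code:

```python
from bisect import bisect_left

# Jeff's example vnode ring: (token, host) pairs, kept sorted by token.
RING = sorted([
    (0, "A"), (100, "C"), (300, "A"), (700, "B"), (800, "D"),
    (1000, "B"), (1300, "D"), (1700, "C"), (1800, "B"), (2000, "C"),
    (2100, "E"), (2400, "B"), (2900, "A"), (3000, "D"), (4000, "E"),
])

def replicas(token, rf=3):
    """Walk clockwise from the first ring token at or above `token`,
    collecting distinct hosts until we have `rf` of them (de-duplicating
    hosts that appear more than once along the walk)."""
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token)  # index of first token >= `token`
    owners = []
    for i in range(len(RING)):
        host = RING[(start + i) % len(RING)][1]  # % wraps around the ring
        if host not in owners:
            owners.append(host)
        if len(owners) == rf:
            break
    return owners

print(replicas(570))  # ['B', 'D', 'C'] — matches the walk-through above
```

Note the de-duplication step: the raw walk from 570 visits B (700), D (800), B (1000), D (1300), and only reaches a third distinct host at C (1700).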
>>>>
>>>> There's no desire to enforce a replication mapping where all data on
>>>> A is replicated to the same set of replicas of A, because the point
>>>> of vnodes is to give A many distinct replicas, so that when you
>>>> replace A, it can replicate from "many" other sources (maybe a dozen,
>>>> maybe a hundred). This was super important before 4.0, because each
>>>> replication stream was single-threaded by SENDER, so vnodes let you
>>>> use more than 2-3 cores to re-replicate (in 4.0 it's still
>>>> single-threaded, but we avoid a lot of deserialization, so we can
>>>> saturate a NIC with only a few cores; that was much harder to do
>>>> before).
>>>>
>>>> On Mon, Nov 8, 2021 at 1:44 PM Tech Id <tech.login....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Going through
>>>>> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html
>>>>> .
>>>>>
>>>>> But it is not clear how a node decides where each of its vnodes will
>>>>> be replicated to.
>>>>>
>>>>> As an example from the above page:
>>>>>
>>>>> 1. Why is vnode A present on nodes 1, 2 and 5,
>>>>> 2. BUT vnode B present on nodes 1, 4 and 6?
>>>>>
>>>>> I realize that the diagram is for illustration purposes only, but
>>>>> the idea being conveyed should nevertheless be the same as I
>>>>> suggested above.
>>>>>
>>>>> So how come node 1 decides to put A on itself, 2 and 5, but put B on
>>>>> itself, 4 and 6?
>>>>> Shouldn't there be consistency here, such that all vnodes present on
>>>>> A are replicated to the same set of other nodes?
>>>>>
>>>>> Any clarifications on that would be appreciated.
>>>>>
>>>>> I also understand that different vnodes are replicated to different
>>>>> nodes for performance.
>>>>>
>>>>> But all I want to know is the algorithm that it uses to put them on
>>>>> different nodes.
>>>>>
>>>>> Thanks!
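On the num_tokens follow-up: under the random scheme Jeff describes, the number of tokens in the ring is simply the sum of every node's num_tokens, so a fifth node joining with num_tokens=256 grows a 4 x 256 ring to 1280 tokens; nothing rebalances toward a fixed system-wide total. A minimal sketch of that bootstrap behavior under the purely random assumption (the host names and the `bootstrap` helper are illustrative, not Cassandra's actual code):

```python
import random

# Murmur3Partitioner token range
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def bootstrap(ring, host, num_tokens=256):
    """Claim num_tokens random tokens for a joining host (the purely
    random scheme; newer versions bias selection toward even ranges)."""
    for _ in range(num_tokens):
        ring[random.randint(MIN_TOKEN, MAX_TOKEN)] = host
    return ring

ring = {}
for host in ["node1", "node2", "node3", "node4"]:
    bootstrap(ring, host)
print(len(ring))   # 1024: 4 nodes x 256 tokens (collisions astronomically unlikely)

bootstrap(ring, "node5")
print(len(ring))   # 1280: the ring grew; existing tokens are not redistributed
```

Each joining node only adds its own tokens; the ranges between consecutive tokens (the "vnodes") shrink on average as the ring fills in, which is how load spreads without any fixed vnode count.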