Just to add on to your response: *num_tokens* defines the number of vnodes a
node can have; the default is 256. The *initial token* range is predefined
(for Murmur3, -2**63 to 2**63-1). So if you have a one-node cluster (which
does not make much sense) with num_tokens set to 256, you will have 256
vnodes. Scaling up increases the total number of vnodes linearly with the
number of nodes in the Cassandra cluster, so a 3-node cluster with num_tokens
of 256 will have 256*3 = 768 vnodes. Each vnode takes some chunk of the
initial token range.
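A minimal Python sketch of that arithmetic (illustrative only, not
Cassandra's actual code; the helper name total_vnodes is made up):

MURMUR3_MIN = -2**63        # lower bound of the Murmur3 partitioner range
MURMUR3_MAX = 2**63 - 1     # upper bound

def total_vnodes(node_count, num_tokens=256):
    # Total vnodes grow linearly with the number of nodes.
    return node_count * num_tokens

print(total_vnodes(1))  # 256 -- single-node cluster
print(total_vnodes(3))  # 768 -- 256 * 3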
Regards,
Manish

On Tue, Nov 9, 2021 at 5:55 AM Tech Id <tech.login....@gmail.com> wrote:

> I will try to give a made-up example to show what I understand.
>
> Let us assume our hash function outputs a number between 1 and 10,000,
> so hash(primary-key) is between 1 and 10,000.
>
> Prior to vnodes, the above 1 to 10k range was split among the nodes.
> With vnodes, this 10k range is now split into, say, 20 overall vnodes
> (simplifying with smaller numbers here).
> Hence, a vnode is just a logical way to hold 10k/20 = 500 values of
> hash(primary-key).
> So vnode1 may hold values 1 to 500 of hash(primary-key), vnode2 may hold
> values 501 to 1000, and so on.
>
> Now, each node can hold any of the vnodes.
> So if we have 4 nodes in the above example, each declaring 256 num_tokens,
> then each of them will get an equal number of vnodes = 20/4 = 5 vnodes each.
> The 5 vnodes any node gets are not contiguous.
> So it could be:
> Node 1 = vn1, vn3, vn10, vn15, vn20
> Node 2 = vn2, vn4, vn5, vn18, vn19
> etc.
>
> This way, each node now holds small ranges of data, and each of those
> ranges is called a vnode.
> That is what I understand from the docs.
>
> On Mon, Nov 8, 2021 at 4:11 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> I think your mental model here is trying to map a different db concept
>> (like elasticsearch shards) onto a distributed hash table that doesn't
>> really map that way.
>>
>> There's no physical thing as a vnode. A vnode, as a concept, is "a single
>> node runs multiple tokens and owns multiple ranges". The multiple ranges
>> are the "vnodes". There's not a block of data that is a vnode. There are
>> just hosts and the ranges they own.
>>
>> On Mon, Nov 8, 2021 at 4:07 PM Tech Id <tech.login....@gmail.com> wrote:
>>
>>> Thanks Jeff.
>>>
>>> One follow-up question, please: each node specifies num_tokens.
>>> So if there are 4 nodes and each specifies 256 tokens, then it means
>>> together they are responsible for 1024 vnodes.
>>> Now, when a fifth node joins and has num_tokens set to 256 as well,
>>> does the system then have 1024+256 = 1280 vnodes?
>>>
>>> Or does the number of vnodes remain constant in the system, with the
>>> nodes just dividing it according to their num_tokens weightage?
>>> So in the above example, say the number of vnodes is constant at 1000:
>>> with 4 nodes each specifying 256 vnodes, every node in reality gets
>>> 1000/4 = 250 vnodes;
>>> with 5 nodes each specifying 256 vnodes, every node gets 1000/5 = 200
>>> vnodes.
>>>
>>> On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> When a machine starts for the first time, the joining node basically
>>>> chooses a number of tokens (num_tokens) randomly within the range of the
>>>> partitioner (for Murmur3, -2**63 to 2**63-1), and then bootstraps to
>>>> claim them.
>>>>
>>>> This is sort of a lie; in newer versions, we try to make it a bit more
>>>> deterministic (it tries to ensure an even distribution), but if you just
>>>> think of it as random, it'll make more sense.
>>>>
>>>> The only thing that enforces any meaningful order or distribution here
>>>> is a rack-aware snitch, which will ensure that the RF copies of data land
>>>> on as many racks as possible (which is where it may skip some tokens, if
>>>> they're found to be on the same rack).
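To make the random-selection point above concrete, here is a rough Python
sketch of picking tokens within the partitioner range. This is the
conceptual model only, not Cassandra's real allocator, and the helper name
choose_tokens is made up:

import random

MURMUR3_MIN, MURMUR3_MAX = -2**63, 2**63 - 1

def choose_tokens(num_tokens=256, taken=()):
    # Pick num_tokens distinct random tokens not already claimed elsewhere;
    # newer Cassandra versions bias this toward an even distribution.
    tokens = set()
    while len(tokens) < num_tokens:
        t = random.randint(MURMUR3_MIN, MURMUR3_MAX)
        if t not in taken:
            tokens.add(t)
    return sorted(tokens)

print(choose_tokens(num_tokens=4))  # e.g. four random tokens for a new node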
>>>> On Mon, Nov 8, 2021 at 3:22 PM Tech Id <tech.login....@gmail.com> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>> I think what you explained below is the behavior before and after the
>>>>> vnodes introduction.
>>>>> The vnodes part is clear - how each node holds a small range of tokens
>>>>> and how each node holds a discontiguous set of vnodes.
>>>>>
>>>>> 1. What is not clear is how each node decides what vnodes it will
>>>>> get. If it were contiguous, it would have been easy to understand (like
>>>>> a token range).
>>>>> 2. Also, the original question of this thread: if each node does
>>>>> not replicate all its vnodes to the same 2 nodes (assume RF=2), then how
>>>>> does it decide where each of its vnodes will be replicated to?
>>>>>
>>>>> Maybe the answer to #2 is apparent in the answer to #1.
>>>>> But I would really appreciate it if someone could help me understand
>>>>> the above.
>>>>>
>>>>> On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>> Vnodes are implemented by giving a single process multiple tokens.
>>>>>>
>>>>>> Tokens ultimately determine which data lives on which node. When you
>>>>>> hash a partition key, it gives you a token (let's say 570). The 3
>>>>>> processes that own token 570 are those holding the next 3 tokens in
>>>>>> the ring ABOVE 570, so if you had
>>>>>> A = 0
>>>>>> B = 1000
>>>>>> C = 2000
>>>>>> D = 3000
>>>>>> E = 4000
>>>>>>
>>>>>> the replicas for data for token=570 are B, C, D.
>>>>>>
>>>>>> When you have vnodes and there are lots of tokens (from the same small
>>>>>> set of 5 hosts), it'd look closer to:
>>>>>> A = 0
>>>>>> C = 100
>>>>>> A = 300
>>>>>> B = 700
>>>>>> D = 800
>>>>>> B = 1000
>>>>>> D = 1300
>>>>>> C = 1700
>>>>>> B = 1800
>>>>>> C = 2000
>>>>>> E = 2100
>>>>>> B = 2400
>>>>>> A = 2900
>>>>>> D = 3000
>>>>>> E = 4000
>>>>>>
>>>>>> In this case, the replicas for token=570 are B, D and C (it would go
>>>>>> B, D, B, D, but we would de-duplicate the B and D and look for the next
>>>>>> non-B/non-D host, which is C at 1700).
>>>>>>
>>>>>> If you want to see a view of this in your own cluster, use `nodetool
>>>>>> ring` to see the full token ring.
>>>>>>
>>>>>> There's no desire to enforce a replication mapping where all data on
>>>>>> A is replicated to the same set of replicas of A, because the point of
>>>>>> vnodes is to give A many distinct replicas, so when you replace A, it
>>>>>> can replicate from "many" other sources (maybe a dozen, maybe a
>>>>>> hundred). This was super important before 4.0, because each replication
>>>>>> stream was single threaded by SENDER, so vnodes let you use more than
>>>>>> 2-3 cores to re-replicate (in 4.0, it's still single threaded, but we
>>>>>> avoid a lot of deserialization, so we can saturate a NIC with only a
>>>>>> few cores; that was much harder to do before).
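A small Python sketch of the ring walk described above, using the vnode
example from that reply -- a SimpleStrategy-like placement with no rack
awareness, illustrative only and not Cassandra's actual code:

from bisect import bisect_right

# (token, node) pairs from the vnode example above, sorted by token
RING = [(0, 'A'), (100, 'C'), (300, 'A'), (700, 'B'), (800, 'D'),
        (1000, 'B'), (1300, 'D'), (1700, 'C'), (1800, 'B'), (2000, 'C'),
        (2100, 'E'), (2400, 'B'), (2900, 'A'), (3000, 'D'), (4000, 'E')]

def replicas(token, rf=3):
    # Walk clockwise from the first token strictly above `token`,
    # collecting distinct owners (de-duplicating) until we have rf of them.
    tokens = [t for t, _ in RING]
    start = bisect_right(tokens, token)
    owners = []
    for i in range(len(RING)):
        node = RING[(start + i) % len(RING)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners

print(replicas(570))  # ['B', 'D', 'C'] -- matches the walkthrough above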
>>>>>> On Mon, Nov 8, 2021 at 1:44 PM Tech Id <tech.login....@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am going through
>>>>>>> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html
>>>>>>> but it is not clear how a node decides where each of its vnodes will
>>>>>>> be replicated to.
>>>>>>>
>>>>>>> As an example from the above page:
>>>>>>>
>>>>>>> 1. Why is vnode A present on nodes 1, 2 and 5,
>>>>>>> 2. BUT vnode B is present on nodes 1, 4 and 6?
>>>>>>>
>>>>>>> I realize that the diagram is for illustration purposes only, but
>>>>>>> the idea being conveyed should nevertheless be the same as I suggested
>>>>>>> above.
>>>>>>>
>>>>>>> So how come node 1 decides to put A on itself, 2 and 5, but puts B on
>>>>>>> itself, 4 and 6?
>>>>>>> Shouldn't there be consistency here, such that all vnodes present on
>>>>>>> A are replicated to the same set of other nodes?
>>>>>>>
>>>>>>> Any clarifications on that would be appreciated.
>>>>>>>
>>>>>>> I also understand that different vnodes are replicated to different
>>>>>>> nodes for performance.
>>>>>>> But all I want to know is the algorithm it uses to put them on
>>>>>>> different nodes.
>>>>>>>
>>>>>>> Thanks!