I think your mental model here is trying to map a different DB concept (like Elasticsearch shards) onto a distributed hash table, and it doesn't really map that way.
There's no such physical thing as a vnode. A vnode, as a concept, means "a
single node runs multiple tokens and owns multiple ranges". Those multiple
ranges are the "vnodes". There's no block of data that is a vnode; there are
just hosts and the ranges they own.

On Mon, Nov 8, 2021 at 4:07 PM Tech Id <tech.login....@gmail.com> wrote:

> Thanks Jeff.
>
> One follow-up question, please: each node specifies num_tokens.
> So if there are 4 nodes and each specifies 256 tokens, then together they
> are responsible for 1024 vnodes.
> Now, when a fifth node joins and has num_tokens set to 256 as well, does
> the system have 1024 + 256 = 1280 vnodes?
>
> Or does the number of vnodes stay constant in the system, with the nodes
> dividing it according to their num_tokens weightage?
> So in the above example, say the number of vnodes is constant at 1000.
> With 4 nodes each specifying 256 vnodes, every node in reality gets
> 1000/4 = 250 vnodes.
> With 5 nodes each specifying 256 vnodes, every node gets 1000/5 = 200
> vnodes.
>
> On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> When a machine starts for the first time, the joining node basically
>> chooses a number of tokens (num_tokens) randomly within the range of the
>> partitioner (for murmur3, -2**63 to 2**63 - 1), and then bootstraps to
>> claim them.
>>
>> This is sort of a lie: in newer versions, we try to make it a bit more
>> deterministic (it tries to ensure an even distribution), but if you just
>> think of it as random, it'll make more sense.
>>
>> The only thing that enforces any meaningful order or distribution here
>> is a rack-aware snitch, which will ensure that the RF copies of data
>> land on as many racks as possible (which is where it may skip some
>> tokens, if they're found to be on the same rack).
>>
>> On Mon, Nov 8, 2021 at 3:22 PM Tech Id <tech.login....@gmail.com> wrote:
>>
>>> Thanks Jeff.
>>> I think what you explained below covers both before and after the
>>> introduction of vnodes.
>>> The vnodes part is clear - how each node holds many small token ranges
>>> and how each node holds a discontiguous set of vnodes.
>>>
>>> 1. What is not clear is how each node decides which vnodes it will
>>> get. If it were contiguous, it would be easy to understand (like a
>>> token range).
>>> 2. Also, the original question of this thread: if each node does not
>>> replicate all its vnodes to the same 2 nodes (assume RF=2), how does
>>> it decide where each of its vnodes will be replicated to?
>>>
>>> Maybe the answer to #2 is apparent in the answer to #1.
>>> But I would really appreciate it if someone could help me understand
>>> the above.
>>>
>>> On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Vnodes are implemented by giving a single process multiple tokens.
>>>>
>>>> Tokens ultimately determine which data lives on which node. When you
>>>> hash a partition key, it gives you a token (let's say 570). The 3
>>>> processes that own token 570 are the owners of the next 3 tokens in
>>>> the ring ABOVE 570, so if you had
>>>> A = 0
>>>> B = 1000
>>>> C = 2000
>>>> D = 3000
>>>> E = 4000
>>>>
>>>> the replicas for data with token=570 are B, C, D.
>>>>
>>>> When you have vnodes and there are lots of tokens (from the same
>>>> small set of 5 hosts), it'd look closer to:
>>>> A = 0
>>>> C = 100
>>>> A = 300
>>>> B = 700
>>>> D = 800
>>>> B = 1000
>>>> D = 1300
>>>> C = 1700
>>>> B = 1800
>>>> C = 2000
>>>> E = 2100
>>>> B = 2400
>>>> A = 2900
>>>> D = 3000
>>>> E = 4000
>>>>
>>>> In this case, the replicas for token=570 are B, D and C (the walk
>>>> goes B, D, B, D, but we de-duplicate the repeated B and D and look
>>>> for the next non-B/non-D host, which is C at 1700).
>>>>
>>>> If you want to see a view of this in your own cluster, use `nodetool
>>>> ring` to see the full token ring.
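The clockwise ring walk described above can be sketched in a few lines of Python. This is a simplified, SimpleStrategy-style selection that ignores racks and datacenters; `RING` and `replicas` are illustrative names for this sketch, not Cassandra's actual code:

```python
from bisect import bisect_left

# Jeff's example vnode ring: (token, host) pairs, kept sorted by token.
RING = sorted([
    (0, "A"), (100, "C"), (300, "A"), (700, "B"), (800, "D"),
    (1000, "B"), (1300, "D"), (1700, "C"), (1800, "B"), (2000, "C"),
    (2100, "E"), (2400, "B"), (2900, "A"), (3000, "D"), (4000, "E"),
])

def replicas(token, rf=3):
    """Walk clockwise from the first ring token at or above `token`,
    collecting distinct hosts until we have `rf` of them (de-duplicating
    hosts that appear more than once along the walk)."""
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token)  # index of first token >= `token`
    owners = []
    for i in range(len(RING)):
        host = RING[(start + i) % len(RING)][1]  # % wraps around the ring
        if host not in owners:
            owners.append(host)
        if len(owners) == rf:
            break
    return owners

print(replicas(570))  # ['B', 'D', 'C'] — matches the walk-through above
```

Note the de-duplication step: the raw walk from 570 visits B (700), D (800), B (1000), D (1300), and only reaches a third distinct host at C (1700).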
>>>>
>>>> There's no desire to enforce a replication mapping where all data on
>>>> A is replicated to the same set of replicas of A, because the point
>>>> of vnodes is to give A many distinct replicas, so that when you
>>>> replace A, it can replicate from "many" other sources (maybe a dozen,
>>>> maybe a hundred). This was super important before 4.0, because each
>>>> replication stream was single-threaded by SENDER, so vnodes let you
>>>> use more than 2-3 cores to re-replicate (in 4.0 it's still
>>>> single-threaded, but we avoid a lot of deserialization, so we can
>>>> saturate a NIC with only a few cores; that was much harder to do
>>>> before).
>>>>
>>>> On Mon, Nov 8, 2021 at 1:44 PM Tech Id <tech.login....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Going through
>>>>> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html
>>>>> .
>>>>>
>>>>> But it is not clear how a node decides where each of its vnodes will
>>>>> be replicated to.
>>>>>
>>>>> As an example from the above page:
>>>>>
>>>>> 1. Why is vnode A present on nodes 1, 2 and 5,
>>>>> 2. BUT vnode B present on nodes 1, 4 and 6?
>>>>>
>>>>> I realize that the diagram is for illustration purposes only, but
>>>>> the idea being conveyed should nevertheless be the same as I
>>>>> suggested above.
>>>>>
>>>>> So how come node 1 decides to put A on itself, 2 and 5, but put B on
>>>>> itself, 4 and 6?
>>>>> Shouldn't there be consistency here, such that all vnodes present on
>>>>> A are replicated to the same set of other nodes?
>>>>>
>>>>> Any clarifications on that would be appreciated.
>>>>>
>>>>> I also understand that different vnodes are replicated to different
>>>>> nodes for performance.
>>>>>
>>>>> But all I want to know is the algorithm that it uses to put them on
>>>>> different nodes.
>>>>>
>>>>> Thanks!
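On the num_tokens follow-up: under the random scheme Jeff describes, the number of tokens in the ring is simply the sum of every node's num_tokens, so a fifth node joining with num_tokens=256 grows a 4 x 256 ring to 1280 tokens; nothing rebalances toward a fixed system-wide total. A minimal sketch of that bootstrap behavior under the purely random assumption (the host names and the `bootstrap` helper are illustrative, not Cassandra's actual code):

```python
import random

# Murmur3Partitioner token range
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def bootstrap(ring, host, num_tokens=256):
    """Claim num_tokens random tokens for a joining host (the purely
    random scheme; newer versions bias selection toward even ranges)."""
    for _ in range(num_tokens):
        ring[random.randint(MIN_TOKEN, MAX_TOKEN)] = host
    return ring

ring = {}
for host in ["node1", "node2", "node3", "node4"]:
    bootstrap(ring, host)
print(len(ring))   # 1024: 4 nodes x 256 tokens (collisions astronomically unlikely)

bootstrap(ring, "node5")
print(len(ring))   # 1280: the ring grew; existing tokens are not redistributed
```

Each joining node only adds its own tokens; the ranges between consecutive tokens (the "vnodes") shrink on average as the ring fills in, which is how load spreads without any fixed vnode count.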