Just need to correct one thing: the num_tokens default *was* 256. The new default <https://github.com/apache/cassandra/blob/6709111ed007a54b3e42884853f89cabd38e4316/conf/cassandra.yaml#L27> is 16 in Cassandra 4.
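For reference, the linked conf/cassandra.yaml in 4.x contains `num_tokens: 16`, whereas the 3.x file shipped with `num_tokens: 256`.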

On 09/11/2021 07:17, manish khandelwal wrote:
Just to add on to your response:
*num_tokens* defines the number of vnodes a node can have. The default is 256.
The *initial token* range is predefined (for Murmur3, -2**63 to 2**63-1).
So if you have a one-node cluster (which does not make much sense) with num_tokens as 256, then you will have 256 vnodes. Scaling up will increase the total number of vnodes linearly with respect to the number of nodes in the Cassandra cluster. So for a 3-node cluster with num_tokens as 256, you will have 256*3 = 768 vnodes. All vnodes will take some chunk of the initial token range.
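To put that arithmetic in a quick, purely illustrative Python sketch:

# Total vnode count grows linearly with cluster size.
num_tokens = 256   # the per-node setting discussed above
for nodes in (1, 3, 6):
    print(f"{nodes} node(s) x {num_tokens} tokens = {nodes * num_tokens} vnodes in total")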

Regards
Manish

On Tue, Nov 9, 2021 at 5:55 AM Tech Id <tech.login....@gmail.com> wrote:

    I will try to give a made-up example to show what I understand.

    Let us assume our hash function outputs a number between 1 and 10,000.
    So hash(primary-key) is between 1 and 10,000.

    Prior to vnodes, the above 1 to 10k range was split among the nodes.
    With vnodes, this 10k range is now split into, say, 20 vnodes
    overall (simplifying with smaller numbers here).
    Hence, a vnode is just a logical way to hold 10k/20 = 500 values of
    hash(primary-key).
    So vnode1 may hold values 1 to 500 of hash(primary-key), vnode2
    may hold values 501 to 1000, and so on.

    Now, each node can hold any of these vnodes.
    So if we have 4 nodes in the above example, each declaring the same
    num_tokens, then each of them will get an equal number of vnodes:
    20/4 = 5 vnodes each.
    The 5 vnodes any node gets are not contiguous.
    So it could be:
    Node 1 = vn1, vn3, vn10, vn15, vn20
    Node 2 = vn2, vn4, vn5, vn18, vn19
    etc


    This way, each node now holds small ranges of data, and each of
    those ranges is called a vnode.
    That is what I understand from the docs.
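    In code, that mental model looks roughly like this (all numbers made
    up as above, and purely illustrative):

    # Toy model: split a made-up 1..10,000 hash space into 20 vnode ranges
    HASH_MAX = 10_000
    VNODES = 20
    width = HASH_MAX // VNODES            # 500 hash values per vnode
    vnode_ranges = [(i * width + 1, (i + 1) * width) for i in range(VNODES)]
    # vnode_ranges[0] == (1, 500), vnode_ranges[1] == (501, 1000), ...

    # Hand the 20 vnodes out to 4 nodes, 5 non-contiguous vnodes each
    nodes = {n: [] for n in range(1, 5)}
    for v in range(1, VNODES + 1):
        nodes[(v - 1) % 4 + 1].append(v)  # round-robin spreads each node's vnodes out
    # nodes[1] == [1, 5, 9, 13, 17], nodes[2] == [2, 6, 10, 14, 18], ...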

    On Mon, Nov 8, 2021 at 4:11 PM Jeff Jirsa <jji...@gmail.com> wrote:

        I think your mental model here is trying to map a different db
        concept (like Elasticsearch shards) onto a distributed hash
        table that doesn't really map that way.

        There's no physical thing called a vnode. Vnode, as a concept, is
        "a single node runs multiple tokens and owns multiple ranges".
        Those multiple ranges are the "vnodes". There's not a block of data
        that is a vnode. There are just hosts and the ranges they own.
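        One simplified way to picture that bookkeeping (an assumed
        sketch, not how the code is actually structured): the ring is
        just a sorted token -> host map, and a host's "vnodes" are the
        ranges that end at its tokens.

        # Assumed sketch: the ring as a sorted token -> host mapping (made-up tokens)
        ring = {0: "A", 700: "B", 800: "D", 1000: "B", 2000: "C"}
        tokens = sorted(ring)
        # A host owns the range (previous_token, its_token] for each of its
        # tokens; those ranges are what get called its "vnodes".
        for prev, tok in zip([tokens[-1]] + tokens[:-1], tokens):
            print(f"({prev}, {tok}] -> {ring[tok]}")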

        On Mon, Nov 8, 2021 at 4:07 PM Tech Id
        <tech.login....@gmail.com> wrote:

            Thanks Jeff.

            One follow-up question please: Each node specifies num_tokens.
            So if there are 4 nodes and each specifies 256 tokens,
            then it means together they are responsible for 1024 vnodes.
            Now, when a fifth node joins and has num_tokens set to 256
            as well, then does the system have 1024+256 = 1280 vnodes?

            Or does the number of vnodes remain constant in the system,
            with the nodes just dividing it according to their num_tokens
            weighting?
            So in the above example, say the number of vnodes is constant
            at 1000.
            With 4 nodes each specifying 256 tokens, every node in
            reality gets 1000/4 = 250 vnodes.
            With 5 nodes each specifying 256 tokens, every node gets
            1000/5 = 200 vnodes.



            On Mon, Nov 8, 2021 at 3:33 PM Jeff Jirsa
            <jji...@gmail.com> wrote:

                When a machine starts for the first time, the joining
                node basically chooses a number of tokens (num_tokens)
                randomly within the range of the partitioner (for
                murmur3, -2**63 to 2**63-1), and then bootstraps to
                claim them.

                This is sort of a lie: in newer versions, we try to
                make it a bit more deterministic (it tries to ensure
                an even distribution), but if you just think of it as
                random, it'll make more sense.

                The only thing that enforces any meaningful order or
                distribution here is a rack-aware snitch, which will
                ensure that the RF copies of data land on as many
                racks as possible (which is where it may skip some
                tokens, if they're found to be on the same rack)
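                A rough sketch of the "just think of it as random"
                version (ignoring the newer token allocator and rack
                awareness entirely):

                # Rough sketch only: a joining node picks num_tokens random
                # tokens within the Murmur3 partitioner range.
                import random

                num_tokens = 16
                MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
                new_node_tokens = sorted(random.randint(MIN_TOKEN, MAX_TOKEN)
                                         for _ in range(num_tokens))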


                On Mon, Nov 8, 2021 at 3:22 PM Tech Id
                <tech.login....@gmail.com> wrote:

                    Thanks Jeff.
                    I think what you explained below covers before and
                    after the introduction of vnodes.
                    The vnodes part is clear - how each node holds a
                    small range of tokens and how each node holds a
                    discontiguous set of vnodes.

                     1. What is not clear is how each node decides
                        which vnodes it will get. If it were
                        contiguous, it would have been easy to
                        understand (like a single token range).
                     2. Also, the original question of this thread: if
                        each node does not replicate all its vnodes to
                        the same 2 nodes (assume RF=2), then how does
                        it decide where each of its vnodes will be
                        replicated to?

                    Maybe the answer to #2 is apparent from the answer
                    to #1. But I would really appreciate it if someone
                    could help me understand the above.



                    On Mon, Nov 8, 2021 at 2:00 PM Jeff Jirsa
                    <jji...@gmail.com> wrote:

                        Vnodes are implemented by giving a single
                        process multiple tokens.

                        Tokens ultimately determine which data lives
                        on which node. When you hash a partition key,
                        it gives you a token (let's say 570). The 3
                        processes that own token 570 are the ones with
                        the next 3 tokens in the ring ABOVE 570, so if
                        you had
                        A = 0
                        B = 1000
                        C = 2000
                        D = 3000
                        E = 4000

                        The replicas for data for token=570 are B,C,D


                        When you have vnodes and there's lots of
                        tokens (from the same small set of 5 hosts),
                        it'd look closer to:
                        A = 0
                        C = 100
                        A = 300
                        B = 700
                        D = 800
                        B = 1000
                        D = 1300
                        C = 1700
                        B = 1800
                        C = 2000
                        E = 2100
                        B = 2400
                        A = 2900
                        D = 3000
                        E = 4000

                        In this case, the replicas for token=570 are
                        B, D and C (it would go B, D, B, D, but we
                        would de-duplicate the B and D and look for
                        the next non-B/non-D host, which is C at 1700).

                        If you want to see a view of this in your own
                        cluster, use `nodetool ring` to see the full
                        token ring.
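                        In rough pseudo-Python, the walk described
                        above looks something like this
                        (SimpleStrategy-style; NetworkTopologyStrategy
                        layers rack/DC rules on top):

                        # Sketch: walk clockwise from the partition's token and
                        # take the next RF *distinct* hosts as replicas.
                        from bisect import bisect_left

                        ring = [(0, "A"), (100, "C"), (300, "A"), (700, "B"),
                                (800, "D"), (1000, "B"), (1300, "D"), (1700, "C"),
                                (1800, "B"), (2000, "C"), (2100, "E"), (2400, "B"),
                                (2900, "A"), (3000, "D"), (4000, "E")]
                        tokens = [t for t, _ in ring]

                        def replicas(token, rf=3):
                            start = bisect_left(tokens, token)  # first token >= the partition's token
                            owners = []
                            for i in range(len(ring)):
                                host = ring[(start + i) % len(ring)][1]
                                if host not in owners:          # skip hosts already chosen
                                    owners.append(host)
                                if len(owners) == rf:
                                    break
                            return owners

                        print(replicas(570))  # ['B', 'D', 'C'] for the ring above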

                        There's no desire to enforce a replication
                        mapping where all data on A is replicated to
                        the same set of replicas of A, because the
                        point of vnodes is to give A many distinct
                        replicas so when you replace A, it can
                        replicate from "many" other sources (maybe a
                        dozen, maybe a hundred). This was super
                        important before 4.0, because each replication
                        stream was single threaded by SENDER, so
                        vnodes let you use more than 2-3 cores to
                        re-replicate (in 4.0, it's still single
                        threaded, but we avoid a lot of
                        deserialization, so we can saturate a NIC with
                        only a few cores; that was much harder to do
                        before).


                        On Mon, Nov 8, 2021 at 1:44 PM Tech Id
                        <tech.login....@gmail.com> wrote:


                            Hello,

                            Going through
                            https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/architecture/archDataDistributeDistribute.html.

                            But it is not clear how a node decides
                            where each of its vnodes will be
                            replicated to.

                            As an example from the above page:

                             1. Why is vnode A present in nodes 1,2 and 5
                             2. BUT vnode B is present in nodes 1,4 and 6


                            I realize that the diagram is for
                            illustration purposes only, but the idea
                            being conveyed should nevertheless be the
                            same as I suggested above.

                            So how come node 1 decides to put A on
                            itself, 2 and 5, but put B on itself, 4 and 6?
                            Shouldn't there be consistency here, such
                            that all vnodes present on a given node are
                            replicated to the same set of other nodes?

                            Any clarifications on that would be
                            appreciated.

                            I also understand that different vnodes
                            are replicated to different nodes for
                            performance.
                            But all I want to know is the algorithm
                            that it uses to put them on different nodes.

                            Thanks!
