On the flip side, a large number of vnodes is also beneficial. For example, if you add a node to a 20-node cluster with many vnodes, each existing node will contribute roughly 5% of the data streamed to the new node, and all nodes will participate in streaming (meaning the impact on any single node is limited and completion time is faster).

With a low number of vnodes, only a few nodes participate in streaming, which means that the cluster is left unbalanced and the impact on each streaming node is greater (or that completion time is slower).
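To make that concrete, here is a back-of-envelope sketch in Python (my own illustration, not Cassandra tooling; it assumes data is evenly spread across vnode ranges, RF=3, and the variable names are made up):

n = 20                          # existing nodes
new_node_share = 1 / (n + 1)    # fraction of the ring the new node takes over (~4.8%)
per_donor = new_node_share / n  # data each existing node streams (~0.24% of the total)
print(f"each existing node contributes ~{1 / n:.0%} of the new node's data")   # 5%
print(f"which is only ~{per_donor:.2%} of the whole data set per donor")

# With very few vnodes, roughly the same amount of data has to come from only
# a handful of neighbouring nodes (about RF of them), so per-donor load grows:
rf = 3
print(f"~{new_node_share / rf:.2%} of the whole data set per donor, from ~{rf} nodes only")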


Similarly, with a high number of vnodes, if a node goes down, its work is distributed evenly among all remaining nodes. With a low number of vnodes the cluster becomes unbalanced.


Overall I recommend a high vnode count, and limiting the impact of failures in other ways (a smaller number of large nodes rather than a larger number of small nodes).


By the way, rack-aware topology improves the multi-failure problem, but at the cost of causing imbalance during maintenance operations. I recommend using rack-aware topology only if you really have racks that are single points of failure, not for other reasons.


On 01/17/2018 05:43 AM, kurt greaves wrote:
Even with a low number of vnodes you're asking for a bad time. Even if you managed to get down to 2 vnodes per node, you're still likely to include double the number of nodes in any streaming/repair operation, which will likely be very problematic for incremental repairs, and you still won't be able to easily reason about which nodes are responsible for which token ranges. It's still quite likely that a loss of 2 nodes would mean some portion of the ring is down (at QUORUM). At the moment I'd say steer clear of vnodes and use single tokens if you can; a lot of work still needs to be done to ensure smooth operation of C* while using vnodes, and they are much more difficult to reason about (which is probably the reason no one has bothered to do the math). If you're really keen on the math, your best bet is to do it yourself, because it's not a point of interest for many C* devs, plus a lot of us probably wouldn't remember enough math to know how to approach it.
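A rough upper-bound sketch of that effect (my own approximation for the 50-node, RF=3 cluster discussed here, assuming random token placement and ignoring collisions where neighbouring tokens belong to the same node):

rf = 3
other_nodes = 49
for vnodes in (1, 2, 4, 32, 256):
    # each of a node's tokens shares ranges with up to 2*(RF-1) distinct neighbours
    touched = min(other_nodes, 2 * (rf - 1) * vnodes)
    print(f"{vnodes:>3} token(s)/node -> up to {touched} nodes involved in streaming/repair")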

If you want to get out of this situation you'll need to do a DC migration to a new DC with a better configuration of snitch/replication strategy/racks/tokens.


On 16 January 2018 at 21:54, Kyrylo Lebediev <kyrylo_lebed...@epam.com <mailto:kyrylo_lebed...@epam.com>> wrote:

    Thank you for this valuable info, Jon.
    I guess both you and Alex are referring to the improved vnode
    allocation method
    https://issues.apache.org/jira/browse/CASSANDRA-7032
    <https://issues.apache.org/jira/browse/CASSANDRA-7032> which was
    implemented in 3.0.

    Based on your info and the comments in the ticket, it's really a bad
    idea to have a small number of vnodes for versions using the old
    allocation method because of hot-spots, so it's not an option for
    my particular case (v.2.1) :(

    [As far as I can see from the source code this new method
    wasn't backported to 2.1.]



    Regards,

    Kyrill

    ------------------------------------------------------------------------
    *From:* Jon Haddad <jonathan.had...@gmail.com
    <mailto:jonathan.had...@gmail.com>> on behalf of Jon Haddad
    <j...@jonhaddad.com <mailto:j...@jonhaddad.com>>
    *Sent:* Tuesday, January 16, 2018 8:21:33 PM

    *To:* user@cassandra.apache.org <mailto:user@cassandra.apache.org>
    *Subject:* Re: vnodes: high availability
    We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the
    randomness.  There’s going to be some imbalance; how much imbalance
    you get depends on luck, unfortunately.

    I’m interested to hear your results using 4 tokens, would you mind
    letting the ML know your experience when you’ve done it?

    Jon

    On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev
    <kyrylo_lebed...@epam.com <mailto:kyrylo_lebed...@epam.com>> wrote:

    Agree with you, Jon.
    Actually, this cluster was configured by my 'predecessor' and
    [fortunately for him] we've never met :)
    We're using version 2.1.15 and can't upgrade because of the legacy
    Netflix Astyanax client still in use.

    Below in the thread Alex mentioned that it's recommended to set
    vnodes to a value lower than 256 only for C* versions 3.0+ (the token
    allocation algorithm was improved in C* 3.0).

    Jon,
    Do you have positive experience setting up clusters with vnodes <
    256 on C* 2.1?

    vnodes=32 is also too high for me (we would need many more than
    32 servers per AZ in order to get a 'reliable' cluster);
    vnodes=4 seems better from the HA + balancing trade-off standpoint.

    Thanks,
    Kyrill
    ------------------------------------------------------------------------
    *From:* Jon Haddad <jonathan.had...@gmail.com
    <mailto:jonathan.had...@gmail.com>> on behalf of Jon Haddad
    <j...@jonhaddad.com <mailto:j...@jonhaddad.com>>
    *Sent:* Tuesday, January 16, 2018 6:44:53 PM
    *To:* user
    *Subject:* Re: vnodes: high availability
    While all the token math is helpful, I have to also call out the
    elephant in the room:

    You have not correctly configured Cassandra for production.

    If you had used the correct endpoint snitch & network topology
    strategy, you would be able to withstand the complete failure of
    an entire availability zone at QUORUM, or two if you queried at
    CL=ONE.

    You are correct about 256 tokens causing issues; it’s one of the
    reasons why we recommend 32.  I’m curious how things behave going
    as low as 4, personally, but I haven’t done the math / tested it yet.



    On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev
    <kyrylo_lebed...@epam.com <mailto:kyrylo_lebed...@epam.com>> wrote:

    ...to me it sounds like 'C* isn't as highly available by
    design as it's declared to be'.
    More nodes in a cluster means a higher probability of simultaneous
    node failures.
    And from a high-availability standpoint, it looks like the situation
    is made even worse by the recommended setting of vnodes=256.

    I need to do some math to get numbers/formulas, but right now the
    situation doesn't seem promising.
    In case somebody from the C* developers/architects is reading this
    message, I'd be grateful for some links to the calculations of C*
    reliability on which these decisions were based.

    Regards,
    Kyrill
    ------------------------------------------------------------------------
    *From:* kurt greaves <k...@instaclustr.com
    <mailto:k...@instaclustr.com>>
    *Sent:* Tuesday, January 16, 2018 2:16:34 AM
    *To:* User
    *Subject:* Re: vnodes: high availability
    Yeah it's very unlikely that you will have 2 nodes in the
    cluster with NO intersecting token ranges (vnodes) for an RF of
    3 (probably even 2).

    If node A goes down, all 256 of its ranges will go down, and
    considering there are only 49 other nodes, all with 256 vnodes each,
    it's very likely that every node will be responsible for some range A
    was also responsible for. I'm not sure what the exact math is,
    but think of it this way: on each node, if any of its 256 token
    ranges overlaps on the ring with a token range on node A (i.e. it's
    within the next RF-1 or previous RF-1 token ranges), those token
    ranges will be down at QUORUM.

    Because token range assignment just uses rand() under the hood,
    I'm sure you could prove that it's always going to be the case
    that any 2 nodes going down result in a loss of QUORUM for some
    token range.
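    If you want an empirical check rather than a proof, a small Monte Carlo
    sketch like the one below can estimate it (my own code with made-up names,
    assuming purely random tokens and SimpleStrategy-style placement of
    replicas on the next RF distinct nodes clockwise):

import random
from itertools import combinations

def quorum_loss_fraction(n_nodes=50, vnodes=256, rf=3, seed=42):
    """Fraction of node pairs whose simultaneous failure leaves at least
    one token range without QUORUM, under random token placement."""
    rng = random.Random(seed)
    ring = sorted((rng.random(), node)
                  for node in range(n_nodes) for _ in range(vnodes))
    owners = [node for _, node in ring]
    risky_pairs = set()
    for i in range(len(owners)):
        # replicas of the range ending at token i: its owner plus the
        # next rf-1 distinct nodes clockwise
        replicas, j = [], i
        while len(replicas) < rf:
            node = owners[j % len(owners)]
            if node not in replicas:
                replicas.append(node)
            j += 1
        # losing any 2 of the rf replicas of this range breaks QUORUM
        risky_pairs.update(combinations(sorted(replicas), 2))
    total_pairs = n_nodes * (n_nodes - 1) // 2
    return len(risky_pairs) / total_pairs

for v in (256, 32, 4, 1):
    print(f"vnodes={v:>3}: {quorum_loss_fraction(vnodes=v):.1%} "
          f"of 2-node failures lose QUORUM for some range")

    With 256 vnodes on 50 nodes this should come out at (or extremely close
    to) 100%, matching the intuition above; with only a handful of tokens per
    node it should drop considerably.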

    On 15 January 2018 at 19:59, Kyrylo
    Lebediev <kyrylo_lebed...@epam.com
    <mailto:kyrylo_lebed...@epam.com>> wrote:

        Thanks Alexander!

        I don't have an MS in math either :) Unfortunately.

        Not sure, but it seems to me that the probability of 2/49 in
        your explanation doesn't take into account that vnode
        endpoints are almost evenly distributed across all nodes (at
        least that's what I can see from "nodetool ring" output).

        
        http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
        Of course this vnodes illustration is a theoretical one, but
        there are no 2 nodes on that diagram that can be switched off
        without losing a key range (at CL=QUORUM).

        That's because vnodes_per_node=8 > Nnodes=6.
        As far as I understand, the situation gets worse as the
        vnodes_per_node/Nnodes ratio increases.
        Please, correct me if I'm wrong.

        How would the situation differ from this DataStax example if
        we had a real-life 6-node cluster with 8 vnodes on each node?

        Regards,
        Kyrill

        ------------------------------------------------------------------------
        *From:* Alexander Dejanovski <a...@thelastpickle.com
        <mailto:a...@thelastpickle.com>>
        *Sent:* Monday, January 15, 2018 8:14:21 PM

        *To:* user@cassandra.apache.org
        <mailto:user@cassandra.apache.org>
        *Subject:* Re: vnodes: high availability
        I was corrected off list that the odds of losing data when 2
        nodes are down don't depend on the number of vnodes, but
        only on the number of nodes.
        The more vnodes, the smaller the chunks of data you may
        lose, and vice versa.
        I officially suck at statistics, as expected :)

        On Mon, Jan 15, 2018 at 5:55 PM, Alexander Dejanovski
        <a...@thelastpickle.com <mailto:a...@thelastpickle.com>>
        wrote:

            Hi Kyrylo,

            the situation is a bit more nuanced than shown by the
            Datastax diagram, which is fairly theoretical.
            If you're using SimpleStrategy, there is no rack
            awareness. Since vnode distribution is purely random,
            and the replica for a vnode will be placed on the node
            that owns the next vnode in token order (yeah, that's
            not easy to formulate), you end up with statistics only.

            I kinda suck at maths but I'm going to risk making a
            fool of myself :)

            The odds for one vnode to be replicated on another node
            are, in your case, 2/49 (out of 49 remaining nodes, 2
            replicas need to be placed).
            Given you have 256 vnodes, the odds for at least one
            vnode of a single node to exist on another one are
            256*(2/49) = 10.4%.
            Since the relationship is bi-directional (there are the
            same odds for node B to have a vnode replicated on node
            A as for the opposite), that doubles the odds of 2 nodes
            both being replicas for at least one vnode: 20.8%.

            Having a smaller number of vnodes will decrease the
            odds, just as having more nodes in the cluster.
            (now once again, I hope my maths aren't fully wrong, I'm
            pretty rusty in that area...)

            How many queries that will affect is a different
            question, as it depends on which partitions currently
            exist and are queried in the unavailable token ranges.

            Then you have the rack awareness that comes with
            NetworkTopologyStrategy:
            if the number of replicas (3 in your case) is
            proportional to the number of racks, Cassandra will
            spread replicas across different ones.
            In that situation, you can theoretically lose as many
            nodes as you want in a single rack and still have
            two other replicas available to satisfy quorum in the
            remaining racks.
            If you start losing nodes in different racks, we're back
            to doing maths (but the odds will get slightly different).

            That makes maintenance predictable because you can shut
            down as many nodes as you want in a single rack without
            losing QUORUM.
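            A tiny sketch of why that works (my own illustration, assuming
            RF=3 spread over exactly 3 racks, i.e. one replica per rack,
            which is what rack-aware placement gives you in that case):

rf, racks = 3, 3
replicas_per_rack = rf // racks      # 1 replica of every range in each rack
survivors = rf - replicas_per_rack   # replicas left if a whole rack is lost
quorum = rf // 2 + 1                 # 2 for RF=3
print(survivors, "replicas survive; quorum needs", quorum,
      "-> OK" if survivors >= quorum else "-> DOWN")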

            Feel free to correct my numbers if I'm wrong.

            Cheers,





            On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev
            <kyrylo_lebed...@epam.com
            <mailto:kyrylo_lebed...@epam.com>> wrote:

                Thanks, Rahul.
                But in your example, the simultaneous loss of Node3
                and Node6 leads to the loss of ranges N, C, and J at
                consistency level QUORUM.

                As far as I understand, in case vnodes >
                N_nodes_in_cluster and endpoint_snitch=SimpleSnitch,
                since:

                1) "secondary" replicas are placed on the two nodes
                'next' to the node responsible for a range (in case
                of RF=3)
                2) there are a lot of vnodes on each node
                3) ranges are evenly distributed between vnodes in
                case of SimpleSnitch,

                we get all physical nodes (servers) having mutually
                adjacent token ranges.
                Is that correct?

                At least in the case of my real-world ~50-node cluster
                with vnodes=256 and RF=3, this command:

                nodetool ring | grep '^<ip-prefix>' | awk '{print $1}' | uniq \
                    | grep -B2 -A2 '<ip_of_a_node>' | grep -v '<ip_of_a_node>' \
                    | grep -v '^--' | sort | uniq | wc -l

                returned a number equal to Nnodes - 1, which means
                that I can't switch off any 2 nodes at the same
                time without losing some key range at CL=QUORUM.
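                For what it's worth, a quick back-of-envelope check of that
                observation (my own approximation, assuming roughly random
                and even token placement, 50 nodes, 256 vnodes, RF=3):

n_other = 49                    # the other nodes in a 50-node cluster
positions = 2 * (3 - 1) * 256   # neighbouring ring positions per node (RF=3, 256 vnodes)
p_missed = ((n_other - 1) / n_other) ** positions
print(f"P(a given node shares no range with ours) ~ {p_missed:.1e}")            # ~7e-10
print(f"expected distinct 'neighbour' nodes ~ {n_other * (1 - p_missed):.2f}")  # ~49 = Nnodes - 1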

                Thanks,
                Kyrill
                
                ------------------------------------------------------------------------
                *From:* Rahul Neelakantan <ra...@rahul.be
                <mailto:ra...@rahul.be>>
                *Sent:* Monday, January 15, 2018 5:20:20 PM
                *To:* user@cassandra.apache.org
                <mailto:user@cassandra.apache.org>
                *Subject:* Re: vnodes: high availability
                Not necessarily. It depends on how the token ranges
                for the vNodes are assigned to them. For example,
                take a look at this diagram:
                
                http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

                In the vNode part of the diagram, you will see that
                the loss of Node 3 and Node 6 will still not have any
                effect on Token Range A. But yes, if you lose two
                nodes that both have Token Range A assigned to them
                (say Node 1 and Node 2), you will have
                unavailability with your specified configuration.

                You can sort of circumvent this by using the
                DataStax Java Driver and having the client recognize
                a degraded cluster and operate temporarily in
                downgraded consistency mode

                
                http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html

                - Rahul

                On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo
                Lebediev <kyrylo_lebed...@epam.com
                <mailto:kyrylo_lebed...@epam.com>> wrote:

                    Hi,

                    Let's say we have a C* cluster with following
                    parameters:
                     - 50 nodes in the cluster
                     - RF=3
                     - vnodes=256 per node
                     - CL for some queries = QUORUM
                     - endpoint_snitch = SimpleSnitch

                    Is it correct that any 2 nodes down will cause
                    unavailability of a keyrange at CL=QUORUM?

                    Regards,
                    Kyrill




            --
            -----------------
            Alexander Dejanovski
            France
            @alexanderdeja

            Consultant
            Apache Cassandra Consulting
            http://www.thelastpickle.com <http://www.thelastpickle.com/>

        --
        -----------------
        Alexander Dejanovski
        France
        @alexanderdeja

        Consultant
        Apache Cassandra Consulting
        http://www.thelastpickle.com <http://www.thelastpickle.com/>



