Has anyone had a chance to take a look at this one?

On Mon, 3 Feb 2020 at 23:36, Sergio <lapostadiser...@gmail.com> wrote:
After reading this:

> I would only consider moving a cluster to 4 tokens if it is larger than
> 100 nodes. If you read through the paper that Erick mentioned, written by
> Joe Lynch & Josh Snyder, they show that num_tokens impacts the
> availability of large-scale clusters.

and this:

> With 16 tokens, that is vastly improved, but you still have up to 64 nodes
> each node needs to query against, so you're again hitting every node unless
> you go above ~96 nodes in the cluster (assuming 3 racks / AZs). I wouldn't
> use 16 here, and I doubt any of you would either. I've advocated for 4
> tokens because you'd have overlap with only 16 nodes, which works well for
> small clusters as well as large. Assuming I was creating a new cluster for
> myself (in a hypothetical brand new application I'm building) I would put
> this in production. I have worked with several teams where I helped them
> put 4 token clusters in prod and it has worked very well. We didn't see
> any wild imbalance issues.

from
https://lists.apache.org/thread.html/r55d8e68483aea30010a4162ae94e92bc63ed74d486e6c642ee66f6ae%40%3Cuser.cassandra.apache.org%3E

Sorry guys, but I am now confused about what the recommended approach for
the number of vnodes should be. Right now I am handling a cluster with just
9 nodes and a data size of 100-200 GB per node.

I am seeing some imbalance, and I was worried because I have 256 vnodes:

--  Address      Load        Tokens  Owns  Host ID                               Rack
UN  10.1.30.112  115.88 GiB  256     ?     e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.24.146  127.42 GiB  256     ?     adf40fa3-86c4-42c3-bf0a-0f3ee1651696  us-east-1b
UN  10.1.26.181  133.44 GiB  256     ?     0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
UN  10.1.29.202  113.33 GiB  256     ?     d260d719-eae3-48ab-8a98-ea5c7b8f6eb6  us-east-1b
UN  10.1.31.60   183.63 GiB  256     ?     3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
UN  10.1.24.175  118.09 GiB  256     ?     bba1e80b-8156-4399-bd6a-1b5ccb47bddb  us-east-1b
UN  10.1.29.223  137.24 GiB  256     ?     450fbb61-3817-419a-a4c6-4b652eb5ce01  us-east-1b

The weird part relates to this post
<https://lists.apache.org/thread.html/r92279215bb2e169848cc2b15d320b8a15bfcf1db2dae79d5662c97c5%40%3Cuser.cassandra.apache.org%3E>:
for node 10.1.31.60 the load reported above does not match the output of
"du -sh *", and I was trying to figure out whether the number of vnodes is
the reason.

Two off-topic questions:

1) Does Cassandra keep a copy of the data per rack? If so, would I have to
add 3 racks at a time to a single datacenter to keep things balanced?

2) Is it better to have a single datacenter with a single rack spanning 3
different availability zones and replication factor = 3, or to have one rack
and one availability zone per datacenter and redirect the client to a
fallback datacenter if one of the availability zones becomes unreachable?

Right now we are separating the datacenter that serves reads from the one
that handles the writes...

Thanks for your help!

Sergio
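A minimal sketch for reproducing the load-versus-disk comparison described
above, assuming the default data directory layout and that the commands are
run on the node in question (10.1.31.60):

    # Load reported by Cassandra for this node. "Load" counts live SSTables
    # only and excludes the snapshots subdirectories.
    nodetool status | grep 10.1.31.60

    # Actual on-disk usage of the data directory. This also includes
    # snapshots, backups, and files pending compaction/cleanup, so it can
    # legitimately exceed the "Load" column.
    du -sh /var/lib/cassandra/data

    # Snapshots are a common cause of the discrepancy; this lists their
    # true and total sizes per table.
    nodetool listsnapshots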
On Sun, 2 Feb 2020 at 18:36, Anthony Grasso <anthony.gra...@gmail.com> wrote:

Hi Sergio,

There is a misunderstanding here. My post makes no recommendation for the
value of num_tokens. Rather, it focuses on how to use the
allocate_tokens_for_keyspace setting when creating a new cluster.

Whilst a value of 4 is used for num_tokens in the post, it was chosen for
demonstration purposes. Specifically, it makes:

  - the uneven token distribution in a small cluster very obvious,
  - identifying the endpoints displayed in nodetool ring easy, and
  - the initial_token setup less verbose and easier to follow.

I will add an editorial note to the post with the above information so there
is no confusion about why 4 tokens were used.

I would only consider moving a cluster to 4 tokens if it is larger than
100 nodes. If you read through the paper that Erick mentioned, written by
Joe Lynch & Josh Snyder, they show that num_tokens impacts the availability
of large-scale clusters.

If you are after more details about the trade-offs between different
num_tokens values, please see the discussion on the dev mailing list:
"[Discuss] num_tokens default in Cassandra 4.0
<https://www.mail-archive.com/search?l=dev%40cassandra.apache.org&q=subject%3A%22%5C%5BDiscuss%5C%5D+num_tokens+default+in+Cassandra+4.0%22&o=oldest>".

Regards,
Anthony

On Sat, 1 Feb 2020 at 10:07, Sergio <lapostadiser...@gmail.com> wrote:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
This is the article with the 4-token recommendation.
@Erick Ramirez, which is the dev thread for the default 32 tokens
recommendation?

Thanks,
Sergio

On Fri, 31 Jan 2020 at 14:49, Erick Ramirez <flightc...@gmail.com> wrote:

There's an active discussion going on right now in a separate dev thread.
The current "default recommendation" is 32 tokens. But there's a push for 4
in combination with allocate_tokens_for_keyspace from Jon Haddad & co
(based on a paper from Joe Lynch & Josh Snyder).

If you're satisfied with the results from your own testing, go with 4
tokens. And that's the key -- you must test, test, TEST! Cheers!
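A minimal cassandra.yaml sketch of the 4-tokens-plus-allocate_tokens_for_keyspace
combination referred to above, set before a new node's first start; the
keyspace name is a placeholder and must already exist with its production
replication settings when the node bootstraps:

    # cassandra.yaml (relevant lines only)
    num_tokens: 4

    # Token allocation tries to balance ownership for this keyspace's
    # replication settings instead of picking tokens at random.
    # "my_keyspace" is a placeholder for your real keyspace name.
    allocate_tokens_for_keyspace: my_keyspace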
On Sat, Feb 1, 2020 at 5:17 AM, Arvinder Dhillon <dhillona...@gmail.com> wrote:

What is the recommended number of vnodes now? I read 8 for later Cassandra
3.x. Is the new recommendation 4 now, even in version 3.x (asking for 3.11)?
Thanks

On Fri, Jan 31, 2020 at 9:49 AM, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

These are good clarifications and expansions.

Sean Durity

From: Anthony Grasso <anthony.gra...@gmail.com>
Sent: Thursday, January 30, 2020 7:25 PM
To: user <user@cassandra.apache.org>
Subject: Re: [EXTERNAL] How to reduce vnodes without downtime

Hi Maxim,

Basically what Sean suggested is the way to do this without downtime.

To clarify, the three steps following the "Decommission each node in the DC
you are working on" step should be applied to only the decommissioned nodes.
So where it says "all nodes" or "every node", it applies only to the
decommissioned nodes.

In addition, for the step that says "Wipe data on all the nodes", I would
delete all files in the following directories on the decommissioned nodes:

  - data (usually located in /var/lib/cassandra/data)
  - commitlog (usually located in /var/lib/cassandra/commitlog)
  - hints (usually located in /var/lib/cassandra/hints)
  - saved_caches (usually located in /var/lib/cassandra/saved_caches)

Cheers,
Anthony

On Fri, 31 Jan 2020 at 03:05, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

Your procedure won't work very well. On the first node, if you switched to
4, you would end up with only a tiny fraction of the data (because the other
nodes would still be at 256). I updated a large cluster (over 150 nodes --
2 DCs) to a smaller number of vnodes. The basic outline was this:

  - Stop all repairs.
  - Make sure the app is running against one DC only.
  - Change the replication settings on keyspaces to use only 1 DC
    (basically cutting off the other DC).
  - Decommission each node in the DC you are working on. Because the
    replication settings are changed, no streaming occurs, but it releases
    the token assignments.
  - Wipe data on all the nodes.
  - Update the configuration on every node to your new settings, including
    auto_bootstrap = false.
  - Start all nodes. They will choose tokens, but not stream any data.
  - Update the replication factor for all keyspaces to include the new DC.
  - I disabled binary on those nodes to prevent app connections.
  - Run nodetool rebuild, with the other DC as the source, on as many nodes
    as your system can safely handle until they are all rebuilt.
  - Re-enable binary (and app connections to the rebuilt DC).
  - Turn on repairs.
  - Rest for a bit, then reverse the process for the remaining DCs.

Sean Durity -- Staff Systems Engineer, Cassandra

From: Maxim Parkachov <lazy.gop...@gmail.com>
Sent: Thursday, January 30, 2020 10:05 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] How to reduce vnodes without downtime

Hi everyone,

With the discussion about reducing the default number of vnodes in version
4.0, I would like to ask what the optimal procedure would be to reduce
vnodes in an existing 3.11.x cluster that was set up with the default value
of 256. The cluster has 2 DCs with 5 nodes each and RF=3. There is one more
restriction: I cannot add more servers, nor create an additional DC --
everything is physical. This should be done without downtime.

My idea for such a procedure would be, for each node:

  - decommission the node
  - set auto_bootstrap to true and num_tokens to 4
  - start the node and wait until it joins the cluster
  - run cleanup on the rest of the nodes in the cluster
  - run repair on the whole cluster (not sure if this is needed after cleanup)
  - set auto_bootstrap to false

and repeat for each node. Then a rolling restart of the cluster and a full
cluster repair.

Does this sound right? My concern is that after decommissioning, the node
will start on the same IP, which could create some confusion.

Regards,
Maxim.
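A rough command-level sketch of one node in the DC-by-DC approach Sean
describes above, after replication has been pointed away from that DC and
the node has been decommissioned; the DC and keyspace names are
placeholders, service names and paths may differ, and the whole sequence
should be tested on a non-production cluster first:

    # Stop Cassandra and wipe the old state (service name may differ).
    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/hints/* \
                /var/lib/cassandra/saved_caches/*

    # In cassandra.yaml: num_tokens: 4, auto_bootstrap: false, and
    # optionally allocate_tokens_for_keyspace: <your_keyspace> as
    # discussed earlier in the thread. Then start the node.
    sudo systemctl start cassandra

    # Once the keyspaces' replication includes this DC again:
    nodetool disablebinary                # keep client connections away
    nodetool rebuild -- <other_dc_name>   # stream data from the other DC
    nodetool enablebinary                 # re-enable client connections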