After reading this:

*I would only consider moving a cluster to 4 tokens if it is larger than 100 nodes. If you read through the paper that Erick mentioned, written by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the availability of large scale clusters.*
and this:

*With 16 tokens, that is vastly improved, but you still have up to 64 nodes each node needs to query against, so you're again hitting every node unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs). I wouldn't use 16 here, and I doubt any of you would either. I've advocated for 4 tokens because you'd have overlap with only 16 nodes, which works well for small clusters as well as large. Assuming I was creating a new cluster for myself (in a hypothetical brand new application I'm building) I would put this in production. I have worked with several teams where I helped them put 4-token clusters in prod and it has worked very well. We didn't see any wild imbalance issues.*

from https://lists.apache.org/thread.html/r55d8e68483aea30010a4162ae94e92bc63ed74d486e6c642ee66f6ae%40%3Cuser.cassandra.apache.org%3E

Sorry guys, but I am kind of confused now about which should be the recommended approach for the number of *vnodes*. Right now I am handling a cluster with just 9 nodes and a data size of 100-200 GB per node. I am seeing some imbalance, and I was worried because I have 256 vnodes:

--  Address      Load        Tokens  Owns  Host ID                               Rack
UN  10.1.30.112  115.88 GiB  256     ?     e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.24.146  127.42 GiB  256     ?     adf40fa3-86c4-42c3-bf0a-0f3ee1651696  us-east-1b
UN  10.1.26.181  133.44 GiB  256     ?     0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
UN  10.1.29.202  113.33 GiB  256     ?     d260d719-eae3-48ab-8a98-ea5c7b8f6eb6  us-east-1b
UN  10.1.31.60   183.63 GiB  256     ?     3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
UN  10.1.24.175  118.09 GiB  256     ?     bba1e80b-8156-4399-bd6a-1b5ccb47bddb  us-east-1b
UN  10.1.29.223  137.24 GiB  256     ?     450fbb61-3817-419a-a4c6-4b652eb5ce01  us-east-1b

The weird part is related to this post <https://lists.apache.org/thread.html/r92279215bb2e169848cc2b15d320b8a15bfcf1db2dae79d5662c97c5%40%3Cuser.cassandra.apache.org%3E>: for the node 10.1.31.60 I don't find a match between the reported load and du -sh *, and I was trying to figure out whether the number of vnodes is the reason.

Two off-topic questions:

1) Does Cassandra keep a copy of the data per rack, so that to keep things balanced I would have to add 3 racks at a time in a single datacenter?

2) Is it better to keep a single rack in a single datacenter spanning 3 different availability zones with replication factor = 3, or to have one rack and one availability zone per datacenter, and eventually redirect the client to a fallback datacenter in case one of the availability zones is not reachable? Right now we are separating the datacenter for reads from the one that handles the writes...

Thanks for your help!

Sergio
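A quick way to chase down the load vs. du -sh mismatch Sergio describes is sketched below. The keyspace name and data path are assumptions; one common cause of such a mismatch is snapshots, which count towards du but not towards the reported Load.

```bash
# Sketch: compare what the node reports with what is actually on disk
# ("my_keyspace" and the data path are assumptions -- adjust to your cluster).
nodetool status my_keyspace              # with a keyspace given, the Owns column is populated
nodetool tablestats my_keyspace | grep "Space used"
sudo du -sh /var/lib/cassandra/data/*    # counts snapshots and other leftover files
nodetool listsnapshots                   # snapshots show up in du but not in the reported load
```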
On Sun, 2 Feb 2020 at 18:36, Anthony Grasso <anthony.gra...@gmail.com> wrote:

Hi Sergio,

There is a misunderstanding here. My post makes no recommendation for the value of num_tokens. Rather, it focuses on how to use the allocate_tokens_for_keyspace setting when creating a new cluster.

Whilst a value of 4 is used for num_tokens in the post, it was chosen for demonstration purposes. Specifically, it makes:

- the uneven token distribution in a small cluster very obvious,
- identifying the endpoints displayed in nodetool ring easy, and
- the initial_token setup less verbose and easier to follow.

I will add an editorial note to the post with the above information so there is no confusion about why 4 tokens were used.

I would only consider moving a cluster to 4 tokens if it is larger than 100 nodes. If you read through the paper that Erick mentioned, written by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the availability of large scale clusters.

If you are after more details about the trade-offs between different-sized token values, please see the discussion on the dev mailing list: "[Discuss] num_tokens default in Cassandra 4.0 <https://www.mail-archive.com/search?l=dev%40cassandra.apache.org&q=subject%3A%22%5C%5BDiscuss%5C%5D+num_tokens+default+in+Cassandra+4.0%22&o=oldest>".

Regards,
Anthony

On Sat, 1 Feb 2020 at 10:07, Sergio <lapostadiser...@gmail.com> wrote:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
This is the article with the 4-token recommendation.
@Erick Ramirez, which is the dev thread for the default 32 tokens recommendation?

Thanks,
Sergio

On Fri, 31 Jan 2020 at 14:49, Erick Ramirez <flightc...@gmail.com> wrote:

There's an active discussion going on right now in a separate dev thread. The current "default recommendation" is 32 tokens. But there's a push for 4 in combination with allocate_tokens_for_keyspace from Jon Haddad & co (based on a paper from Joe Lynch & Josh Snyder).

If you're satisfied with the results from your own testing, go with 4 tokens. And that's the key -- you must test, test, TEST! Cheers!

On Sat, Feb 1, 2020 at 5:17 AM Arvinder Dhillon <dhillona...@gmail.com> wrote:

What is the recommended number of vnodes now? I read 8 for later Cassandra 3.x. Is the new recommendation 4 now, even in version 3.x (asking for 3.11)? Thanks

On Fri, Jan 31, 2020 at 9:49 AM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

These are good clarifications and expansions.

Sean Durity

*From:* Anthony Grasso <anthony.gra...@gmail.com>
*Sent:* Thursday, January 30, 2020 7:25 PM
*To:* user <user@cassandra.apache.org>
*Subject:* Re: [EXTERNAL] How to reduce vnodes without downtime

Hi Maxim,

Basically what Sean suggested is the way to do this without downtime.

To clarify, the *three* steps following the "Decommission each node in the DC you are working on" step should be applied to *only* the decommissioned nodes. So where it says "*all nodes*" or "*every node*", it applies only to the decommissioned nodes.

In addition, for the step that says "Wipe data on all the nodes", I would delete all files in the following directories on the decommissioned nodes (a command sketch follows after this message):

- data (usually located in /var/lib/cassandra/data)
- commitlog (usually located in /var/lib/cassandra/commitlog)
- hints (usually located in /var/lib/cassandra/hints)
- saved_caches (usually located in /var/lib/cassandra/saved_caches)

Cheers,
Anthony
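To make the directory list above concrete, wiping one decommissioned node might look like the sketch below. The paths are the usual defaults named in Anthony's list; confirm data_file_directories, commitlog_directory, hints_directory and saved_caches_directory in cassandra.yaml before deleting anything.

```bash
# Sketch: run only on a node that has already been decommissioned.
sudo systemctl stop cassandra            # make sure the node is down first
sudo rm -rf /var/lib/cassandra/data/*
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo rm -rf /var/lib/cassandra/hints/*
sudo rm -rf /var/lib/cassandra/saved_caches/*
```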
On Fri, 31 Jan 2020 at 03:05, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

Your procedure won't work very well. On the first node, if you switched to 4, you would end up with only a tiny fraction of the data (because the other nodes would still be at 256). I updated a large cluster (over 150 nodes, 2 DCs) to a smaller number of vnodes. The basic outline was this (a command sketch follows after this message):

- Stop all repairs
- Make sure the app is running against one DC only
- Change the replication settings on keyspaces to use only 1 DC (basically cutting off the other DC)
- Decommission each node in the DC you are working on. Because the replication settings are changed, no streaming occurs, but it releases the token assignments
- Wipe data on all the nodes
- Update the configuration on every node to your new settings, including auto_bootstrap = false
- Start all nodes. They will choose tokens, but not stream any data
- Update the replication factor for all keyspaces to include the new DC
- I disabled binary on those nodes to prevent app connections
- Run nodetool rebuild with the other DC as the source on as many nodes as your system can safely handle, until they are all rebuilt
- Re-enable binary (and app connections to the rebuilt DC)
- Turn on repairs
- Rest for a bit, then reverse the process for the remaining DC

Sean Durity – Staff Systems Engineer, Cassandra
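For orientation, Sean's outline might translate roughly into the commands below for the DC being converted. The keyspace name, DC names and token count are assumptions, so treat this as a sketch rather than a runbook.

```bash
# On each node of the DC being converted (its DC already removed from replication):
nodetool decommission                    # releases the token assignments; no streaming occurs

# Wipe the data directories (see the earlier sketch), then in cassandra.yaml set:
#   num_tokens: 4
#   auto_bootstrap: false
# and start Cassandra again.

# Once, from any node: add the DC back into the keyspace replication.
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"

# On each node of the rebuilt DC: keep clients off while streaming from the other DC.
SOURCE_DC="DC1"                          # name of the DC that still holds the data (assumption)
nodetool disablebinary
nodetool rebuild -- "$SOURCE_DC"
nodetool enablebinary
```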
*From:* Maxim Parkachov <lazy.gop...@gmail.com>
*Sent:* Thursday, January 30, 2020 10:05 AM
*To:* user@cassandra.apache.org
*Subject:* [EXTERNAL] How to reduce vnodes without downtime

Hi everyone,

with the discussion about reducing the default vnodes in version 4.0, I would like to ask what would be the optimal procedure to reduce the number of vnodes in an existing 3.11.x cluster which was set up with the default value of 256. The cluster has 2 DCs with 5 nodes each and RF=3. There is one more restriction: I cannot add more servers, nor create an additional DC; everything is physical. This should be done without downtime.

My idea for such a procedure would be, for each node (a rough command sketch follows after this message):

- decommission the node
- set auto_bootstrap to true and vnodes to 4
- start and wait until the node joins the cluster
- run cleanup on the rest of the nodes in the cluster
- run repair on the whole cluster (not sure if needed after cleanup)
- set auto_bootstrap to false

repeat for each node, then:

- rolling restart of the cluster
- cluster repair

Does this sound right? My concern is that after decommission, the node will start on the same IP, which could create some confusion.

Regards,
Maxim.
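Purely to make Maxim's per-node idea concrete, it would look roughly like the sketch below (settings are assumptions). Note Sean's reply earlier in the thread: converting one node at a time this way leaves that node with only a tiny fraction of the data, so this is illustration, not a recommendation.

```bash
# Sketch of the per-node loop Maxim describes (illustration only; see Sean's caveat).
nodetool decommission                    # remove the node and release its tokens

# In cassandra.yaml on this node set (values are assumptions):
#   num_tokens: 4
#   auto_bootstrap: true
# then start Cassandra and wait for the node to finish joining.

# On the remaining nodes, drop the data they no longer own:
nodetool cleanup

# Repair, then set auto_bootstrap back to false on the rejoined node:
nodetool repair -pr
```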