After reading this:

*I would only consider moving a cluster to 4 tokens if it is larger than 100 nodes. If you read through the paper that Erick mentioned, written by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the availability of large scale clusters.*
and this:

*With 16 tokens, that is vastly improved, but you still have up to 64 nodes each node needs to query against, so you're again hitting every node unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs). I wouldn't use 16 here, and I doubt any of you would either. I've advocated for 4 tokens because you'd have overlap with only 16 nodes, which works well for small clusters as well as large. Assuming I was creating a new cluster for myself (in a hypothetical brand new application I'm building) I would put this in production. I have worked with several teams where I helped them put 4-token clusters in prod and it has worked very well. We didn't see any wild imbalance issues.*

from https://lists.apache.org/thread.html/r55d8e68483aea30010a4162ae94e92bc63ed74d486e6c642ee66f6ae%40%3Cuser.cassandra.apache.org%3E

Sorry guys, but I am kind of confused now about which should be the recommended approach for the number of *vnodes*. Right now I am handling a cluster with just 9 nodes and a data size of 100-200 GB per node. I am seeing some imbalance, and I was worried because I have 256 vnodes:

--  Address      Load        Tokens  Owns  Host ID                               Rack
UN  10.1.30.112  115.88 GiB  256     ?     e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.24.146  127.42 GiB  256     ?     adf40fa3-86c4-42c3-bf0a-0f3ee1651696  us-east-1b
UN  10.1.26.181  133.44 GiB  256     ?     0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
UN  10.1.29.202  113.33 GiB  256     ?     d260d719-eae3-48ab-8a98-ea5c7b8f6eb6  us-east-1b
UN  10.1.31.60   183.63 GiB  256     ?     3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
UN  10.1.24.175  118.09 GiB  256     ?     bba1e80b-8156-4399-bd6a-1b5ccb47bddb  us-east-1b
UN  10.1.29.223  137.24 GiB  256     ?     450fbb61-3817-419a-a4c6-4b652eb5ce01  us-east-1b

The weird part is related to this post <https://lists.apache.org/thread.html/r92279215bb2e169848cc2b15d320b8a15bfcf1db2dae79d5662c97c5%40%3Cuser.cassandra.apache.org%3E>: for the node 10.1.31.60 I don't find a match between the reported load and du -sh *, and I was trying to figure out whether the number of vnodes is the reason.

Two off-topic questions:

1) Does Cassandra keep a copy of the data per rack, so that to keep things balanced I would have to add 3 racks at a time in a single datacenter?

2) Is it better to keep a single rack in a single datacenter spanning 3 different availability zones with replication factor = 3, or to have one rack and one availability zone per datacenter, and eventually redirect the client to a fallback datacenter in case one of the availability zones is not reachable? Right now we are separating the datacenter for reads from the one that handles the writes...

Thanks for your help!

Sergio
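A quick way to chase down the load vs. du -sh mismatch Sergio describes is sketched below. The keyspace name and data path are assumptions; one common cause of such a mismatch is snapshots, which count towards du but not towards the reported Load.

```bash
# Sketch: compare what the node reports with what is actually on disk
# ("my_keyspace" and the data path are assumptions -- adjust to your cluster).
nodetool status my_keyspace              # with a keyspace given, the Owns column is populated
nodetool tablestats my_keyspace | grep "Space used"
sudo du -sh /var/lib/cassandra/data/*    # counts snapshots and other leftover files
nodetool listsnapshots                   # snapshots show up in du but not in the reported load
```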
On Sun, 2 Feb 2020 at 18:36, Anthony Grasso <anthony.gra...@gmail.com> wrote:

Hi Sergio,

There is a misunderstanding here. My post makes no recommendation for the value of num_tokens. Rather, it focuses on how to use the allocate_tokens_for_keyspace setting when creating a new cluster.

Whilst a value of 4 is used for num_tokens in the post, it was chosen for demonstration purposes. Specifically, it makes:

- the uneven token distribution in a small cluster very obvious,
- identifying the endpoints displayed in nodetool ring easy, and
- the initial_token setup less verbose and easier to follow.

I will add an editorial note to the post with the above information so there is no confusion about why 4 tokens were used.

I would only consider moving a cluster to 4 tokens if it is larger than 100 nodes. If you read through the paper that Erick mentioned, written by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the availability of large scale clusters.

If you are after more details about the trade-offs between different-sized token values, please see the discussion on the dev mailing list: "[Discuss] num_tokens default in Cassandra 4.0 <https://www.mail-archive.com/search?l=dev%40cassandra.apache.org&q=subject%3A%22%5C%5BDiscuss%5C%5D+num_tokens+default+in+Cassandra+4.0%22&o=oldest>".

Regards,
Anthony

On Sat, 1 Feb 2020 at 10:07, Sergio <lapostadiser...@gmail.com> wrote:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
This is the article with the 4-token recommendation.
@Erick Ramirez, which is the dev thread for the default 32 tokens recommendation?

Thanks,
Sergio

On Fri, 31 Jan 2020 at 14:49, Erick Ramirez <flightc...@gmail.com> wrote:

There's an active discussion going on right now in a separate dev thread. The current "default recommendation" is 32 tokens. But there's a push for 4 in combination with allocate_tokens_for_keyspace from Jon Haddad & co (based on a paper from Joe Lynch & Josh Snyder).

If you're satisfied with the results from your own testing, go with 4 tokens. And that's the key -- you must test, test, TEST! Cheers!

On Sat, Feb 1, 2020 at 5:17 AM Arvinder Dhillon <dhillona...@gmail.com> wrote:

What is the recommended number of vnodes now? I read 8 for later Cassandra 3.x. Is the new recommendation 4 now, even in version 3.x (asking for 3.11)? Thanks

On Fri, Jan 31, 2020 at 9:49 AM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

These are good clarifications and expansions.

Sean Durity

*From:* Anthony Grasso <anthony.gra...@gmail.com>
*Sent:* Thursday, January 30, 2020 7:25 PM
*To:* user <user@cassandra.apache.org>
*Subject:* Re: [EXTERNAL] How to reduce vnodes without downtime

Hi Maxim,

Basically what Sean suggested is the way to do this without downtime.

To clarify, the *three* steps following the "Decommission each node in the DC you are working on" step should be applied to *only* the decommissioned nodes. So where it says "*all nodes*" or "*every node*", it applies only to the decommissioned nodes.

In addition, for the step that says "Wipe data on all the nodes", I would delete all files in the following directories on the decommissioned nodes (a command sketch follows after this message):

- data (usually located in /var/lib/cassandra/data)
- commitlog (usually located in /var/lib/cassandra/commitlog)
- hints (usually located in /var/lib/cassandra/hints)
- saved_caches (usually located in /var/lib/cassandra/saved_caches)

Cheers,
Anthony
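To make the directory list above concrete, wiping one decommissioned node might look like the sketch below. The paths are the usual defaults named in Anthony's list; confirm data_file_directories, commitlog_directory, hints_directory and saved_caches_directory in cassandra.yaml before deleting anything.

```bash
# Sketch: run only on a node that has already been decommissioned.
sudo systemctl stop cassandra            # make sure the node is down first
sudo rm -rf /var/lib/cassandra/data/*
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo rm -rf /var/lib/cassandra/hints/*
sudo rm -rf /var/lib/cassandra/saved_caches/*
```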
On Fri, 31 Jan 2020 at 03:05, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

Your procedure won't work very well. On the first node, if you switched to 4, you would end up with only a tiny fraction of the data (because the other nodes would still be at 256). I updated a large cluster (over 150 nodes, 2 DCs) to a smaller number of vnodes. The basic outline was this (a command sketch follows after this message):

- Stop all repairs
- Make sure the app is running against one DC only
- Change the replication settings on keyspaces to use only 1 DC (basically cutting off the other DC)
- Decommission each node in the DC you are working on. Because the replication settings are changed, no streaming occurs, but it releases the token assignments
- Wipe data on all the nodes
- Update the configuration on every node to your new settings, including auto_bootstrap = false
- Start all nodes. They will choose tokens, but not stream any data
- Update the replication factor for all keyspaces to include the new DC
- I disabled binary on those nodes to prevent app connections
- Run nodetool rebuild with the other DC as the source on as many nodes as your system can safely handle, until they are all rebuilt
- Re-enable binary (and app connections to the rebuilt DC)
- Turn on repairs
- Rest for a bit, then reverse the process for the remaining DC

Sean Durity – Staff Systems Engineer, Cassandra
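For orientation, Sean's outline might translate roughly into the commands below for the DC being converted. The keyspace name, DC names and token count are assumptions, so treat this as a sketch rather than a runbook.

```bash
# On each node of the DC being converted (its DC already removed from replication):
nodetool decommission                    # releases the token assignments; no streaming occurs

# Wipe the data directories (see the earlier sketch), then in cassandra.yaml set:
#   num_tokens: 4
#   auto_bootstrap: false
# and start Cassandra again.

# Once, from any node: add the DC back into the keyspace replication.
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"

# On each node of the rebuilt DC: keep clients off while streaming from the other DC.
SOURCE_DC="DC1"                          # name of the DC that still holds the data (assumption)
nodetool disablebinary
nodetool rebuild -- "$SOURCE_DC"
nodetool enablebinary
```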
*From:* Maxim Parkachov <lazy.gop...@gmail.com>
*Sent:* Thursday, January 30, 2020 10:05 AM
*To:* user@cassandra.apache.org
*Subject:* [EXTERNAL] How to reduce vnodes without downtime

Hi everyone,

with the discussion about reducing the default vnodes in version 4.0, I would like to ask what would be the optimal procedure to reduce the number of vnodes in an existing 3.11.x cluster which was set up with the default value of 256. The cluster has 2 DCs with 5 nodes each and RF=3. There is one more restriction: I cannot add more servers, nor create an additional DC; everything is physical. This should be done without downtime.

My idea for such a procedure would be, for each node (a rough command sketch follows after this message):

- decommission the node
- set auto_bootstrap to true and vnodes to 4
- start and wait until the node joins the cluster
- run cleanup on the rest of the nodes in the cluster
- run repair on the whole cluster (not sure if needed after cleanup)
- set auto_bootstrap to false

repeat for each node, then:

- rolling restart of the cluster
- cluster repair

Does this sound right? My concern is that after decommission, the node will start on the same IP, which could create some confusion.

Regards,
Maxim.
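Purely to make Maxim's per-node idea concrete, it would look roughly like the sketch below (settings are assumptions). Note Sean's reply earlier in the thread: converting one node at a time this way leaves that node with only a tiny fraction of the data, so this is illustration, not a recommendation.

```bash
# Sketch of the per-node loop Maxim describes (illustration only; see Sean's caveat).
nodetool decommission                    # remove the node and release its tokens

# In cassandra.yaml on this node set (values are assumptions):
#   num_tokens: 4
#   auto_bootstrap: true
# then start Cassandra and wait for the node to finish joining.

# On the remaining nodes, drop the data they no longer own:
nodetool cleanup

# Repair, then set auto_bootstrap back to false on the rejoined node:
nodetool repair -pr
```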