Your rack awareness problem is described in https://issues.apache.org/jira/browse/CASSANDRA-3810 from 2012.
The fundamental problem is that Cassandra won't move data except during bootstrap, decommission, and explicit moves. The implication is exactly what you've encountered: if you tell Cassandra to use racks, it's going to distribute one replica onto each rack. To make rack awareness work, it has to move that data on bootstrap, otherwise the first read will immediately violate the data placement rules and miss finding the data. Moving the data on bootstrap creates a state transition problem for which nobody has proposed a workaround (because it's very hard given Cassandra's architecture). If you want to use rack awareness, you need to start with the number of racks >= the replication factor (there's a config sketch further down). Any other configuration is moving from an invalid state to a valid state, and that state transition is VERY bumpy.

Beyond that, your replication factors don't make sense (as others have pointed out), and you don't have to pay to be told that - there is free doc content / YouTube content that teaches you the same thing. I'm not a DataStax employee, but their dev rel team has a TON of free content on YouTube that does a very good job describing the tradeoffs.

For your actual problem, beyond the fact that you're streaming a copy of all of the data in the cluster because of the 3810/rack-count problem, the following things are true:

- You'll almost certainly always stream from all the hosts in the cluster because you're using vnodes, and this is one of the fundamental reasons vnodes were introduced: by adding extra ranges to a node, you add extra streaming sources. This is a feature to increase speed, but it also makes it easier to overwhelm the joining node, which leads to the next point.

- You're probably streaming too fast, causing GC pauses that break streaming and cause the joining node to drop from the cluster. I'm not positive here, but if I had to guess based on all the other defaults I see, it may be because it's using STCS and deserializing/reserializing every data file rather than using zero-copy streaming on LCS. The lever here is the stream throughput throttle, set via yaml/nodetool (example below), so the node streams at a consistent rate without overrunning GC on the joining node.

- If it's not that, you're either seeing a bootstrap bug in 4.0 that I haven't seen before (possible), or you're missing another log message somewhere in the cluster. It's not obvious which; I'd probably need to see all of the logs and all of the gossipinfo from the cluster (commands below), but I'm muting this thread after this email.

- Even if you fix the bootstrap thing, as Bowen pointed out, your replication factor probably won't do what you want. It turns out 2 copies in each of 2 DCs CAN be a valid replication factor, but it requires you to understand the availability tradeoffs (if you write QUORUM, you have an outage if either DC is down or the WAN is cut; if you write LOCAL_QUORUM, you have an outage if any host goes down in the main DC) - the arithmetic is written out below.

So if your goal is to reclaim space from HDFS' RF=3 behavior, you're probably solving the wrong problem.
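To make the "racks >= RF" point concrete, here's a rough sketch of the pieces involved; the dc/rack names, keyspace name, and the RF of 3 are placeholders, not your actual topology:

    # cassandra-rackdc.properties on each node (GossipingPropertyFileSnitch)
    # needs at least as many distinct racks as the replication factor, e.g.:
    #   dc=dc1
    #   rack=rack1      (rack2, rack3, ... on the nodes in the other racks)

    # keyspace using NetworkTopologyStrategy, with RF <= number of racks in dc1:
    cqlsh -e "CREATE KEYSPACE example_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"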
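The throttle I'm referring to in the "streaming too fast" bullet is the outbound stream throughput, adjustable at runtime with nodetool or persistently in cassandra.yaml. The value of 50 is only an illustration; tune it down until the joining node stops GCing itself out of the cluster:

    # at runtime, on the nodes doing the streaming (value in Mbit/s, 0 = unthrottled)
    nodetool getstreamthroughput
    nodetool setstreamthroughput 50

    # persistent equivalent in cassandra.yaml (4.0 names):
    # stream_throughput_outbound_megabits_per_sec: 50
    # inter_dc_stream_throughput_outbound_megabits_per_sec: 50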
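For the "missing log message / gossipinfo" bullet, this is roughly what I'd collect from every node before guessing any further (log path assumes a default package install):

    nodetool gossipinfo          # status/generation/tokens as each node sees the ring
    nodetool netstats            # streaming progress, run on the joining node
    nodetool describecluster     # schema agreement and unreachable nodes
    grep -iE 'error|stream|bootstrap' /var/log/cassandra/system.log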
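And the availability arithmetic behind the last bullet (2 copies in each of 2 DCs), written out:

    total replicas = 2 DCs x RF 2 = 4
    QUORUM       = floor(4/2) + 1 = 3   -> tolerates one replica down; losing a DC or the
                                           WAN takes out 2, so QUORUM requests fail
    LOCAL_QUORUM = floor(2/2) + 1 = 2   -> needs both local replicas up for every token range,
                                           so a single host down in that DC causes failures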
On Tue, Jul 12, 2022 at 8:01 AM Marc Hoppins <marc.hopp...@eset.com> wrote:

> I posted system log data, GC log data, debug log data, nodetool data. I believe I had described the situation more than adequately. Yesterday, I was asking what I assumed to be reasonable questions regarding the method for adding new nodes to a new rack.
>
> Forgive me if it sounds unreasonable but I asked the same question again: your response regarding replication suggests that multiple racks in a datacentre is ALWAYS going to be the case when setting up a Cassandra cluster. Therefore, I can only assume that when setting up a new cluster there absolutely MUST be more than one rack. The question I was asking yesterday regarding adding new nodes in a new rack has never been adequately answered here, and the only information I can find elsewhere clearly states that it is not recommended to add more than one new node at a time to maintain data/token consistency.
>
> So how is it possible to add new hardware when one-at-a-time will absolutely overload the first node added? That seems like a reasonable, general question which anyone considering employing the software is going to ask.
>
> The reply to suggest that folk head off and pay for a course when there are ‘pre-sales’ questions is not a practical response, as any business is unlikely to be spending speculative money.
>
> *From:* Jeff Jirsa <jji...@gmail.com>
> *Sent:* Tuesday, July 12, 2022 4:43 PM
> *To:* cassandra <user@cassandra.apache.org>
> *Cc:* Bowen Song <bo...@bso.ng>
> *Subject:* Re: Adding nodes
>
> On Tue, Jul 12, 2022 at 7:27 AM Marc Hoppins <marc.hopp...@eset.com> wrote:
>
> > I was asking the questions but no one cared to answer.
>
> This is probably a combination of "it is really hard to answer a question with insufficient data" and your tone. Nobody here gets paid to help you solve your company's problems except you.