Your rack awareness problem is described in
https://issues.apache.org/jira/browse/CASSANDRA-3810 from 2012.

The fundamental problem is that Cassandra won't move data except during
bootstrap, decommission, and explicit moves.  The implication here is
exactly what you've encountered - if you tell Cassandra to use racks, it's
going to distribute one replica onto each rack. To make rack awareness
work, it has to move that data on bootstrap, otherwise the first read will
immediately violate the data placement rules and miss the data. When you
move the data on bootstrap, you have a state transition problem for which
nobody has proposed a workaround (because it's very hard given Cassandra's
architecture). If you want to use rack awareness, you need to start with
# of racks >= replication factor. Any other configuration is moving from
an invalid state to a valid state, and that state transition is VERY bumpy.
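To make that concrete, here's a minimal sketch of a layout where rack
awareness works from day one (the keyspace and DC names are made up; the
point is that the rack count in each DC is >= the RF for that DC):

```sql
-- Hypothetical keyspace. Assumes 'dc1' already contains at least 3 racks
-- (racks are whatever cassandra-rackdc.properties says on each node), so
-- NetworkTopologyStrategy can put one replica per rack from the start.
CREATE KEYSPACE my_ks
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
  };
```

Start with fewer than 3 racks in dc1 and grow later, and you're in the
3810 situation above.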

Beyond that, your replication factors don't make sense (as others have
pointed out), and you don't have to pay to be told that - you can find free
doc content / YouTube content that teaches you the same thing. I'm not a
DataStax employee, but their dev rel team has a TON of free content on
YouTube that does a very good job describing the tradeoffs.

For your actual problem, beyond the fact that you're streaming a copy of
all of the data in the cluster because of the 3810/rack count problem, the
following things are true:
- You'll almost certainly always stream from all the hosts in the cluster
because you're using vnodes, and this is one of the fundamental reasons
vnodes were introduced: by adding extra ranges to a node, you add extra
streaming sources. This is a feature to increase speed, but it cuts both
ways.
- You're probably streaming too fast, causing GC pauses that break
streaming and cause the joining node to drop from the cluster. I'm not
positive here, but if I had to guess based on all the other defaults I see,
it may be because it's using STCS and deserializing/reserializing every
data file rather than using the zero copy streaming on LCS. The fix is to
throttle: set the stream throughput via the yaml or nodetool, so it
streams at a consistent rate without overrunning GC on the joining node.
- If it's not that, you're either seeing a bootstrap bug in 4.0 that I
haven't seen before (possible), or you're missing another log message
somewhere in the cluster. It's not obvious exactly which - I'd probably
need to see all of the logs and all of the gossipinfo from the cluster,
but I'm muting this thread after this email.
- Even if you fix the bootstrap thing, as Bowen pointed out, your
replication factor probably won't do what you want. It turns out 2 copies
in each of 2 DCs CAN be a valid replication factor, but it requires that
you understand the availability tradeoffs: if you write at QUORUM, you
have an outage if either DC is down or the WAN is cut; if you write at
LOCAL_QUORUM, you have an outage if any host goes down in the main DC. So
if your goal is to reclaim space from HDFS's RF=3 behavior, you're
probably solving the wrong problem.
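On the throttle specifically, a sketch of what I mean (the numbers are
illustrative, not recommendations - tune to your hardware):

```shell
# In the 4.0 cassandra.yaml the outbound stream rate is controlled by:
#   stream_throughput_outbound_megabits_per_sec: 200   (the default)
# You can also change it at runtime without a restart, e.g. down to
# 100 megabits/sec while the new node joins:
nodetool setstreamthroughput 100

# And confirm the current value:
nodetool getstreamthroughput
```

Lower it until the joining node's GC stays healthy through the whole
bootstrap, then raise it back afterwards.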
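And to spell out the quorum math behind that last point - it's plain
arithmetic, nothing Cassandra-specific:

```python
# Quorum over n replicas is a strict majority: floor(n/2) + 1.
def quorum(replicas: int) -> int:
    return replicas // 2 + 1

# RF = {'dc1': 2, 'dc2': 2} -> 4 replicas total.
# QUORUM needs 3 of the 4, so losing either DC (2 replicas at once)
# makes QUORUM writes fail.
print(quorum(4))  # 3

# LOCAL_QUORUM against a DC with RF=2 needs 2 of 2 local replicas,
# so a single down host in that DC makes LOCAL_QUORUM writes fail.
print(quorum(2))  # 2
```

With RF=3 per DC, LOCAL_QUORUM needs 2 of 3 and tolerates a host down -
which is exactly the property RF=2 gives up.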






On Tue, Jul 12, 2022 at 8:01 AM Marc Hoppins <marc.hopp...@eset.com> wrote:

> I posted system log data, GC log data, debug log data, nodetool data.  I
> believe I had described the situation more than adequately. Yesterday, I
> was asking what I assumed to be reasonable questions regarding the method
> for adding new nodes to a new rack.
>
>
>
> Forgive me if it sounds unreasonable but I asked the same question again:
> your response regarding replication suggests that multiple racks in a
> datacentre is ALWAYS going to be the case when setting up a Cassandra
> cluster. Therefore, I can only assume that when setting up a new cluster
> there absolutely MUST be more than one rack.  The question I was asking
> yesterday regarding adding new nodes in a new rack has never been
> adequately answered here and the only information I can find elsewhere
> clearly states that it is not recommended to add more than one new node at
> a time to maintain data/token consistency.
>
>
>
> So how is it possible to add new hardware when one-at-a-time will
> absolutely overload the first node added?  That seems like a reasonable,
> general question which anyone considering employing the software is going
> to ask.
>
>
>
> The reply to suggest that folk head off and pay for a course when there are
> ‘pre-sales’ questions is not a practical response as any business is
> unlikely to be spending speculative money.
>
>
>
> *From:* Jeff Jirsa <jji...@gmail.com>
> *Sent:* Tuesday, July 12, 2022 4:43 PM
> *To:* cassandra <user@cassandra.apache.org>
> *Cc:* Bowen Song <bo...@bso.ng>
> *Subject:* Re: Adding nodes
>
>
>
>
>
>
>
>
>
> On Tue, Jul 12, 2022 at 7:27 AM Marc Hoppins <marc.hopp...@eset.com>
> wrote:
>
>
>
> I was asking the questions but no one cared to answer.
>
>
>
> This is probably a combination of "it is really hard to answer a question
> with insufficient data" and your tone. Nobody here gets paid to help you
> solve your company's problems except you.
>
>
>
