Re: Adding New Node Issue

Ali Akhtar Thu, 23 Apr 2015 13:24:18 -0700

What version are you running?

On Fri, Apr 24, 2015 at 12:51 AM, Thomas Miller <thomas.mil...@wda.com>
wrote:


> Jeff,
>
>
>
> Thanks for the response. I had come across that as a possible solution
> previously, but there are discrepancies that lead me to think that it is
> not the issue.
>
>
>
> It appears our stream throughput is currently set to 200Mbps, but unless
> that same limit also applies to the traffic Cassandra uses to serve its
> data, it does not seem like 200Mbps of bandwidth usage would overwhelm the
> nodes. The 200Mbps bandwidth usage is only on two of the four nodes when
> adding the new node, so it seems like the other two nodes should still be
> able to handle requests. When my backups run at night they hit around
> 300Mbps bandwidth usage and we have no timeouts at all.
>
>
>
> Then there is the question of why the timeouts did not stop when we
> stopped the Cassandra service on the joining node. Opscenter did not show
> that node anymore and “nodetool status” verified that. We were thinking
> that maybe gossip caused the existing nodes to think that a node was still
> joining even though the new node had been shut down, but that is not
> confirmed.
>
>
>
>
>
> Thanks,
>
> Thomas Miller
>
>
>
> *From:* Jeff Ferland [mailto:j...@tubularlabs.com]
> *Sent:* Thursday, April 23, 2015 2:46 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Adding New Node Issue
>
>
>
> Sounds to me like your stream throughput value is too high. `nodetool
> getstreamthroughput` will show the current value and `nodetool
> setstreamthroughput` will update it live. Limit it to something lower so
> that the system isn’t overloaded by streaming. The bottleneck that slows
> things down is most likely to be disk or network.
>
>
>
> On Apr 23, 2015, at 11:18 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
>
>
>
> Hello,
>
>
>
> Yesterday we ran into a serious issue while joining a new node to our
> existing 4-node Cassandra cluster (version 2.0.7). The average node data
> size is 152GB with a replication factor of 3. The node was prepped just
> as the following document describes -
> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html
> .
>
>
>
> When I started the new node, Opscenter showed the node as “Active –
> Joining” but we immediately began getting timeouts on our websites because
> lookups were taking too long. On the 4 existing nodes the network interface
> showed about 200Mbps being used, the CPU never went over 20% and the memory
> usage barely changed.
>
>
>
> The question I have is, does adding a new node cause some sort of
> throttling that would keep our web servers from functioning normally? The
> only thing we can think of that might have had some effect is that a
> repair was just finishing on one of the nodes when the new node was added.
> The repair ended up finishing while the new node was in the joining state
> but the timeouts did not go away afterwards.
>
>
>
> Our impatience got the better of us so we ended up stopping the Cassandra
> service on the new node because it appeared, at the time, to have stalled
> out in the joining state and nothing more was being streamed to it. But
> even stopping it did not allow the cluster to resume its normal operation
> and we were still getting timeouts. We tried rebooting our web servers and
> then our 4 existing Cassandra servers but none of it worked.
>
>
>
> We never saw any errors/exceptions in the Cassandra and system logs at
> all. It completely mystified us why there would be no errors/exceptions
> unless this was working as intended.
>
>
>
> We ended up getting it working by adding the new node again and just
> letting it go until it finally finished joining, and everything magically
> started working again. Towards the end it was barely streaming anything
> (Opscenter was not showing any running streams); we could only tell by
> checking the size of the data directory, which we saw growing and
> shrinking ever so slightly.
>
>
>
> We have to add one more new node and then decommission two of the existing
> nodes so we can perform some hardware maintenance on the server those two
> existing nodes are on, but we are hesitant to try this again without
> scheduling a maintenance window for this node add and decommissioning
> process.
>
>
>
> So to reiterate what I am asking, does adding a node cause the cluster to
> become unusable or time out? Also, can we expect the decommissioning of
> the other two nodes to cause the same type of downtime, since they have to
> stream their content out to the other nodes in the cluster?
>
>
>
> Thanks,
>
> Thomas Miller
>
>
>
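
For anyone weighing a lower stream throughput value, here is a rough back-of-the-envelope sketch in Python of what the numbers in this thread imply. It is only arithmetic, not anything Cassandra runs. It assumes the throughput setting is in megabits per second, that the full 152GB average node size has to be streamed (the real amount the new node receives is not stated in the thread), and that two nodes stream at the cap at once. The function names are made up for illustration.

```python
def mbps_to_mbytes_per_sec(megabits_per_sec: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_sec / 8.0


def hours_to_stream(data_gb: float, megabits_per_sec: float, senders: int = 1) -> float:
    """Idealized hours to move data_gb (decimal GB, 1 GB = 1000 MB).

    Assumes `senders` nodes each stream at exactly the configured cap,
    with no disk, compaction, or protocol overhead. Real joins are slower.
    """
    if megabits_per_sec <= 0 or senders <= 0:
        raise ValueError("throughput and senders must be positive")
    total_mbytes_per_sec = mbps_to_mbytes_per_sec(megabits_per_sec) * senders
    seconds = data_gb * 1000.0 / total_mbytes_per_sec
    return seconds / 3600.0


if __name__ == "__main__":
    # 152 GB and 2 senders are assumptions taken loosely from the thread.
    for cap in (200, 100, 50):
        per_node = mbps_to_mbytes_per_sec(cap)
        hours = hours_to_stream(152, cap, senders=2)
        print(f"cap {cap} Mbps = {per_node:.2f} MB/s per node, about {hours:.2f} h to stream")
```

The point of the sketch is the trade-off: halving the cap roughly doubles the best-case join time, so a lower value eases the load on the existing nodes at the cost of a longer joining window.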
