Jeff,

Thanks for the response. I had come across that as a possible solution 
previously, but there are discrepancies that lead me to think it is not the 
issue.

It appears our stream throughput is currently set to 200Mbps, but unless the 
Cassandra service is bound by that same limit when serving its data, it does 
not seem like 200Mbps of bandwidth usage would overwhelm the nodes. The 200Mbps 
of bandwidth usage was only on two of the four nodes while adding the new node, 
so it seems like the other two nodes should still have been able to handle 
requests. When my backups run at night they hit around 300Mbps of bandwidth 
usage and we have no timeouts at all.
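
As a rough illustration of the numbers above (not part of the original 
exchange): the stream throughput limit is expressed in megabits per second, so 
200Mbps is about 25 MB/s. The Python sketch below converts the units and 
estimates a streaming time, assuming the joining node receives about one node's 
worth of data (152GB, the average quoted in the original message below) from a 
single throttled source. Both are simplifying assumptions, and the function 
names are made up for this example.

```python
def mbps_to_mbytes_per_sec(mbps):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


def hours_to_stream(data_gb, mbps):
    """Estimate hours needed to move data_gb at a sustained rate of mbps.

    Assumes decimal units (1 GB = 1000 MB) and a single source streaming
    at the full limit the whole time, which a real bootstrap will not do.
    """
    seconds = (data_gb * 1000) / mbps_to_mbytes_per_sec(mbps)
    return seconds / 3600


# Hypothetical inputs: 152 GB at the 200 Mbps limit, then at the 300 Mbps
# reached by the nightly backups.
for rate in (200, 300):
    print(f"{rate} Mbps = {mbps_to_mbytes_per_sec(rate):.1f} MB/s, "
          f"about {hours_to_stream(152, rate):.1f} hours for 152 GB")
```

Under these assumptions the 200Mbps case works out to roughly 1.7 hours, which 
is only a lower bound on how long a join might be expected to take.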

Then there is the question of why the timeouts did not stop when we stopped the 
Cassandra service on the joining node. Opscenter did not show that node 
anymore and “nodetool status” verified that. We were thinking that maybe gossip 
caused the existing nodes to think that a node was still joining even though 
the new node had been shut down, but that is not confirmed.


Thanks,
Thomas Miller

From: Jeff Ferland [mailto:j...@tubularlabs.com]
Sent: Thursday, April 23, 2015 2:46 PM
To: user@cassandra.apache.org
Subject: Re: Adding New Node Issue

Sounds to me like your stream throughput value is too high. `nodetool 
getstreamthroughput` will show the current value and `nodetool 
setstreamthroughput` will update it live. Limit it to something lower so that 
the system isn’t overloaded by streaming. The bottleneck that slows things down 
is most likely to be disk or network.
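
A minimal sketch of that sequence, for illustration only: the Python helper 
below just builds the command lines to check the limit, lower it, and check it 
again. It assumes `nodetool` is on the PATH of the node being adjusted, that 
the value is given in megabits per second, and that 50 is an arbitrary example 
value rather than a recommendation. The helper name is hypothetical.

```python
def stream_throughput_commands(new_limit_mbps):
    """Return nodetool command lines to inspect, lower, and re-check the limit.

    new_limit_mbps is assumed to be a whole number of megabits per second.
    """
    if new_limit_mbps < 0:
        raise ValueError("throughput limit must not be negative")
    return [
        ["nodetool", "getstreamthroughput"],
        ["nodetool", "setstreamthroughput", str(int(new_limit_mbps))],
        ["nodetool", "getstreamthroughput"],
    ]


# Print the commands instead of running them, so the sketch is safe to try
# on a machine without Cassandra installed.
for cmd in stream_throughput_commands(50):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` on a node, and 
the change would need to be repeated on every node that streams data.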

On Apr 23, 2015, at 11:18 AM, Thomas Miller <thomas.mil...@wda.com> wrote:

Hello,

Yesterday we ran into a serious issue while joining a new node to our existing 
4-node Cassandra cluster (version 2.0.7). The average node data size is 152GB 
with a replication factor of 3. The node was prepped as described in the 
following document: 
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html

When I started the new node, Opscenter showed the node as “Active – Joining” 
but we immediately began getting timeouts on our websites because lookups were 
taking too long. On the 4 existing nodes the network interface showed about 
200Mbps being used, the CPU never went over 20% and the memory usage barely 
changed.

The question I have is, does adding a new node cause some sort of throttling 
that would keep our webservers from functioning normally? The only thing that 
we can think of that might have had some effect was that a repair was just 
finishing on one of the nodes when the new node was added. The repair ended up 
finishing while the new node was in the joining state but the timeouts did not 
go away afterwards.

Our impatience got the better of us so we ended up stopping the Cassandra 
service on the new node because it appeared, at the time, to have stalled out 
in the joining state and nothing more was being streamed to it. But even 
stopping it did not allow the cluster to resume its normal operation and we 
were still getting timeouts. We tried rebooting our web servers and then our 4 
existing Cassandra servers but none of it worked.

We never saw any errors/exceptions in the Cassandra and system logs at all. It 
completely mystified us why there would be no errors/exceptions unless this was 
working as intended.

We ended up getting it working by adding the new node again and just letting it 
go until it finally finished joining, and everything magically started working 
again. Towards the end it was barely streaming anything: Opscenter was not 
showing any running streams, and when we checked the size of the data directory 
we saw it growing and shrinking ever so slightly.

We have to add one more new node and then decommission two of the existing 
nodes so we can perform some hardware maintenance on the server those two 
existing nodes are on, but we are hesitant to try this again without scheduling 
a maintenance window for this node add and decommissioning process.

So, to reiterate what I am asking: does adding a node cause the cluster to 
become unusable or time out? Also, can we expect decommissioning the other two 
nodes to cause the same kind of downtime, since they have to stream their 
content out to the other nodes in the cluster?

Thanks,
Thomas Miller
