Thomas, just in case you missed it: there is a bug with the stream throughput setting prior to 2.0.13, here is the link: https://issues.apache.org/jira/browse/CASSANDRA-8852
So it may happen that you are actually setting it to 1600 megabytes.

Andrei
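P.S. Checking and adjusting the live value looks roughly like this (the 100 Mb/s figure below is only an example; pick something that suits your hardware, and given the bug above, double-check what the node reports back after you set it):

    nodetool getstreamthroughput        # show the current streaming limit in Mb/s
    nodetool setstreamthroughput 100    # lower the live cap; takes effect immediately

nodetool only changes the running process, so to make the setting permanent also set stream_throughput_outbound_megabits_per_sec in cassandra.yaml.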
On Thu, Apr 23, 2015 at 11:22 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> What version are you running?
>
> On Fri, Apr 24, 2015 at 12:51 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
>
>> Jeff,
>>
>> Thanks for the response. I had come across that as a possible solution previously, but there are discrepancies that lead me to think that is not the issue.
>>
>> It appears our stream throughput is currently set to 200Mbps, but unless the Cassandra service shares that same throughput limitation to serve its data as well, it does not seem like 200Mbps of bandwidth usage would overwhelm the nodes. The 200Mbps of bandwidth usage is only on two of the four nodes when adding the new node; it seems like the other two nodes should still be able to handle requests. When my backups run at night they hit around 300Mbps of bandwidth usage and we have no timeouts at all.
>>
>> Then there is the question of why, when we stopped the Cassandra service on the joining node, the timeouts did not stop. Opscenter did not show that node anymore and “nodetool status” verified that. We were thinking that maybe gossip caused the existing nodes to think that there was still a node joining, even though the new node was shut down and not actually joining, but that is not confirmed.
>>
>> Thanks,
>> Thomas Miller
>>
>> *From:* Jeff Ferland [mailto:j...@tubularlabs.com]
>> *Sent:* Thursday, April 23, 2015 2:46 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Adding New Node Issue
>>
>> Sounds to me like your stream throughput value is too high. `nodetool getstreamthroughput` and `nodetool setstreamthroughput` will update this value live. Limit it to something lower so that the system isn’t overloaded by streaming. The bottleneck that slows things down is most likely to be disk or network.
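To see whether streaming is actually what is saturating the nodes while the new one is joining, something like this is usually enough (assuming ordinary Linux tooling is available on the nodes):

    nodetool netstats    # lists the active streaming sessions and their progress
    iostat -x 5          # rough view of disk utilization on each node

If streams are running but neither the disks nor the network look busy, the throughput cap above is the first thing I would check.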
>> On Apr 23, 2015, at 11:18 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
>>
>> Hello,
>>
>> Yesterday we ran into a serious issue while joining a new node to our existing 4-node Cassandra cluster (version 2.0.7). The average node data size is 152 GB with a replication factor of 3. The node was prepped just as the following document describes - http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html.
>>
>> When I started the new node, Opscenter showed the node as “Active – Joining”, but we immediately began getting timeouts on our websites because lookups were taking too long. On the 4 existing nodes the network interface showed about 200Mbps being used, the CPU never went over 20%, and the memory usage barely changed.
>>
>> The question I have is: does adding a new node cause some sort of throttling that would prevent our webservers from functioning as normal? The only thing we can think of that might have had some effect was that a repair was just finishing on one of the nodes when the new node was added. The repair ended up finishing while the new node was in the joining state, but the timeouts did not go away afterwards.
>>
>> Our impatience got the better of us, so we ended up stopping the Cassandra service on the new node because it appeared, at the time, to have stalled out in the joining state and nothing more was being streamed to it. But even stopping it did not allow the cluster to resume its normal operation and we were still getting timeouts. We tried rebooting our web servers and then our 4 existing Cassandra servers, but none of it worked.
>>
>> We never saw any errors/exceptions in the Cassandra and system logs at all. It completely mystified us why there would be no errors/exceptions unless this was working as intended.
>>
>> We ended up getting it working by adding the new node again and just letting it go until it finally finished joining, and everything magically started working again. We noticed towards the end that it was barely streaming anything (Opscenter was not showing any running streams by then); checking the size of the data directory, we saw it growing and shrinking ever so slightly.
>>
>> We have to add one more new node and then decommission two of the existing nodes so we can perform some hardware maintenance on the server those two existing nodes are on, but we are hesitant to try this again without scheduling a maintenance window for the node add and decommissioning process.
>>
>> So to reiterate what I am asking: does adding a node cause the cluster to be unusable/time out? Also, can we expect the decommissioning of the other two nodes to cause the same type of downtime, since they have to stream their content out to the other nodes in the cluster?
>>
>> Thanks,
>> Thomas Miller
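On the decommission question at the end of the original mail: decommissioning streams the leaving node's ranges out to the remaining nodes, so it is subject to the same stream throughput limit. A rough sequence, one node at a time (the 100 Mb/s value is only a placeholder, and I would still schedule a quiet window for it):

    nodetool setstreamthroughput 100   # optional: lower the cap while clients are live
    nodetool decommission              # streams this node's data out, then leaves the ring
    nodetool netstats                  # from another shell, watch the outbound streams
    nodetool status                    # on the other nodes, confirm it has left the ring

Once the node no longer appears in the ring you can shut it down and do the hardware maintenance.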