What version are you running?

On Fri, Apr 24, 2015 at 12:51 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
> Jeff,
>
> Thanks for the response. I had come across that as a possible solution
> previously, but there are discrepancies that lead me to think that it is
> not the issue.
>
> It appears our stream throughput is currently set to 200Mbps, but unless
> the Cassandra service shares that same throughput limitation to serve its
> data also, it does not seem like 200Mbps bandwidth usage would overwhelm
> the nodes. The 200Mbps bandwidth usage is only on two of the four nodes
> when adding the new node. It seems like the other two nodes should still
> be able to handle requests. When my backups run at night they hit around
> 300Mbps bandwidth usage and we have no timeouts at all.
>
> Then there is the question of why, when we stopped the Cassandra service
> on the joining node, the timeouts did not stop. Opscenter did not show
> that node anymore and “nodetool status” verified that. We were thinking
> that maybe gossip caused the existing nodes to think that there was still
> a node joining, even though the new node was shut down and not actually
> joining, but that is not confirmed.
>
> Thanks,
>
> Thomas Miller
>
> *From:* Jeff Ferland [mailto:j...@tubularlabs.com]
> *Sent:* Thursday, April 23, 2015 2:46 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Adding New Node Issue
>
> Sounds to me like your stream throughput value is too high. `nodetool
> getstreamthroughput` shows the current value and `nodetool
> setstreamthroughput` updates it live. Limit it to something lower so that
> the system isn’t overloaded by streaming. The bottleneck that slows things
> down is most likely to be disk or network.
>
> On Apr 23, 2015, at 11:18 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
>
> Hello,
>
> Yesterday we ran into a serious issue while joining a new node to our
> existing 4 node Cassandra cluster (version 2.0.7). The average node data
> size is 152GB with a replication factor of 3.
> The node was prepped just as the following document describes:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html
>
> When I started the new node, Opscenter showed the node as “Active –
> Joining” but we immediately began getting timeouts on our websites because
> lookups were taking too long. On the 4 existing nodes the network interface
> showed about 200Mbps being used, the CPU never went over 20% and the memory
> usage barely changed.
>
> The question I have is, does adding a new node cause some sort of
> throttling that would keep our web servers from functioning as normal? The
> only thing that we can think of that might have had some effect was that a
> repair was just finishing on one of the nodes when the new node was added.
> The repair ended up finishing while the new node was in the joining state
> but the timeouts did not go away afterwards.
>
> Our impatience got the better of us so we ended up stopping the Cassandra
> service on the new node because it appeared, at the time, to have stalled
> out in the joining state and nothing more was being streamed to it. But
> even stopping it did not allow the cluster to resume its normal operation
> and we were still getting timeouts. We tried rebooting our web servers and
> then our 4 existing Cassandra servers but none of it worked.
>
> We never saw any errors/exceptions in the Cassandra and system logs at
> all. It completely mystified us why there would be no errors/exceptions
> unless this was working as intended.
>
> We ended up getting it working by adding the new node again and just
> letting it go until it finally finished joining, and everything magically
> started working again. Towards the end it was barely streaming anything:
> Opscenter was not showing any running streams, and when we checked the
> size of the data directory we saw it growing and shrinking ever so
> slightly.
>
> We have to add one more new node and then decommission two of the existing
> nodes so we can perform some hardware maintenance on the server those two
> existing nodes are on, but we are hesitant to try this again without
> scheduling a maintenance window for this node add and decommissioning
> process.
>
> So to reiterate what I am asking, does adding a node cause the cluster to
> become unusable or time out? Also, can we expect the decommissioning of
> the other two nodes to cause the same type of downtime, since they have to
> stream their content out to the other nodes in the cluster?
>
> Thanks,
>
> Thomas Miller
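
For anyone reading this in the archive, here is a minimal sketch of the throughput check Jeff describes. The `lower_stream_cap` helper name and the 50 Mbps value are illustrative assumptions, not recommended settings. `nodetool setstreamthroughput` takes a value in megabits per second and only affects the node it is run on.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cassandra): lower the outbound streaming
# cap on the local node and print the value before and after the change.
# NEW_CAP is in megabits per second; 50 is an arbitrary example value.
NEW_CAP="${NEW_CAP:-50}"

lower_stream_cap() {
    # Show the current cap before touching it.
    nodetool getstreamthroughput || return 1
    # Apply the new cap live, without restarting the node.
    nodetool setstreamthroughput "$1" || return 1
    # Read it back to confirm the change.
    nodetool getstreamthroughput
}

# Only run when nodetool is actually available on this machine.
if command -v nodetool >/dev/null 2>&1; then
    lower_stream_cap "$NEW_CAP"
    # List active streams to see whether the join is still making progress.
    nodetool netstats
fi
```

This would need to be run on each existing node that is streaming to the new one. A value set this way is not written to cassandra.yaml, so it should be re-checked after a restart.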