Thomas, just in case you missed it: there is a bug with the stream throughput setting prior to 2.0.13, here is the link: https://issues.apache.org/jira/browse/CASSANDRA-8852
So it may happen that you are actually setting it to 1600 megabytes.

Andrei
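P.S. Checking and adjusting the live value looks roughly like this (the 100 Mb/s figure below is only an example; pick something that suits your hardware, and given the bug above, double-check what the node reports back after you set it):

    nodetool getstreamthroughput        # show the current streaming limit in Mb/s
    nodetool setstreamthroughput 100    # lower the live cap; takes effect immediately

nodetool only changes the running process, so to make the setting permanent also set stream_throughput_outbound_megabits_per_sec in cassandra.yaml.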
On Thu, Apr 23, 2015 at 11:22 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> What version are you running?
>
> On Fri, Apr 24, 2015 at 12:51 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
>
>> Jeff,
>>
>> Thanks for the response. I had come across that as a possible solution previously, but there are discrepancies that lead me to think that is not the issue.
>>
>> It appears our stream throughput is currently set to 200Mbps, but unless the Cassandra service shares that same throughput limitation to serve its data as well, it does not seem like 200Mbps of bandwidth usage would overwhelm the nodes. The 200Mbps of bandwidth usage is only on two of the four nodes when adding the new node; it seems like the other two nodes should still be able to handle requests. When my backups run at night they hit around 300Mbps of bandwidth usage and we have no timeouts at all.
>>
>> Then there is the question of why, when we stopped the Cassandra service on the joining node, the timeouts did not stop. Opscenter did not show that node anymore and “nodetool status” verified that. We were thinking that maybe gossip caused the existing nodes to think that there was still a node joining, even though the new node was shut down and not actually joining, but that is not confirmed.
>>
>> Thanks,
>> Thomas Miller
>>
>> *From:* Jeff Ferland [mailto:j...@tubularlabs.com]
>> *Sent:* Thursday, April 23, 2015 2:46 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Adding New Node Issue
>>
>> Sounds to me like your stream throughput value is too high. `nodetool getstreamthroughput` and `nodetool setstreamthroughput` will update this value live. Limit it to something lower so that the system isn’t overloaded by streaming. The bottleneck that slows things down is most likely to be disk or network.
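To see whether streaming is actually what is saturating the nodes while the new one is joining, something like this is usually enough (assuming ordinary Linux tooling is available on the nodes):

    nodetool netstats    # lists the active streaming sessions and their progress
    iostat -x 5          # rough view of disk utilization on each node

If streams are running but neither the disks nor the network look busy, the throughput cap above is the first thing I would check.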
>> On Apr 23, 2015, at 11:18 AM, Thomas Miller <thomas.mil...@wda.com> wrote:
>>
>> Hello,
>>
>> Yesterday we ran into a serious issue while joining a new node to our existing 4-node Cassandra cluster (version 2.0.7). The average node data size is 152 GB with a replication factor of 3. The node was prepped just as the following document describes - http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html.
>>
>> When I started the new node, Opscenter showed the node as “Active – Joining”, but we immediately began getting timeouts on our websites because lookups were taking too long. On the 4 existing nodes the network interface showed about 200Mbps being used, the CPU never went over 20%, and the memory usage barely changed.
>>
>> The question I have is: does adding a new node cause some sort of throttling that would prevent our webservers from functioning as normal? The only thing we can think of that might have had some effect was that a repair was just finishing on one of the nodes when the new node was added. The repair ended up finishing while the new node was in the joining state, but the timeouts did not go away afterwards.
>>
>> Our impatience got the better of us, so we ended up stopping the Cassandra service on the new node because it appeared, at the time, to have stalled out in the joining state and nothing more was being streamed to it. But even stopping it did not allow the cluster to resume its normal operation and we were still getting timeouts. We tried rebooting our web servers and then our 4 existing Cassandra servers, but none of it worked.
>>
>> We never saw any errors/exceptions in the Cassandra and system logs at all. It completely mystified us why there would be no errors/exceptions unless this was working as intended.
>>
>> We ended up getting it working by adding the new node again and just letting it go until it finally finished joining, and everything magically started working again. We noticed towards the end that it was barely streaming anything (Opscenter was not showing any running streams by then); checking the size of the data directory, we saw it growing and shrinking ever so slightly.
>>
>> We have to add one more new node and then decommission two of the existing nodes so we can perform some hardware maintenance on the server those two existing nodes are on, but we are hesitant to try this again without scheduling a maintenance window for the node add and decommissioning process.
>>
>> So to reiterate what I am asking: does adding a node cause the cluster to be unusable/time out? Also, can we expect the decommissioning of the other two nodes to cause the same type of downtime, since they have to stream their content out to the other nodes in the cluster?
>>
>> Thanks,
>> Thomas Miller
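On the decommission question at the end of the original mail: decommissioning streams the leaving node's ranges out to the remaining nodes, so it is subject to the same stream throughput limit. A rough sequence, one node at a time (the 100 Mb/s value is only a placeholder, and I would still schedule a quiet window for it):

    nodetool setstreamthroughput 100   # optional: lower the cap while clients are live
    nodetool decommission              # streams this node's data out, then leaves the ring
    nodetool netstats                  # from another shell, watch the outbound streams
    nodetool status                    # on the other nodes, confirm it has left the ring

Once the node no longer appears in the ring you can shut it down and do the hardware maintenance.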