Here is the stack trace from the failure. It looks like it's trying to create all the ColumnFamily metrics and going OOM. Is this just for the JMX metrics?
https://github.com/apache/cassandra/blob/cassandra-2.1.16/src/java/org/apache/cassandra/metrics/ColumnFamilyMetrics.java

ERROR [MessagingService-Incoming-/10.133.33.57] 2018-09-06 15:43:19,280 CassandraDaemon.java:231 - Exception in thread Thread[MessagingService-Incoming-/x.x.x.x,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.io.DataInputStream.<init>(DataInputStream.java:58) ~[na:1.8.0_151]
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:139) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.16.jar:2.1.16]
ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread Thread[InternalResponseStage:1,5,main]
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.metrics.ColumnFamilyMetrics$AllColumnFamilyMetricNameFactory.createMetricName(ColumnFamilyMetrics.java:784) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.metrics.ColumnFamilyMetrics.createColumnFamilyHistogram(ColumnFamilyMetrics.java:716) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.metrics.ColumnFamilyMetrics.<init>(ColumnFamilyMetrics.java:597) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:361) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:527) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:498) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:385) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:75) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:54) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.16.jar:2.1.16]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]

On Thu, Aug 30, 2018 at 12:51 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

> thank you
>
> On Thu, Aug 30, 2018 at 11:58 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> This is the closest JIRA that comes to mind (from memory, I didn't search, there may be others):
>> https://issues.apache.org/jira/browse/CASSANDRA-8150
>>
>> The best blog that's all in one place on tuning GC in Cassandra is actually Amy's 2.1 tuning guide:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html - it's somewhat out of date as it's for 2.1, but since that's what you're running, that works out in your favor.
>>
>> On Thu, Aug 30, 2018 at 10:53 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Is there any JIRA that explains why increasing the heap will help? Also, are there any alternatives to increasing the heap size? Last time I tried increasing the heap, the longer GC pauses caused more damage in terms of latency during the pauses.
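[For readers following along: on 2.1, heap sizing lives in conf/cassandra-env.sh. Below is a minimal sketch of what "increasing the heap" means concretely; the specific values are illustrative assumptions, not recommendations from the thread, and with CMS a larger heap does tend to mean longer pauses, as noted above.]

```shell
# Sketch only, not from the thread: the two heap knobs in the packaged
# conf/cassandra-env.sh for Cassandra 2.1. Values here are assumptions.
MAX_HEAP_SIZE="12G"   # up from the commonly cited 8G ceiling for CMS
HEAP_NEWSIZE="3G"     # CMS new generation; usually grown along with the max heap

echo "MAX_HEAP_SIZE=${MAX_HEAP_SIZE} HEAP_NEWSIZE=${HEAP_NEWSIZE}"
```

These are the same settings the tuning guide linked upthread discusses; the node must be restarted for them to take effect.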
>>>
>>> On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>
>>>> okay, thank you
>>>>
>>>> On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> You're seeing an OOM, not a socket error / timeout.
>>>>>
>>>>> --
>>>>> Jeff Jirsa
>>>>>
>>>>> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>
>>>>> Jeff,
>>>>>
>>>>> any idea if this is somehow related to https://issues.apache.org/jira/browse/CASSANDRA-11840?
>>>>> Does increasing streaming_socket_timeout_in_ms to a higher value help?
>>>>>
>>>>> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>
>>>>>> I have 72 nodes in the cluster, across 8 datacenters. The moment I try to grow the cluster above 84 nodes or so, the issue starts.
>>>>>>
>>>>>> I am still using the CMS collector, assuming it will do more harm than good to increase the heap size beyond the recommended 8 GB.
>>>>>>
>>>>>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>
>>>>>>> Given the size of your schema, you're probably getting flooded with a bunch of huge schema mutations as it hops into gossip and tries to pull the schema from every host it sees. You say 8 DCs, but you don't say how many nodes - I'm guessing it's a lot?
>>>>>>>
>>>>>>> This is something that's incrementally better in 3.0, but a real proper fix has been talked about a few times - https://issues.apache.org/jira/browse/CASSANDRA-11748 and https://issues.apache.org/jira/browse/CASSANDRA-13569, for example.
>>>>>>>
>>>>>>> In the short term, you may be able to work around this by increasing your heap size.
>>>>>>> If that doesn't work, there's an ugly, ugly hack that'll work on 2.1: limiting the number of schema blobs you can get at a time. In this case, that means firewalling off all but a few nodes in your cluster for 10-30 seconds, making sure the new node gets the schema (watch the logs or the file system for the tables to be created), then removing the firewall so it can start the bootstrap process (it needs the schema to set up the streaming plan, and it needs all the hosts up in gossip to stream successfully, so this is an ugly hack to give you time to get the schema and then heal the cluster so it can bootstrap).
>>>>>>>
>>>>>>> Yeah, that's awful. Hopefully one of the two JIRAs above lands to make this less awful.
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Jirsa
>>>>>>>
>>>>>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>>>
>>>>>>> It fails before bootstrap.
>>>>>>>
>>>>>>> Streaming throughput on the nodes is set to 400 Mb/s.
>>>>>>>
>>>>>>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Is the bootstrap plan succeeding (does streaming start, or does it crash before it logs messages about streaming starting)?
>>>>>>>>
>>>>>>>> Have you capped the stream throughput on the existing hosts?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Jirsa
>>>>>>>>
>>>>>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello All,
>>>>>>>>
>>>>>>>> We are seeing an issue when we add more nodes to the cluster: the new node's bootstrap is not able to stream the entire metadata, and it fails to bootstrap. Finally the process dies with an OOM (java.lang.OutOfMemoryError: Java heap space).
>>>>>>>>
>>>>>>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>>>>>>
>>>>>>>> Cassandra version: 2.1.16
>>>>>>>> # of KS and CF: 100 and 3000 (approx)
>>>>>>>> # of DCs: 8
>>>>>>>> # of vnodes per node: 256
>>>>>>>>
>>>>>>>> Not sure what is causing this behavior; has anyone come across this scenario?
>>>>>>>> Thanks in advance.
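[The firewall workaround Jeff describes upthread is easy to get wrong, so here is a dry-run sketch of it. Everything concrete is an assumption for illustration: the peer IPs are placeholders, iptables is assumed, and 7000 is Cassandra's default internode storage_port (7001 if internode encryption is on). The script only prints the commands so they can be reviewed; nothing is executed.]

```shell
# Dry-run sketch of the "block most peers while the schema arrives" hack
# described upthread. All IPs are placeholders; adjust STORAGE_PORT to match
# cassandra.yaml. Prints the iptables commands instead of running them.
STORAGE_PORT=7000
KEEP="10.0.0.1 10.0.0.2"            # the few peers left open to send schema
BLOCK="10.0.0.3 10.0.0.4 10.0.0.5"  # placeholders for all remaining peers

echo "# allowing schema pulls only from: $KEEP"
for ip in $BLOCK; do
  echo "iptables -A INPUT -p tcp -s $ip --dport $STORAGE_PORT -j DROP"
done
echo "# wait 10-30s; confirm the schema arrived (logs / table directories)"
for ip in $BLOCK; do
  echo "iptables -D INPUT -p tcp -s $ip --dport $STORAGE_PORT -j DROP"
done
```

Once the rules are removed, all hosts are reachable in gossip again, which is what the bootstrap streaming plan needs to proceed, per Jeff's explanation above.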