Here is the stack trace from the failure. It looks like it's trying to
gather all the column family metrics and going OOM. Is this just for the JMX
metrics?

https://github.com/apache/cassandra/blob/cassandra-2.1.16/src/java/org/apache/cassandra/metrics/ColumnFamilyMetrics.java

ERROR [MessagingService-Incoming-/10.133.33.57] 2018-09-06 15:43:19,280 CassandraDaemon.java:231 - Exception in thread Thread[MessagingService-Incoming-/x.x.x.x,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.io.DataInputStream.<init>(DataInputStream.java:58) ~[na:1.8.0_151]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:139) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.16.jar:2.1.16]
ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread Thread[InternalResponseStage:1,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.metrics.ColumnFamilyMetrics$AllColumnFamilyMetricNameFactory.createMetricName(ColumnFamilyMetrics.java:784) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.metrics.ColumnFamilyMetrics.createColumnFamilyHistogram(ColumnFamilyMetrics.java:716) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.metrics.ColumnFamilyMetrics.<init>(ColumnFamilyMetrics.java:597) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:361) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:527) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:498) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:385) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:75) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:54) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]

On Thu, Aug 30, 2018 at 12:51 PM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> thank you
>
> On Thu, Aug 30, 2018 at 11:58 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> This is the closest JIRA that comes to mind (from memory, I didn't
>> search, there may be others):
>> https://issues.apache.org/jira/browse/CASSANDRA-8150
>>
>> The best blog that's all in one place on tuning GC in cassandra is
>> actually Amy's 2.1 tuning guide:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html - it's
>> somewhat out of date as it's for 2.1, but since that's what you're running,
>> that works out in your favor.
>>
>>
>>
>>
>>
>> On Thu, Aug 30, 2018 at 10:53 AM Jai Bheemsen Rao Dhanwada <
>> jaibheem...@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Is there any JIRA that discusses whether increasing the heap will help?
>>> Also, are there any alternatives to increasing the heap size? The last
>>> time I tried increasing the heap, the longer GC pauses caused more damage
>>> in terms of latency.
>>>
>>> On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <
>>> jaibheem...@gmail.com> wrote:
>>>
>>>> okay, thank you
>>>>
>>>> On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> You’re seeing an OOM, not a socket error / timeout.
>>>>>
>>>>> --
>>>>> Jeff Jirsa
>>>>>
>>>>>
>>>>> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <
>>>>> jaibheem...@gmail.com> wrote:
>>>>>
>>>>> Jeff,
>>>>>
>>>>> Any idea if this is somehow related to:
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11840?
>>>>> Would increasing streaming_socket_timeout_in_ms to a
>>>>> higher value help?
>>>>>
>>>>> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <
>>>>> jaibheem...@gmail.com> wrote:
>>>>>
>>>>>> I have 72 nodes in the cluster, across 8 datacenters. The moment I
>>>>>> try to increase the node count above 84 or so, the issue starts.
>>>>>>
>>>>>> I am still using CMS, and I assume that increasing the heap size
>>>>>> beyond the recommended 8G will do more harm than good.
>>>>>>
>>>>>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>
>>>>>>> Given the size of your schema, you’re probably getting flooded with
>>>>>>> a bunch of huge schema mutations as it hops into gossip and tries to
>>>>>>> pull the schema from every host it sees. You say 8 DCs but you don’t
>>>>>>> say how many nodes - I’m guessing it’s a lot?
>>>>>>>
>>>>>>> This is something that’s incrementally better in 3.0, but a real
>>>>>>> proper fix has been talked about a few times  -
>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11748 and
>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13569 for example
>>>>>>>
>>>>>>> In the short term, you may be able to work around this by increasing
>>>>>>> your heap size. If that doesn’t work, there’s an ugly ugly hack that’ll
>>>>>>> work on 2.1: limiting the number of schema blobs you can get at a time.
>>>>>>> In this case, that means firewall off all but a few nodes in your
>>>>>>> cluster for 10-30 seconds, make sure it gets the schema (watch the logs
>>>>>>> or file system for the tables to be created), then remove the firewall
>>>>>>> so it can start the bootstrap process (it needs the schema to set up the
>>>>>>> streaming plan, and it needs all the hosts up in gossip to stream
>>>>>>> successfully, so this is an ugly hack to give you time to get the schema
>>>>>>> and then heal the cluster so it can bootstrap).
>>>>>>>
>>>>>>> Yea that’s awful. Hopefully either of the two above JIRAs lands to
>>>>>>> make this less awful.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Jirsa
>>>>>>>
>>>>>>>
>>>>>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <
>>>>>>> jaibheem...@gmail.com> wrote:
>>>>>>>
>>>>>>> It fails before bootstrap
>>>>>>>
>>>>>>> Streaming throughput on the nodes is set to 400 Mb/s.
>>>>>>>
>>>>>>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Is the bootstrap plan succeeding (does streaming start or does it
>>>>>>>> crash before it logs messages about streaming starting)?
>>>>>>>>
>>>>>>>> Have you capped the stream throughput on the existing hosts?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Jirsa
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <
>>>>>>>> jaibheem...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello All,
>>>>>>>>
>>>>>>>> We are seeing an issue when we add more nodes to the cluster: the
>>>>>>>> new node is not able to stream the entire metadata and fails to
>>>>>>>> bootstrap. Eventually the process dies with an OOM
>>>>>>>> (java.lang.OutOfMemoryError: Java heap space).
>>>>>>>>
>>>>>>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>>>>>>
>>>>>>>> Cassandra Version: 2.1.16
>>>>>>>> # of KS and CF : 100, 3000 (approx)
>>>>>>>> # of DC: 8
>>>>>>>> # of Vnodes per node: 256
>>>>>>>>
>>>>>>>> I am not sure what is causing this behavior. Has anyone come across
>>>>>>>> this scenario?
>>>>>>>> Thanks in advance.
>>>>>>>>
>>>>>>>>
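
The firewall workaround described in the thread (let the joining node see only
a few peers for 10-30 seconds, wait for the schema, then open it back up) could
be scripted roughly as below. This is only a sketch and has not been run against
a cluster. It assumes iptables on the joining node, Cassandra's default
storage_port of 7000, and that dropping inbound inter-node traffic is enough to
limit the schema pulls. The function names and the peer IPs are hypothetical.
It only builds and prints the commands; it does not execute them.

```python
from typing import Iterable, List

# Assumption: default inter-node storage_port; adjust if cassandra.yaml differs.
STORAGE_PORT = 7000


def block_rules(peers: Iterable[str], allowed: Iterable[str],
                port: int = STORAGE_PORT) -> List[str]:
    """Build iptables commands that drop inter-node traffic from every
    peer that is not in `allowed` (hypothetical helper, dry-run only)."""
    keep = set(allowed)
    return [
        f"iptables -I INPUT -p tcp -s {ip} --dport {port} -j DROP"
        for ip in peers
        if ip not in keep
    ]


def unblock_rules(peers: Iterable[str], allowed: Iterable[str],
                  port: int = STORAGE_PORT) -> List[str]:
    """Build the matching delete commands, to be applied once the
    schema tables have shown up on the joining node."""
    return [
        cmd.replace("iptables -I", "iptables -D", 1)
        for cmd in block_rules(peers, allowed, port)
    ]


if __name__ == "__main__":
    # Hypothetical peer list; in practice this would come from the seed
    # list or from `nodetool status` on an existing node.
    peers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    allowed = ["10.0.0.1", "10.0.0.2"]

    print("# before starting Cassandra on the new node:")
    print("\n".join(block_rules(peers, allowed)))
    print("# after the schema has arrived (watch the logs or data dir):")
    print("\n".join(unblock_rules(peers, allowed)))
```

Printing the commands instead of running them keeps the sketch safe to try on
any machine; the output can be reviewed and then pasted into a root shell on
the joining node.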
