Did anyone run into similar issues?

On Thu, Sep 6, 2018 at 10:27 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
Here is the stack trace from the failure; it looks like it's trying to gather all the column family metrics and going OOM. Is this just for the JMX metrics?

https://github.com/apache/cassandra/blob/cassandra-2.1.16/src/java/org/apache/cassandra/metrics/ColumnFamilyMetrics.java

ERROR [MessagingService-Incoming-/10.133.33.57] 2018-09-06 15:43:19,280 CassandraDaemon.java:231 - Exception in thread Thread[MessagingService-Incoming-/x.x.x.x,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.io.DataInputStream.<init>(DataInputStream.java:58) ~[na:1.8.0_151]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:139) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.16.jar:2.1.16]
ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread Thread[InternalResponseStage:1,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.metrics.ColumnFamilyMetrics$AllColumnFamilyMetricNameFactory.createMetricName(ColumnFamilyMetrics.java:784) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.metrics.ColumnFamilyMetrics.createColumnFamilyHistogram(ColumnFamilyMetrics.java:716) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.metrics.ColumnFamilyMetrics.<init>(ColumnFamilyMetrics.java:597) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:361) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:527) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:498) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:385) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:75) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:54) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.16.jar:2.1.16]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]

On Thu, Aug 30, 2018 at 12:51 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

thank you

On Thu, Aug 30, 2018 at 11:58 AM Jeff Jirsa <jji...@gmail.com> wrote:

This is the closest JIRA that comes to mind (from memory, I didn't search, there may be others):
https://issues.apache.org/jira/browse/CASSANDRA-8150

The best blog that's all in one place on tuning GC in Cassandra is actually Amy's 2.1 tuning guide: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html - it's somewhat out of date as it's for 2.1, but since that's what you're running, that works out in your favor.
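For reference, the heap on 2.1 is normally set in conf/cassandra-env.sh. A minimal sketch, assuming the stock packaged script and CMS; the sizes below are placeholders to illustrate the knobs, not a recommendation:

    # conf/cassandra-env.sh (2.1) - values are placeholders, tune for your hardware
    MAX_HEAP_SIZE="12G"    # total heap; if left unset the script auto-sizes from RAM
    HEAP_NEWSIZE="2G"      # young gen; the script's own guidance is roughly 100 MB per core for CMS
    # The stock 2.1 script already sets CMS options along these lines:
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

The trade-off, as raised elsewhere in this thread, is that a larger CMS heap tends to pause longer when it does collect.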
On Thu, Aug 30, 2018 at 10:53 AM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

Hi Jeff,

Is there a JIRA that suggests increasing the heap will help? Also, are there any alternatives other than increasing the heap size? Last time I tried increasing the heap, the longer GC pauses caused more damage in terms of latency during each pause.

On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

okay, thank you

On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jji...@gmail.com> wrote:

You're seeing an OOM, not a socket error / timeout.

--
Jeff Jirsa

On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

Jeff,

Any idea if this is somehow related to https://issues.apache.org/jira/browse/CASSANDRA-11840? Does increasing the value of streaming_socket_timeout_in_ms help?

On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

I have 72 nodes in the cluster, across 8 datacenters. The moment I try to grow the cluster above 84 nodes or so, the issue starts.

I am still using the CMS heap settings, assuming it will create more harm if I increase the heap size beyond the recommended 8G.

On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:

Given the size of your schema, you're probably getting flooded with a bunch of huge schema mutations as the new node hops into gossip and tries to pull the schema from every host it sees. You say 8 DCs but you don't say how many nodes - I'm guessing it's a lot?

This is something that's incrementally better in 3.0, but a real proper fix has been talked about a few times - https://issues.apache.org/jira/browse/CASSANDRA-11748 and https://issues.apache.org/jira/browse/CASSANDRA-13569 for example.

In the short term, you may be able to work around this by increasing your heap size. If that doesn't work, there's an ugly, ugly hack that'll work on 2.1: limiting the number of schema blobs you can get at a time - in this case, that means firewalling off all but a few nodes in your cluster for 10-30 seconds, making sure it gets the schema (watch the logs or the file system for the tables to be created), then removing the firewall so it can start the bootstrap process. (It needs the schema to set up the streaming plan, and it needs all the hosts up in gossip to stream successfully, so this is an ugly hack to give you time to get the schema and then heal the cluster so it can bootstrap.)

Yea, that's awful. Hopefully either of the two JIRAs above lands to make this less awful.

--
Jeff Jirsa
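The firewall hack described above might look roughly like this on the joining node, assuming iptables is available; the IP addresses are placeholders for "every existing node except the couple you want to pull schema from". Treat it as a sketch, not a recipe:

    # Before starting Cassandra on the joining node: block inter-node traffic
    # from/to all existing nodes except a few chosen schema sources.
    for ip in 10.0.0.4 10.0.0.5 10.0.0.6; do    # placeholder IPs: everything EXCEPT the chosen few
        iptables -A INPUT  -p tcp -s "$ip" -j DROP
        iptables -A OUTPUT -p tcp -d "$ip" -j DROP
    done

    # Start Cassandra, watch system.log / the data directories until the
    # keyspaces and tables appear, then open things back up so every host is
    # visible in gossip and the bootstrap streaming plan can proceed:
    iptables -F INPUT && iptables -F OUTPUT    # or delete only the rules added above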
On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

It fails before bootstrap.

Streaming throughput on the nodes is set to 400 Mb/s.

On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:

Is the bootstrap plan succeeding (does streaming start, or does it crash before it logs messages about streaming starting)?

Have you capped the stream throughput on the existing hosts?

--
Jeff Jirsa

On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

Hello All,

We are seeing an issue when we add more nodes to the cluster: the new node's bootstrap is not able to stream the entire metadata and fails. Finally the process dies with an OOM (java.lang.OutOfMemoryError: Java heap space).

But if I remove a few nodes from the cluster, we don't see this issue.

Cassandra Version: 2.1.16
# of KS and CF: 100, 3000 (approx)
# of DCs: 8
# of vnodes per node: 256

Not sure what is causing this behavior; has anyone come across this scenario? Thanks in advance.
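For anyone comparing notes on the same problem, a quick way to sanity-check the table count, schema agreement, and whether streaming ever started is sketched below; hostnames, credentials, and paths are assumptions to adapt, and the system table name is the 2.1-era one:

    # Run against an existing node while the new node is trying to join
    nodetool describecluster      # schema versions should converge to a single version
    cqlsh -e "SELECT count(*) FROM system.schema_columnfamilies;"    # rough table count
    nodetool netstats             # on the joiner: shows whether bootstrap streaming began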