Hi Jeff,

Is there a JIRA that explains how increasing the heap will help here? Also, are there any alternatives to increasing the heap size? The last time I tried increasing the heap, the longer GC pauses did more damage in terms of latency during each pause.
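(For context, a minimal sketch of the heap/GC settings being discussed, assuming the stock 2.1 cassandra-env.sh where these variables exist; the values are illustrative only, not a recommendation:)

    # cassandra-env.sh -- illustrative values only.
    # The stock script auto-sizes the heap unless both of these are set (they
    # have to be set as a pair):
    MAX_HEAP_SIZE="12G"
    HEAP_NEWSIZE="2G"    # mostly matters for CMS; a larger new gen generally means longer ParNew pauses

    # If the heap has to grow well past ~8G, G1 is the usual alternative to CMS
    # on large heaps, since it targets a pause-time goal instead of the long
    # CMS/ParNew pauses. If switching, remove the CMS flags set later in this
    # file, e.g.:
    # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
    # JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"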
On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:

> okay, thank you
>
> On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> You're seeing an OOM, not a socket error / timeout.
>>
>> --
>> Jeff Jirsa
>>
>> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>
>> Jeff,
>>
>> Any idea if this is somehow related to https://issues.apache.org/jira/browse/CASSANDRA-11840? Does increasing streaming_socket_timeout_in_ms to a higher value help?
>>
>> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>
>>> I have 72 nodes in the cluster, across 8 datacenters. The moment I try to grow the cluster above 84 nodes or so, the issue starts.
>>>
>>> I am still using the CMS heap, assuming it will do more harm than good if I increase the heap size beyond the recommended 8G.
>>>
>>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Given the size of your schema, you're probably getting flooded with a bunch of huge schema mutations as the new node hops into gossip and tries to pull the schema from every host it sees. You say 8 DCs, but you don't say how many nodes - I'm guessing it's a lot?
>>>>
>>>> This is something that's incrementally better in 3.0, but a real, proper fix has been talked about a few times - https://issues.apache.org/jira/browse/CASSANDRA-11748 and https://issues.apache.org/jira/browse/CASSANDRA-13569, for example.
>>>>
>>>> In the short term, you may be able to work around this by increasing your heap size. If that doesn't work, there's an ugly, ugly hack that'll work on 2.1: limiting the number of schema blobs you can get at a time. In this case, that means firewalling off all but a few nodes in your cluster for 10-30 seconds, making sure the new node gets the schema (watch the logs or the file system for the tables to be created), then removing the firewall so it can start the bootstrap process. It needs the schema to set up the streaming plan, and it needs all the hosts up in gossip to stream successfully, so this is an ugly hack to give you time to get the schema and then heal the cluster so it can bootstrap.
>>>>
>>>> Yea, that's awful. Hopefully one of the two JIRAs above lands to make this less awful.
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>
>>>> It fails before bootstrap.
>>>>
>>>> Streaming throughput on the nodes is set to 400 Mb/s.
>>>>
>>>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> Is the bootstrap plan succeeding (does streaming start, or does it crash before it logs messages about streaming starting)?
>>>>>
>>>>> Have you capped the stream throughput on the existing hosts?
>>>>>
>>>>> --
>>>>> Jeff Jirsa
>>>>>
>>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>>
>>>>> Hello All,
>>>>>
>>>>> We are seeing an issue when we add more nodes to the cluster: the new node is not able to stream the entire metadata and fails to bootstrap. Finally the process dies with an OOM (java.lang.OutOfMemoryError: Java heap space).
>>>>>
>>>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>>>
>>>>> Cassandra Version: 2.1.16
>>>>> # of KS and CF: 100, 3000 (approx)
>>>>> # of DCs: 8
>>>>> # of Vnodes per node: 256
>>>>>
>>>>> Not sure what is causing this behavior - has anyone come across this scenario? Thanks in advance.
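For anyone who wants to try the firewall workaround Jeff describes above, a rough sketch of one way it might look, assuming internode traffic on the default storage_port 7000 (7001 with internode encryption), plain iptables on the joining node, and placeholder peer IPs 10.0.0.1 / 10.0.0.2 (at least one of which should probably be a seed):

    # On the joining node, before starting Cassandra: allow internode traffic
    # to/from a couple of existing nodes only, and drop it for every other
    # host in both directions, so the new node pulls the schema from just
    # those few peers.
    for ip in 10.0.0.1 10.0.0.2; do
      iptables -A INPUT  -p tcp --dport 7000 -s "$ip" -j ACCEPT
      iptables -A OUTPUT -p tcp --dport 7000 -d "$ip" -j ACCEPT
    done
    iptables -A INPUT  -p tcp --dport 7000 -j DROP
    iptables -A OUTPUT -p tcp --dport 7000 -j DROP

    # Start Cassandra and watch the logs / data directory until the keyspace
    # and table directories appear (i.e. the schema has been pulled), then
    # delete the rules so every host is visible in gossip and the bootstrap
    # streaming plan can run.
    for ip in 10.0.0.1 10.0.0.2; do
      iptables -D INPUT  -p tcp --dport 7000 -s "$ip" -j ACCEPT
      iptables -D OUTPUT -p tcp --dport 7000 -d "$ip" -j ACCEPT
    done
    iptables -D INPUT  -p tcp --dport 7000 -j DROP
    iptables -D OUTPUT -p tcp --dport 7000 -j DROP

The window only needs to last until the schema shows up (10-30 seconds in Jeff's description), so it helps to have the rule-removal commands ready before starting the node.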