okay, thank you

On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jji...@gmail.com> wrote:
> You’re seeing an OOM, not a socket error / timeout.
>
> --
> Jeff Jirsa
>
>
> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>
> Jeff,
>
> any idea if this is somehow related to
> https://issues.apache.org/jira/browse/CASSANDRA-11840?
> does increasing the value of streaming_socket_timeout_in_ms to a higher
> value help?
>
> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>
>> I have 72 nodes in the cluster, across 8 datacenters. The moment I try
>> to grow the cluster above 84 nodes or so, the issue starts.
>>
>> I am still using the CMS heap, on the assumption that increasing the
>> heap size beyond the recommended 8 GB would do more harm than good.
>>
>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Given the size of your schema, you’re probably getting flooded with a
>>> bunch of huge schema mutations as the new node hops into gossip and tries
>>> to pull the schema from every host it sees. You say 8 DCs but you don’t
>>> say how many nodes - I’m guessing it’s a lot?
>>>
>>> This is something that’s incrementally better in 3.0, but a real proper
>>> fix has been talked about a few times - see
>>> https://issues.apache.org/jira/browse/CASSANDRA-11748 and
>>> https://issues.apache.org/jira/browse/CASSANDRA-13569 for example.
>>>
>>> In the short term, you may be able to work around this by increasing
>>> your heap size. If that doesn’t work, there’s an ugly, ugly hack that’ll
>>> work on 2.1: limit the number of schema blobs the node can get at a time -
>>> in this case, that means firewall off all but a few nodes in your cluster
>>> for 10-30 seconds, make sure the new node gets the schema (watch the logs
>>> or the file system for the tables to be created), then remove the firewall
>>> so it can start the bootstrap process. (It needs the schema to set up the
>>> streaming plan, and it needs all the hosts up in gossip to stream
>>> successfully, so this is an ugly hack to give it time to get the schema
>>> and then heal the cluster so it can bootstrap.)
>>>
>>> Yea, that’s awful. Hopefully one of the two JIRAs above lands to make
>>> this less awful.
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>
>>> It fails before bootstrap.
>>>
>>> Streaming throughput on the nodes is set to 400 Mb/s.
>>>
>>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Is the bootstrap plan succeeding (does streaming start, or does it
>>>> crash before it logs messages about streaming starting)?
>>>>
>>>> Have you capped the stream throughput on the existing hosts?
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com> wrote:
>>>>
>>>> Hello All,
>>>>
>>>> We are seeing an issue when we add more nodes to the cluster: the new
>>>> node is not able to stream the entire metadata and fails to bootstrap.
>>>> Finally, the process dies with an OOM (java.lang.OutOfMemoryError:
>>>> Java heap space).
>>>>
>>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>>
>>>> Cassandra version: 2.1.16
>>>> # of KS and CF: 100 and 3,000 (approx.)
>>>> # of DCs: 8
>>>> # of vnodes per node: 256
>>>>
>>>> Not sure what is causing this behavior - has anyone come across this
>>>> scenario? Thanks in advance.
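
For anyone who lands on this thread later, a minimal sketch of the firewall
hack Jeff describes above, assuming iptables and inter-node traffic on the
default storage_port 7000 (7001 when internode encryption is on); the
allowlist IPs are placeholders, not real cluster addresses. Run it on the
joining node before starting Cassandra:

    #!/bin/sh
    # Sketch of the "firewall off all but a few nodes" hack for 2.1.
    # Assumptions: iptables, TCP 7000 as storage_port, placeholder peer
    # IPs -- adjust all of these for your own cluster.

    ALLOWED="10.0.0.1 10.0.0.2 10.0.0.3"   # a few existing nodes to pull schema from

    # Accept gossip/streaming from the allowlisted peers (rules match in order).
    for ip in $ALLOWED; do
        iptables -A INPUT -p tcp -s "$ip" --dport 7000 -j ACCEPT
    done

    # Drop inter-node traffic from everyone else.
    iptables -A INPUT -p tcp --dport 7000 -j DROP

    # Start Cassandra, then watch system.log / the data directory until
    # the keyspaces and tables appear (i.e. the schema has been pulled),
    # then remove the DROP rule so all hosts are visible in gossip again
    # and the bootstrap streaming plan can proceed:
    # iptables -D INPUT -p tcp --dport 7000 -j DROP

Per Jeff's shorter-term suggestion, the heap can also be raised on just the
joining node (MAX_HEAP_SIZE in cassandra-env.sh) and reverted after
bootstrap; with roughly 3,000 tables, the schema mutations alone appear to
be overwhelming the 8 GB CMS heap here.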