Given the size of your schema, the joining node is probably getting flooded with a bunch of huge schema mutations as it comes up in gossip and tries to pull the schema from every host it sees. You say 8 DCs, but you don't say how many nodes - I'm guessing it's a lot?
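If you want to put numbers on this, a quick sanity check from an existing node (standard nodetool/cqlsh invocations; assuming default ports, add auth flags if you need them):

    # live nodes in the cluster
    nodetool status | grep -c '^UN'

    # tables in the schema - 2.1 keeps the schema in the system keyspace
    cqlsh -e "SELECT count(*) FROM system.schema_columnfamilies;"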
This is something that's incrementally better in 3.0, but a real, proper fix has been talked about a few times - see https://issues.apache.org/jira/browse/CASSANDRA-11748 and https://issues.apache.org/jira/browse/CASSANDRA-13569, for example.

In the short term, you may be able to work around this by increasing the heap size on the joining node. If that doesn't work, there's an ugly, ugly hack that'll work on 2.1: limit the number of schema blobs the node can pull at a time. In this case, that means firewalling off all but a few nodes in your cluster for 10-30 seconds, making sure the new node gets the schema (watch the logs or the filesystem for the tables to be created), then removing the firewall so it can start the bootstrap process. It needs the schema to set up the streaming plan, and it needs all the hosts up in gossip to stream successfully, so the hack just buys you time to get the schema and then heal the cluster so it can bootstrap.

Yeah, that's awful. Hopefully one of the two JIRAs above lands to make this less awful. Rough sketches of both workarounds are at the bottom of this mail, below the quoted thread.

--
Jeff Jirsa


> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada
> <jaibheem...@gmail.com> wrote:
>
> It fails before bootstrap.
>
> Streaming throughput on the nodes is set to 400 Mb/s.
>
>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>> Is the bootstrap plan succeeding (does streaming start, or does it crash
>> before it logs messages about streaming starting)?
>>
>> Have you capped the stream throughput on the existing hosts?
>>
>> --
>> Jeff Jirsa
>>
>>
>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada
>>> <jaibheem...@gmail.com> wrote:
>>>
>>> Hello All,
>>>
>>> We are seeing an issue when we add more nodes to the cluster: the new
>>> node is not able to stream the entire metadata and fails to bootstrap.
>>> Eventually the process dies with an OOM (java.lang.OutOfMemoryError:
>>> Java heap space).
>>>
>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>
>>> Cassandra version: 2.1.16
>>> # of KS and CF: 100 and 3000 (approx.)
>>> # of DCs: 8
>>> # of vnodes per node: 256
>>>
>>> Not sure what is causing this behavior - has anyone come across this
>>> scenario? Thanks in advance.
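P.S. Rough sketches of both workarounds, since the firewall one is fiddly. These are illustrations, not exact recipes: the heap sizes are made-up examples, the seed addresses are placeholders, and I'm assuming iptables and the default storage/gossip port of 7000.

Heap bump, in cassandra-env.sh on the joining node:

    # example values only - size these for your hardware
    MAX_HEAP_SIZE="16G"
    HEAP_NEWSIZE="1600M"

The firewall hack, run on the joining node before starting Cassandra:

    # let the new node talk gossip/schema to just two existing nodes
    # (placeholder addresses), drop everything else on the storage port
    iptables -A OUTPUT -p tcp --dport 7000 -d 10.0.0.1 -j ACCEPT
    iptables -A OUTPUT -p tcp --dport 7000 -d 10.0.0.2 -j ACCEPT
    iptables -A OUTPUT -p tcp --dport 7000 -j DROP

    # start cassandra, wait until the schema tables appear on disk / in
    # the logs, then remove the rules so every host is visible again and
    # the bootstrap streaming plan can be built
    iptables -D OUTPUT -p tcp --dport 7000 -j DROP
    iptables -D OUTPUT -p tcp --dport 7000 -d 10.0.0.1 -j ACCEPT
    iptables -D OUTPUT -p tcp --dport 7000 -d 10.0.0.2 -j ACCEPT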