Hello,

More than 30 Cassandra servers in the primary DC went down with the OOM exception below. What puzzles me is the scale at which it happened (all within the same minute). I will share some more details below.

System Log: http://pastebin.com/iPeYrWVR
GC Log: http://pastebin.com/CzNNGs0r

During the OOM I saw a lot of WARNings like the one below (these had been appearing for quite some time, maybe weeks):

WARN [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252 - Batch of prepared statements for [keyspace.table] is of size 225455, exceeding specified threshold of 65536 by 159919.

Environment:
We are using Apache Cassandra 2.1.9 on a multi-DC cluster: a primary DC (more C* nodes, on SSDs, and the apps run here) and a secondary DC (geographically remote, more like a DR site for the primary) on SAS drives.

Cassandra config:
Java 1.8.0_65
Garbage Collector: G1GC
memtable_allocation_type: offheap_objects

Since this OOM I am seeing huge hints pile up on the majority of the nodes, and the pending hints keep going up. I have increased HintedHandoff CoreThreads to 6, but that did not help (I admit I tried this on only one node).

nodetool compactionstats -H
pending tasks: 3
   compaction type   keyspace   table   completed   total      unit    progress
        Compaction   system     hints   28.5 GB     92.38 GB   bytes   30.85%

I would appreciate your inputs here.

Thanks,
Shravan