Hi there,

We are running a C* 2.1.3 cluster with 2 datacenters: DC1 has 30 servers and DC2 has 10 servers. DC1 servers have 32GB of RAM and a 10GB heap. DC2 machines have 16GB of RAM and a 5GB heap. DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.

DC2 is used only for backup purposes; there are no reads on DC2. All reads and writes go to DC1 using LOCAL_ONE, and the RF is DC1: 2 and DC2: 1.

All keyspaces use STCS (an average of 20~30 SSTables per table on both DCs) except one that uses LCS (DC1: avg 4K~7K SSTables / DC2: avg 3K~14K SSTables).

Basically we are running into 2 problems:

1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB of data on each DC1 node).
2) There are 2 servers on DC1 and 4 servers in DC2 that went down with the OOM error message below:

ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
    at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:203) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:107) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:81) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:320) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1915) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1748) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:342) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:57) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1486) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2171) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_31]
    at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) ~[apache-cassandra-2.1.3.jar:2.1.3]

So I am asking how to debug this issue and what are the best practices in this situation?

Regards,
Roni