Are you finding a correlation between the shards on the OOM DC1 nodes and the OOM DC2 nodes? Does your monitoring tool indicate that the DC1 nodes are using significantly more CPU (and memory) than the nodes that are NOT failing? I am steering you toward suspecting that your sharding is giving you hot spots. Also, are you using vnodes?
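One rough way to check for hot spots is to compare a per-node figure from your monitoring tool (heap used, CPU, read rate) against the cluster average. Below is a minimal Python sketch of that comparison. The function name `find_hot_nodes`, the 1.5x threshold, and the sample node names and numbers are all hypothetical illustrations, not values from this cluster.

```python
from statistics import mean


def find_hot_nodes(metric_by_node, threshold=1.5):
    """Return the nodes whose metric exceeds `threshold` times the mean.

    `metric_by_node` maps a node name to any per-node figure exported by
    your monitoring tool (heap used, CPU, read ops/s, ...).
    """
    if not metric_by_node:
        return []
    average = mean(metric_by_node.values())
    return sorted(
        node
        for node, value in metric_by_node.items()
        if value > threshold * average
    )


# Hypothetical figures for illustration only (for example, heap used in GB).
sample = {
    "dc1-node01": 4.1,
    "dc1-node02": 4.3,
    "dc1-node03": 9.8,
    "dc1-node04": 4.0,
}

print(find_hot_nodes(sample))  # ['dc1-node03']
```

If the nodes flagged this way are the same ones that hit the OOM, that would point toward uneven data or request distribution rather than a cluster-wide heap sizing problem.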
Patrick
>
> On Wed, Mar 4, 2015 at 9:26 AM, Jan <cne...@yahoo.com> wrote:
>
>> Hi Roni;
>>
>> You mentioned:
>> DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of
>> RAM and 5GB HEAP.
>>
>> Best practices would be to:
>> a) have a consistent type of node across both DC's. (CPUs, Memory, Heap
>> & Disk)
>> b) increase heap on DC2 servers to be 8GB for C* Heap
>>
>> The leveled compaction issue is not addressed by this.
>> Hope this helps.
>>
>> Jan/
>>
>>
>> On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar <
>> ronibaltha...@gmail.com> wrote:
>>
>>
>> Hi there,
>>
>> We are running a C* 2.1.3 cluster with 2 DataCenters: DC1: 30 Servers /
>> DC2: 10 Servers.
>> DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
>> of RAM and 5GB HEAP.
>> DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
>> DC2 is used only for backup purposes. There are no reads on DC2.
>> All writes and reads are on DC1 using LOCAL_ONE, and the RF is DC1: 2 and
>> DC2: 1.
>> All keyspaces have STCS (Average 20~30 SSTables count each table on
>> both DCs) except one that is using LCS (DC1: Avg 4K~7K SSTables / DC2:
>> Avg 3K~14K SSTables).
>>
>> Basically we are running into 2 problems:
>>
>> 1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB
>> of data on each DC1 node).
>> 2) There are 2 servers on DC1 and 4 servers in DC2 that went down with
>> the OOM error message below:
>>
>> ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
>>     at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
>>     at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:203) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:107) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:81) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:320) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1915) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1748) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:342) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:57) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1486) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2171) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_31]
>>     at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) ~[apache-cassandra-2.1.3.jar:2.1.3]
>>
>> So I am asking how to debug this issue, and what are the best practices
>> in this situation?
>>
>> Regards,
>>
>> Roni