We think it is this bug:
https://issues.apache.org/jira/browse/CASSANDRA-8860

We're rolling a patch to beta before rolling it into production.

On Wed, Mar 4, 2015 at 4:12 PM, graham sanderson <gra...@vast.com> wrote:

> We can confirm a problem on 2.1.3 (sadly our beta sstable state obviously
> did not match our production state in some critical way)
>
> We have about 20k sstables on each of 6 nodes right now; actually, a quick
> glance shows 15k of those are from OpsCenter, which may have something to
> do with the beta/production mismatch
>
> I will look into the open OOM JIRA issue against 2.1.3 - we may be getting
> penalized for heavy use of JBOD (x7 per node)
>
> It also looks like 2.1.3 is leaking memory, though it eventually recovers
> via GCInspector causing a complete memtable flush.
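>
> A rough way to watch that heap behaviour (a minimal sketch using the stock
> JDK jstat tool; the pgrep pattern is just an assumption about how the
> process shows up on your boxes):
>
>   # sample per-generation heap occupancy every 5s for the Cassandra JVM
>   CASSANDRA_PID=$(pgrep -f CassandraDaemon)
>   jstat -gcutil "$CASSANDRA_PID" 5000
>
> If the old generation keeps climbing between those GCInspector-driven
> flushes, that would back up the leak theory.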
>
> On Mar 4, 2015, at 12:31 PM, daemeon reiydelle <daeme...@gmail.com> wrote:
>
> Are you finding a correlation between the shards on the OOM DC1 nodes and
> the OOM DC2 nodes? Does your monitoring tool indicate that the DC1 nodes
> are using significantly more CPU (and memory) than the nodes that are NOT
> failing? I am leading you down the path to suspect that your sharding is
> giving you hot spots. Also are you using vnodes?
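>
> A quick way to check for that skew (just a sketch using stock nodetool
> commands; the keyspace name is a placeholder):
>
>   # per-node load and effective token ownership for a given keyspace;
>   # a lopsided Load or Owns column points at hot spots
>   nodetool status <your_keyspace>
>   # per-node thread pool backlog (pending reads/mutations)
>   nodetool tpstats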
>
> Patrick
>
>>
>> On Wed, Mar 4, 2015 at 9:26 AM, Jan <cne...@yahoo.com> wrote:
>>
>>> HI Roni;
>>>
>>> You mentioned:
>>> DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of
>>> RAM and 5GB HEAP.
>>>
>>> Best practice would be to:
>>> a)  have a consistent type of node across both DCs (CPUs, memory,
>>> heap & disk)
>>> b)  increase the heap on DC2 servers to 8GB for the C* heap (see the
>>> sketch below)
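>>>
>>> A sketch of that heap change (assuming the stock cassandra-env.sh; the
>>> new-gen size is just an example, tune it to your core count):
>>>
>>>   # in cassandra-env.sh on the DC2 nodes, then restart each node
>>>   MAX_HEAP_SIZE="8G"
>>>   HEAP_NEWSIZE="800M"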
>>>
>>> This does not address the leveled compaction issue, though.
>>> Hope this helps.
>>>
>>> Jan/
>>>
>>>
>>>
>>>
>>>   On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar <
>>> ronibaltha...@gmail.com> wrote:
>>>
>>>
>>> Hi there,
>>>
>>> We are running a C* 2.1.3 cluster with 2 datacenters: DC1 with 30
>>> servers and DC2 with 10 servers.
>>> DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
>>> of RAM and 5GB HEAP.
>>> DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
>>> DC2 is used only for backup purposes. There are no reads on DC2.
>>> All writes and reads go to DC1 using LOCAL_ONE, with RF DC1: 2 and
>>> DC2: 1.
>>> All keyspaces use STCS (an average SSTable count of 20~30 per table on
>>> both DCs), except one that uses LCS (DC1: avg 4K~7K SSTables / DC2:
>>> avg 3K~14K SSTables).
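>>>
>>> (For reference, the SSTable counts above come from per-table stats; a
>>> minimal sketch of how we pull them, with placeholder names:)
>>>
>>>   # per-table SSTable count, space used, and read/write latencies
>>>   nodetool cfstats <keyspace>.<table>
>>>   # pending/in-flight compactions on the node
>>>   nodetool compactionstats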
>>>
>>> Basically we are running into 2 problems:
>>>
>>> 1) A high SSTable count on the keyspace using LCS (this KS has
>>> 500GB~600GB of data on each DC1 node).
>>> 2) 2 servers on DC1 and 4 servers on DC2 went down with the OOM error
>>> message below:
>>>
>>> ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394
>>> JVMStabilityInspector.java:94 - JVM state determined to be unstable.
>>> Exiting forcefully due to:
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at
>>> org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>>         at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>>         at
>>> org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>>         at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>>         at
>>> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:203)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:107)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:81)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:320)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1915)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1748)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:342)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:57)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1486)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2171)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> ~[na:1.8.0_31]
>>>         at
>>> org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>         at
>>> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
>>> ~[apache-cassandra-2.1.3.jar:2.1.3]
>>>
>>> So I am asking: how do we debug this issue, and what are the best
>>> practices in this situation?
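>>>
>>> One thing we are considering (a sketch only, using standard HotSpot
>>> flags; the dump path is just an example) is capturing a heap dump when
>>> a node OOMs, so we can see what is actually filling the heap:
>>>
>>>   # appended to the JVM options in cassandra-env.sh
>>>   JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
>>>   JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/tmp"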
>>>
>>> Regards,
>>>
>>> Roni
>>>
>>>
>>>
>>
>
>
