After having just solved one repair problem, I immediately hit
another.  Again, much appreciation for suggestions...

I'm having problems repairing a CF, and the failure consistently brings
down 2 of the 6 nodes in the cluster.  I'm running "repair -pr" on a
single CF on node2, the repair starts streaming, and after about 60
seconds both node2 and node4 crash with java.lang.OutOfMemoryError.
The keyspace has rf=3 and is being actively written to by our
application.

The abbreviated logs below show the pattern, after which I kill -9
and restart cassandra on the two nodes.  What extra info should I
include?  I'm kind of overwhelmed by the volume of logs being
generated and not sure what is signal vs noise.  I'm especially seeing
big repeating sections of StatusLogger and FlushWriter/Memtable.
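
One rough way to separate signal from noise is to keep only ERROR/WARN lines plus a few messages that matter for an OOM during repair, and drop the StatusLogger tables and flush chatter. The sketch below assumes the log lines look like the abbreviated ones in this post; the marker list and the names `SIGNAL_MARKERS`, `is_signal` and `filter_log` are illustrative choices, not anything Cassandra provides.

```python
import re

# Substrings treated as "signal". This list is an assumption; adjust to taste.
SIGNAL_MARKERS = (
    "OutOfMemoryError",
    "GC for ConcurrentMarkSweep",
    "Heap is",
    "messages dropped",
    "Stream failed",
    "is now dead",
)

# ERROR and WARN lines are always kept, whatever they say.
LEVEL_RE = re.compile(r"^\s*(ERROR|WARN)\b")


def is_signal(line):
    """Return True for ERROR/WARN lines and lines containing a marker."""
    if LEVEL_RE.match(line):
        return True
    return any(marker in line for marker in SIGNAL_MARKERS)


def filter_log(lines):
    """Yield only the signal lines, without trailing newlines."""
    for line in lines:
        if is_signal(line):
            yield line.rstrip("\n")


# Sample lines taken from the node2 excerpt in this post.
sample = [
    "INFO  21:12:52 Gossiper.java InetAddress [node5] is now UP",
    "WARN  21:12:52 GCInspector.java Heap is 0.9875293252788064 full",
    "MutationStage                    32      4707         0",
    "INFO  21:12:52 MessagingService.java 4415 MUTATION messages dropped in last 5000ms",
    "ERROR 21:12:56 CassandraDaemon.java java.lang.OutOfMemoryError: Java heap space",
]

kept = list(filter_log(sample))
for line in kept:
    print(line)
```

To run it over a real log, pass an open file object to `filter_log` instead of the sample list. The thread pool tables are dropped here, so pull rows such as MutationStage back in by adding them to the marker list if the pending counts are of interest.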

Details:
6 node cluster
cassandra  1.2.2 - single token per node
RandomPartitioner, EC2Snitch
Replication: SimpleStrategy, rf=3
Ubuntu 10.10 x86_64
EC2 m1.large
Cassandra max heap: 1867M


node2 (abbreviated logs)

ERROR 21:11:22 AbstractStreamSession.java Stream failed because [node4] died
GC for ConcurrentMarkSweep: 2365 ms for 2 collections, 1913603168 used; max is 1937768448
Pool Name                    Active   Pending   Blocked
ReadStage                         7         7         0
RequestResponseStage              0         0         0
ReadRepairStage                   0         0         0
MutationStage                    32      4707         0
ReplicateOnWriteStage             0         0         0
GossipStage                       0         0         0
AntiEntropyStage                  0         0         0
MigrationStage                    0         0         0
MemtablePostFlusher               1         1         0
FlushWriter                       1         1         0
MiscStage                         0         0         0
commitlog_archiver                0         0         0
InternalResponseStage             0         0         0
AntiEntropySessions               1         1         0
HintedHandoff                     0         0         0
CompactionManager                 1        21
MessagingService                n/a    291,35
WARN  21:12:52 GCInspector.java Heap is 0.9875293252788064 full
INFO  21:12:52 Gossiper.java InetAddress [node5] is now dead.
INFO  21:12:52 Gossiper.java InetAddress [node1] is now dead.
INFO  21:12:52 Gossiper.java InetAddress [node6] is now dead.
INFO  21:12:52 ColumnFamilyStore.java Enqueuing flush of Memtable-[MyCF]@...
INFO  21:12:52 MessagingService.java 4415 MUTATION messages dropped in last 5000ms
INFO  21:12:52 Gossiper.java InetAddress [node5] is now UP
INFO  21:12:52 Gossiper.java InetAddress [node1] is now UP
INFO  21:12:52 Gossiper.java InetAddress [node6] is now UP
INFO  21:12:52 HintedHandOffManager.java Started hinted handoff for host: [node5]
INFO  21:12:52 HintedHandOffManager.java Started hinted handoff for host: [node1]
ERROR 21:12:56 CassandraDaemon.java java.lang.OutOfMemoryError: Java heap space
(full OutOfMemory stack trace is included at bottom)

node4 (abbreviated logs)

INFO 21:10:05 StreamOutSession.java Streaming to [node2]
INFO 21:10:14 CompactionTask.java Compacted 4 sstables to [MyCF-ib-17665]
INFO 21:10:24 StreamReplyVerbHandler.java Successfully sent [MyCF]-ib-17647-Data.db to [node2]
INFO 21:10:24 GCInspector.java GC for ConcurrentMarkSweep
GC for ConcurrentMarkSweep: 764 ms for 3 collections, 1408393640 used; max is 1937768448
GC for ConcurrentMarkSweep: 2198 ms for 2 collections, 1882942392 used; max is 1937768448
Pool Name                    Active   Pending   Blocked
ReadStage                         5         5         0
RequestResponseStage              0        20         0
ReadRepairStage                   0         0         0
MutationStage                     0         0         0
ReplicateOnWriteStage             0         0         0
GossipStage                       0         8         0
AntiEntropyStage                  0         0         0
MigrationStage                    0         0         0
MemtablePostFlusher               0         0         0
FlushWriter                       0         0         0
MiscStage                         0         0         0
commitlog_archiver                0         0         0
InternalResponseStage             0         0         0
AntiEntropySessions               0         0         0
HintedHandoff                     1         1         0
CompactionManager                 0         6
MessagingService                n/a     10,15
INFO 21:11:35 Gossiper.java InetAddress [node5] is now dead.
INFO 21:11:35 Gossiper.java InetAddress [node2] is now dead.
ERROR 21:13:17 CassandraDaemon.java java.lang.OutOfMemoryError: Java heap space
(full OutOfMemory stack trace is included at bottom)
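
The GC lines above give used and max heap in bytes, so the heap occupancy after each ConcurrentMarkSweep run can be worked out directly. This is a small sketch that assumes the summary format shown in these excerpts; `GC_RE` and `heap_fraction` are names made up for the illustration.

```python
import re

# Matches the GC summary as it appears in the log excerpts in this post.
GC_RE = re.compile(r"(\d+) ms for (\d+) collections, (\d+) used; max is (\d+)")


def heap_fraction(line):
    """Return used/max heap for a GC summary line, or None if it does not match."""
    m = GC_RE.search(line)
    if m is None:
        return None
    used, max_heap = int(m.group(3)), int(m.group(4))
    return used / max_heap


# GC summary lines copied from the node2 and node4 excerpts.
gc_lines = [
    "GC for ConcurrentMarkSweep: 2365 ms for 2 collections, 1913603168 used; max is 1937768448",
    "GC for ConcurrentMarkSweep: 764 ms for 3 collections, 1408393640 used; max is 1937768448",
    "GC for ConcurrentMarkSweep: 2198 ms for 2 collections, 1882942392 used; max is 1937768448",
]

for line in gc_lines:
    print(f"{heap_fraction(line):.4f}  {line}")
```

The node2 figure comes out at roughly 0.9875, consistent with the "Heap is 0.9875293252788064 full" warning, and node4 climbs from about 0.73 to about 0.97 between its two GC lines. In other words, CMS is reclaiming very little on either node shortly before the OutOfMemoryError.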




node2 full OOM stack trace:

ERROR [Thread-417] 2013-03-20 21:12:56,114 CassandraDaemon.java (line 133) Exception in thread Thread[Thread-417,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.utils.obs.OpenBitSet.<init>(OpenBitSet.java:76)
        at org.apache.cassandra.utils.FilterFactory.createFilter(FilterFactory.java:143)
        at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:114)
        at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:101)
        at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:40)
        at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:31)
        at org.apache.cassandra.db.ColumnIndex$Builder.<init>(ColumnIndex.java:74)
        at org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(SSTableWriter.java:243)
        at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:179)
        at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
        at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:226)
        at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:166)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)

node4 full OOM stack trace:

ERROR [Thread-326] 2013-03-20 21:13:22,829 CassandraDaemon.java (line 133) Exception in thread Thread[Thread-326,5,main]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.cassandra.utils.obs.OpenBitSet.<init>(OpenBitSet.java:76)
        at org.apache.cassandra.utils.FilterFactory.createFilter(FilterFactory.java:143)
        at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:114)
        at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:101)
        at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:40)
        at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:31)
        at org.apache.cassandra.db.ColumnIndex$Builder.<init>(ColumnIndex.java:74)
        at org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(SSTableWriter.java:243)
        at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:179)
        at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
        at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:226)
        at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:166)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)


Dane
