After having just solved one repair problem, I immediately hit another. Again, much appreciation for suggestions...
I'm having problems repairing a CF, and the failure consistently brings down 2 of the 6 nodes in the cluster. I'm running "repair -pr" on a single CF on node2. The repair starts streaming, and after about 60 seconds both node2 and node4 crash with java.lang.OutOfMemoryError. The keyspace has rf=3 and is being actively written to by our application. The abbreviated logs below show the pattern, after which I kill -9 and restart Cassandra on the two nodes.

What extra info should I include? I'm kind of overwhelmed by the volume of logs being generated and not sure what is signal vs noise. I'm especially seeing big repeating sections of StatusLogger and FlushWriter/Memtable.

Details:
6 node cluster
cassandra 1.2.2 - single token per node
RandomPartitioner, EC2Snitch
Replication: SimpleStrategy, rf=3
Ubuntu 10.10 x86_64
EC2 m1.large
Cassandra max heap: 1867M

node2 (abbreviated logs)

ERROR 21:11:22 AbstractStreamSession.java Stream failed because [node4] died
GC for ConcurrentMarkSweep: 2365 ms for 2 collections, 1913603168 used; max is 1937768448
Pool Name                    Active   Pending   Blocked
ReadStage                         7         7         0
RequestResponseStage              0         0         0
ReadRepairStage                   0         0         0
MutationStage                    32      4707         0
ReplicateOnWriteStage             0         0         0
GossipStage                       0         0         0
AntiEntropyStage                  0         0         0
MigrationStage                    0         0         0
MemtablePostFlusher               1         1         0
FlushWriter                       1         1         0
MiscStage                         0         0         0
commitlog_archiver                0         0         0
InternalResponseStage             0         0         0
AntiEntropySessions               1         1         0
HintedHandoff                     0         0         0
CompactionManager                 1        21
MessagingService                n/a    291,35
WARN 21:12:52 GCInspector.java Heap is 0.9875293252788064 full
INFO 21:12:52 Gossiper.java InetAddress [node5] is now dead.
INFO 21:12:52 Gossiper.java InetAddress [node1] is now dead.
INFO 21:12:52 Gossiper.java InetAddress [node6] is now dead.
INFO 21:12:52 ColumnFamilyStore.java Enqueuing flush of Memtable-[MyCF]@...
INFO 21:12:52 MessagingService.java 4415 MUTATION messages dropped in last 5000ms
INFO 21:12:52 Gossiper.java InetAddress [node5] is now UP
INFO 21:12:52 Gossiper.java InetAddress [node1] is now UP
INFO 21:12:52 Gossiper.java InetAddress [node6] is now UP
INFO 21:12:52 HintedHandOffManager.java Started hinted handoff for host: [node5]
INFO 21:12:52 HintedHandOffManager.java Started hinted handoff for host: [node1]
ERROR 21:12:56 CassandraDaemon.java java.lang.OutOfMemoryError: Java heap space
(full OutOfMemory stack trace is included at bottom)

node4 (abbreviated logs)

INFO 21:10:05 StreamOutSession.java Streaming to [node2]
INFO 21:10:14 CompactionTask.java Compacted 4 sstables to [MyCF-ib-17665]
INFO 21:10:24 StreamReplyVerbHandler.java Successfully sent [MyCF]-ib-17647-Data.db to [node2]
INFO 21:10:24 GCInspector.java GC for ConcurrentMarkSweep
GC for ConcurrentMarkSweep: 764 ms for 3 collections, 1408393640 used; max is 1937768448
GC for ConcurrentMarkSweep: 2198 ms for 2 collections, 1882942392 used; max is 1937768448
Pool Name                    Active   Pending   Blocked
ReadStage                         5         5         0
RequestResponseStage              0        20         0
ReadRepairStage                   0         0         0
MutationStage                     0         0         0
ReplicateOnWriteStage             0         0         0
GossipStage                       0         8         0
AntiEntropyStage                  0         0         0
MigrationStage                    0         0         0
MemtablePostFlusher               0         0         0
FlushWriter                       0         0         0
MiscStage                         0         0         0
commitlog_archiver                0         0         0
InternalResponseStage             0         0         0
AntiEntropySessions               0         0         0
HintedHandoff                     1         1         0
CompactionManager                 0         6
MessagingService                n/a     10,15
INFO 21:11:35 Gossiper.java InetAddress [node5] is now dead.
INFO 21:11:35 Gossiper.java InetAddress [node2] is now dead.
ERROR 21:13:17 CassandraDaemon.java java.lang.OutOfMemoryError: Java heap space
(full OutOfMemory stack trace is included at bottom)

node2 full OOM stack trace:

ERROR [Thread-417] 2013-03-20 21:12:56,114 CassandraDaemon.java (line 133) Exception in thread Thread[Thread-417,5,main]
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.utils.obs.OpenBitSet.<init>(OpenBitSet.java:76)
    at org.apache.cassandra.utils.FilterFactory.createFilter(FilterFactory.java:143)
    at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:114)
    at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:101)
    at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:40)
    at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:31)
    at org.apache.cassandra.db.ColumnIndex$Builder.<init>(ColumnIndex.java:74)
    at org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(SSTableWriter.java:243)
    at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:179)
    at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
    at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:226)
    at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:166)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)

node4 full OOM stack trace:

ERROR [Thread-326] 2013-03-20 21:13:22,829 CassandraDaemon.java (line 133) Exception in thread Thread[Thread-326,5,main]
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.utils.obs.OpenBitSet.<init>(OpenBitSet.java:76)
    at org.apache.cassandra.utils.FilterFactory.createFilter(FilterFactory.java:143)
    at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:114)
    at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:101)
    at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:40)
    at org.apache.cassandra.db.ColumnIndex.<init>(ColumnIndex.java:31)
    at org.apache.cassandra.db.ColumnIndex$Builder.<init>(ColumnIndex.java:74)
    at org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(SSTableWriter.java:243)
    at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:179)
    at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
    at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:226)
    at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:166)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)

Dane