Dear Community, I need your advice.
We have a cluster in which one sixth of the nodes (6 of 36) died for various reasons (3 had OOM messages). Nodes died in groups of 3, 1, 2. No adjacent nodes died, though we use SimpleSnitch.

Version: 1.1.6
Hardware: 12 GB RAM / 8 cores (virtual)
Data: 40 GB/node
Nodes: 36
Keyspaces: 2 (RF=3, R=W=2) + 1 (OpsCenter)
CFs: 36, 2 indexes
Partitioner: Random
Compaction: Leveled (we don't want 2x space for housekeeping)
Caching: Keys only

Everything is pretty much standard, apart from one CF that receives writes in 64K chunks and has sstable_size_in_mb=100. JNA is not installed; this is to be fixed soon.

Checking sysstat/sar, I can see 80-90% CPU idle and no anomalies in I/O; the only change is a spike in network activity.

Before dying, all the nodes had the following in their logs:

> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher 1 4 0
> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter 1 3 0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff 1 6 0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager 5 9

GCInspector warnings were there too; heap usage went from ~0.8 GB to 3 GB in 5-10 minutes.

So, could you please give me a hint on:

1. How many GCInspector warnings per hour are considered 'normal'?
2. What should be the next thing to check?
3. What are the possible failure reasons, and how can they be prevented?

Thank you very much in advance,
Ivan
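
P.S. To put a number on question 1, here is a minimal sketch of how the GCInspector lines could be tallied per hour. It assumes GCInspector lines carry the same prefix as the StatusLogger lines quoted above (level, thread, timestamp, source file); that is an assumption, not something checked against every log. The names count_per_hour and print_report are illustrative only.

```python
import re
from collections import Counter

# Prefix format assumed from the StatusLogger lines quoted above:
#   INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) ...
# An optional leading "> " is tolerated so quoted lines also match.
LINE_RE = re.compile(
    r"^\s*(?:>\s*)?(?P<level>[A-Z]+)\s+\[[^\]]+\]\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2},\d{3}\s+"
    r"(?P<source>\w+)\.java\b"
)


def count_per_hour(lines, source="GCInspector"):
    """Count log lines emitted by `source`, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("source") == source:
            counts[f"{m.group('date')} {m.group('hour')}"] += 1
    return counts


def print_report(path, source="GCInspector"):
    """Print one 'hour count' row per hour found in the log at `path`."""
    # errors="replace" keeps going if the log contains stray bytes.
    with open(path, errors="replace") as fh:
        for hour, n in sorted(count_per_hour(fh, source).items()):
            print(hour, n)
```

Calling print_report("/var/log/cassandra/system.log") would print one row per hour; the path is an assumption and depends on how the node is installed. Comparing the per-hour counts of a node that died with those of a healthy node should at least show whether the GC activity was unusual.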