Hello experts,

I have a 9-node cluster on AWS. Recently some nodes went down, and after restarting them I wanted to repair the cluster. However, the repair operation triggers a large number of memtable flushes, the JVM GC then fails to keep up, and the node hangs.
I am using Cassandra 3.1.0 with:

java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b32)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b32, mixed mode)

Each node has 32 GB of memory and 4 CPU cores, the heap is 16 GB, and each node holds about 200 GB of sstables.

The JVM hangs quite quickly. After the repair process starts, everything works fine at first; I checked memory, CPU, and I/O and found no pressure. After some time (perhaps when the streaming tasks are completing), the MemtableFlushWriter pending tasks increase very fast, GC fails, the JVM hangs, and a heap dump is created. When the issue happens, CPU usage is low and I cannot find any I/O latency in the AWS EBS disk metrics. Logs below:

WARN [Service Thread] 2020-04-02 05:07:15,104 GCInspector.java:282 - ConcurrentMarkSweep GC in 6830ms. CMS Old Gen: 12265186360 -> 3201035496; Par Eden Space: 671088640 -> 0; Par Survivor Space: 83886080 -> 0
13:07:01 INFO [Service Thread] 2020-04-02 05:07:15,104 StatusLogger.java:47 - Pool Name Active Pending Completed Blocked All Time Blocked
13:07:01 WARN [ScheduledTasks:1] 2020-04-02 05:07:15,105 QueryProcessor.java:105 - 2 prepared statements discarded in the last minute because cache limit reached (63 MB)
13:07:01 INFO [Service Thread] 2020-04-02 05:07:15,171 StatusLogger.java:51 - MutationStage 32 70 1450161111 0 0
WARN [Service Thread] 2020-04-02 05:08:30,093 GCInspector.java:282 - ConcurrentMarkSweep GC in 7490ms. CMS Old Gen: 16086342792 -> 9748777920;
WARN [Service Thread] 2020-04-02 05:09:57,548 GCInspector.java:282 - ConcurrentMarkSweep GC in 7397ms. CMS Old Gen: 15141504128 -> 15001511696;
WARN [Service Thread] 2020-04-02 05:10:11,207 GCInspector.java:282 - ConcurrentMarkSweep GC in 6552ms. CMS Old Gen: 16065021280 -> 16252475568; Par Eden Space: 671088640 -> 0; Par Survivor Space: 83886080 -> 0
INFO [Service Thread] 2020-04-02 05:10:11,224 StatusLogger.java:51 - MemtableFlushWriter 2 10800 88712 0 0

I checked the heap dump file. There are several large memtable objects for the table being repaired, each about 400-700 MB, and they were created within 20 seconds. In addition, I can see more than 12,000 memtables in total, of which more than 6,000 are sstable_activity memtables.

At first I suspected the memtable flush writer was the bottleneck, so I increased it to 4 threads and also doubled the node's memory, but that did not help: during the repair the pending tasks still grew quickly and the node hung again. I also reduced the repair token range to a single vnode, but it still failed.

We also see logs like this during the streaming tasks:

WARN [STREAM-IN-/10.0.113.12:7000] 2020-04-02 05:05:57,150 BigTableWriter.java:211 - Writing large partition ....

The sstables being written are 300-500 MB, and some large ones reach 2+ GB.

I went through the Cassandra source code and found that streamed sstables must be processed through the normal write path if the table has a materialized view. So I suspect the issue occurs in the COMPLETE stage of streaming: after streaming, the receive callback loads the received partitions from the sstables and applies them as mutations, just like normal writes, which grows the memtables on the heap. It also invokes flush(), which creates extra memtables beyond the repaired tables (for example, the sstable_activity memtables seen in the heap dump). The memtable size then exceeds the cleanup threshold, so a flush is triggered; but flushing cannot free memory fast enough, so flush is called many more times, and each flush in turn creates even more memtables.
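To make my reading of the code concrete, here is a simplified paraphrase of what I believe the stream-completion logic does when a materialized view is involved. The class and method names below (onStreamCompleted, readPartitionsAsMutations, applyThroughWritePath, and so on) are mine for illustration only, not the real Cassandra API:

import java.util.List;

// Illustrative stand-ins only -- these are NOT real Cassandra types.
public class StreamReceiveSketch {

    interface Mutation { void applyThroughWritePath(); }           // ends up in a memtable

    interface SSTable { Iterable<Mutation> readPartitionsAsMutations(); }

    interface Table {
        boolean hasMaterializedView();
        void addSSTablesDirectly(List<SSTable> sstables);           // no memtables involved
    }

    // Roughly what I think happens when a stream session completes.
    static void onStreamCompleted(Table table, List<SSTable> received) {
        if (table.hasMaterializedView()) {
            // With a view, the received data has to go through the normal write
            // path so the view stays consistent: every streamed partition becomes
            // mutations that accumulate in memtables on the heap, and a flush is
            // triggered afterwards (which also touches system tables such as
            // sstable_activity, creating even more memtables).
            for (SSTable sstable : received) {
                for (Mutation m : sstable.readPartitionsAsMutations()) {
                    m.applyThroughWritePath();
                }
            }
        } else {
            // Without a view, the streamed sstables are simply added to the live
            // set and create no memtable pressure at all.
            table.addSSTablesDirectly(received);
        }
    }
}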
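And here is a back-of-envelope model of how quickly the MemtableFlushWriter backlog could outgrow the heap under this scenario. The inflow and flush rates are assumptions loosely based on the heap dump numbers above (several 400-700 MB memtables within ~20 seconds), not measured values, and I assume the default memtable heap budget of about 1/4 of the heap:

// Back-of-envelope model of the flush backlog. All rates are ASSUMED values,
// loosely derived from the heap dump observations above, not measurements.
public class FlushBacklogSketch {

    public static void main(String[] args) {
        // Assumption: re-applying streamed partitions creates ~1.5 GB of new
        // memtable data every 20 seconds.
        double inflowMbPerSec = 1500.0 / 20.0;              // ~75 MB/s

        // Assumption: the flush writer threads drain ~30 MB/s in total to EBS.
        double flushMbPerSec = 30.0;

        // Default memtable heap budget: roughly 1/4 of the 16 GB heap.
        double memtableBudgetMb = 16 * 1024 / 4.0;          // ~4096 MB

        double backlogMb = 0;
        for (int sec = 1; sec <= 600; sec++) {
            backlogMb += inflowMbPerSec - flushMbPerSec;    // net growth per second
            if (backlogMb >= memtableBudgetMb) {
                System.out.printf("Memtable budget (%.0f MB) exceeded after ~%d s; "
                        + "backlog keeps growing at ~%.0f MB/s%n",
                        memtableBudgetMb, sec, inflowMbPerSec - flushMbPerSec);
                return;
            }
        }
        System.out.println("Flush keeps up under these assumptions.");
    }
}

Under these assumed rates the memtable budget is exhausted in roughly 90 seconds, and the backlog only keeps growing after that.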
Is my conclusion correct? If yes, how can I fix this issue?

--
Thanks,
Gb