Hello experts

I have a 9-node cluster on AWS. Recently some nodes went down, and after
restarting them I wanted to repair the cluster. However, I found that the
repair operation causes a lot of memtable flushes, after which JVM GC fails
and the node hangs.

I am using Cassandra 3.1.0.

java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b32)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b32, mixed mode)

Each node has 32 GB of memory and 4 CPU cores, with a 16 GB heap and about
200 GB of SSTables.
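
For completeness, the GC settings are the stock CMS configuration from
jvm.options, roughly as follows (quoting from memory, so the exact flags on
the nodes may differ slightly):

    -Xms16G
    -Xmx16G
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled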

The JVM hangs very quickly. After the repair process starts, everything
works at first; I checked memory, CPU and I/O and found no pressure. After
some time (maybe when the streaming tasks are completing), the
MemtableFlushWriter pending tasks increase very fast, then GC fails, the
JVM hangs, and a heap dump is created. When the issue happens, CPU usage is
low and I cannot see any I/O latency in the AWS EBS disk metrics.
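
(For reference, besides the StatusLogger output below, I was watching the
flush backlog roughly like this:

    watch -n 5 'nodetool tpstats | grep -E "MemtableFlushWriter|MutationStage"'
)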

Logs are below:
WARN [Service Thread] 2020-04-02 05:07:15,104 GCInspector.java:282 -
ConcurrentMarkSweep GC in 6830ms. CMS Old Gen: 12265186360 -> 3201035496;
Par Eden Space: 671088640 -> 0; Par Survivor Space: 83886080 -> 0

INFO [Service Thread] 2020-04-02 05:07:15,104 StatusLogger.java:47 - Pool Name    Active   Pending   Completed   Blocked   All Time Blocked

WARN [ScheduledTasks:1] 2020-04-02 05:07:15,105 QueryProcessor.java:105 - 2 prepared statements discarded in the last minute because cache limit reached (63 MB)

INFO [Service Thread] 2020-04-02 05:07:15,171 StatusLogger.java:51 - MutationStage    32    70    1450161111    0    0
WARN [Service Thread] 2020-04-02 05:08:30,093 GCInspector.java:282 -
ConcurrentMarkSweep GC in 7490ms. CMS Old Gen: 16086342792 -> 9748777920;
WARN [Service Thread] 2020-04-02 05:09:57,548 GCInspector.java:282 -
ConcurrentMarkSweep GC in 7397ms. CMS Old Gen: 15141504128 -> 15001511696;
WARN  [Service Thread] 2020-04-02 05:10:11,207 GCInspector.java:282 -
ConcurrentMarkSweep GC in 6552ms.  CMS Old Gen: 16065021280 -> 16252475568;
Par Eden Space: 671088640 -> 0; Par Survivor Space: 83886080 -> 0
INFO  [Service Thread] 2020-04-02 05:10:11,224 StatusLogger.java:51 - MemtableFlushWriter    2    10800    88712    0    0


I checked the heap dump file. There are several large memtable objects for
the table being repaired, each about 400 - 700 MB, and they were created
within 20 seconds. In addition, I can see more than 12,000 memtables in
total, of which more than 6,000 are sstable_activity memtables.

At first I suspected the memtable flush writer was the bottleneck, so I
increased it to 4 threads and doubled the node's memory, but that didn't
help: during repair the pending tasks still grew quickly and the node hung
again. I also reduced the repair token range to a single vnode, but it
still failed.
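
Concretely, the changes were roughly the following (the token values in the
subrange repair are placeholders, not the real ones):

In cassandra.yaml:

    memtable_flush_writers: 4

Subrange repair for a single vnode:

    nodetool repair -st <start_token> -et <end_token> <keyspace> <table>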

We also see some logs like this during the streaming task:

WARN [STREAM-IN-/10.0.113.12:7000] 2020-04-02 05:05:57,150
BigTableWriter.java:211 - Writing large partition ....

The SSTables being written are 300 - 500 MB, and some large ones reach 2+ GB.

I went through the Cassandra source code and found that streamed SSTables
must be pushed through the normal write path if the table has a
materialized view. So I suspect the issue occurs in the COMPLETE stage of
streaming.
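
For context, here is a heavily simplified sketch of what I believe the
stream-receive completion does when the base table has a view. The class
and method names below are paraphrased from my reading of the 3.x source,
not the exact Cassandra code:

    // Simplified sketch of the stream-receive completion path;
    // names are illustrative, not the real Cassandra identifiers.
    void onStreamReceiveCompleted(ColumnFamilyStore cfs, Collection<SSTableReader> readers)
    {
        if (cfs.hasViews())   // base table has a materialized view
        {
            // Cannot simply add the SSTables to the live set: the views must be
            // kept in sync, so every streamed partition is re-applied as a normal
            // mutation through the write path, i.e. into memtables on the heap.
            for (SSTableReader reader : readers)
            {
                try (ISSTableScanner scanner = reader.getScanner())
                {
                    while (scanner.hasNext())
                    {
                        try (UnfilteredRowIterator partition = scanner.next())
                        {
                            new Mutation(PartitionUpdate.fromIterator(partition, ColumnFilter.all(cfs.metadata)))
                                .applyUnsafe();
                        }
                    }
                }
            }
            // The base table is then flushed, which also drags in flushes of
            // other tables (e.g. system ones), creating more memtables.
            cfs.forceBlockingFlush();
        }
        else
        {
            // No views: the streamed SSTables are just added to the live data set.
            cfs.addSSTables(readers);
        }
    }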

After streaming, the receive callback loads the streamed partitions from
the SSTables and applies them as mutations, just like normal writes, which
grows the memtables on the heap. It also invokes flush(), which creates
extra memtables beyond the repaired table's (for example, the
sstable_activity memtables in the heap dump). Once the memtable size
exceeds the cleanup threshold, a flush is triggered; but each flush cannot
free enough memory, so flush is called again and again, and the flushes in
turn create even more memtables.
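
If I read the defaults correctly (please correct me if I'm wrong),
memtable_heap_space_in_mb defaults to 1/4 of the heap and
memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1), so
with the 16 GB heap and 2 flush writers (what I assume the setting was
before I bumped it to 4) the numbers look like this:

    memtable heap space  = 16 GB / 4     = 4 GB
    cleanup threshold    = 1 / (2 + 1)   ~ 0.33
    flush trigger        = 4 GB * 0.33   ~ 1.3 GB of memtable heap

A handful of the 400 - 700 MB memtables built within ~20 seconds from the
streamed partitions is already enough to keep hitting that trigger, while
each flush also touches tables such as sstable_activity and creates yet
more memtables.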

Is my conclusion correct? If yes, how can I fix this issue?

--

Thanks
Gb
