Cassandra node JVM hang during repair of a table with a materialized view

2020-04-15 Thread Ben G
Hello experts

I have a 9-node cluster on AWS. Recently some nodes went down, and I want
to repair the cluster after restarting them. But I found that the repair
operation causes a lot of memtable flushes, after which JVM GC fails and
the node hangs.

I am using Cassandra 3.1.0.

java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b32)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b32, mixed mode)

Each node has 32 GB of memory and 4 CPU cores, with a 16 GB heap and about
200 GB of SSTables.

The JVM hangs very quickly. After the repair process starts, everything
works at first; I checked memory, CPU and I/O and found no pressure. After
some time (perhaps when the streaming tasks are completing), the
MemtableFlushWriter pending tasks increase very fast, then GC fails, the
JVM hangs and a heap dump is created. When the issue happens, CPU usage is
low and I cannot find any I/O latency in the AWS EBS disk metrics.

The logs are below:
WARN  [Service Thread] 2020-04-02 05:07:15,104 GCInspector.java:282 - ConcurrentMarkSweep GC in 6830ms. CMS Old Gen: 12265186360 -> 3201035496; Par Eden Space: 671088640 -> 0; Par Survivor Space: 83886080 -> 0
INFO  [Service Thread] 2020-04-02 05:07:15,104 StatusLogger.java:47 - Pool Name    Active   Pending   Completed   Blocked   All Time Blocked
WARN  [ScheduledTasks:1] 2020-04-02 05:07:15,105 QueryProcessor.java:105 - 2 prepared statements discarded in the last minute because cache limit reached (63 MB)
INFO  [Service Thread] 2020-04-02 05:07:15,171 StatusLogger.java:51 - MutationStage    32   70   145016   0   0
WARN  [Service Thread] 2020-04-02 05:08:30,093 GCInspector.java:282 - ConcurrentMarkSweep GC in 7490ms. CMS Old Gen: 16086342792 -> 9748777920;
WARN  [Service Thread] 2020-04-02 05:09:57,548 GCInspector.java:282 - ConcurrentMarkSweep GC in 7397ms. CMS Old Gen: 15141504128 -> 15001511696;
WARN  [Service Thread] 2020-04-02 05:10:11,207 GCInspector.java:282 - ConcurrentMarkSweep GC in 6552ms. CMS Old Gen: 16065021280 -> 16252475568; Par Eden Space: 671088640 -> 0; Par Survivor Space: 83886080 -> 0
INFO  [Service Thread] 2020-04-02 05:10:11,224 StatusLogger.java:51 - MemtableFlushWriter    2   10800   88712   0   0


I checked the heap dump file. There are several big memtable objects for
the table being repaired, each about 400-700 MB, and they were created
within 20 seconds. In addition, I can see more than 12,000 memtables in
total, of which more than 6,000 are sstable_activity memtables.

At first I suspected the memtable flush writer was the bottleneck, so I
increased it to 4 threads and doubled the memory of the node. That did not
help: during repair the pending tasks still increased quickly and the node
hung again. I also narrowed the repair to a smaller token range, only one
vnode, but it still failed.
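
Roughly, what I tried looks like this (the keyspace, table and tokens
below are placeholders):

    # cassandra.yaml
    memtable_flush_writers: 4

    # sub-range repair limited to a single vnode
    nodetool repair -st <start_token> -et <end_token> my_keyspace my_table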

We also see some logs like this during the streaming tasks:

WARN [STREAM-IN-/10.0.113.12:7000] 2020-04-02 05:05:57,150 BigTableWriter.java:211 - Writing large partition 

The SSTables being written are 300-500 MB each, and some big ones reach
2+ GB.

I went through the Cassandra source code and found that streamed SSTables
must go through the normal write path if the table has a materialized
view. So I suspect the issue occurs in the COMPLETE stage of streaming.

After streaming, the receive callback loads the streamed SSTables for the
updated partitions and applies them as normal write mutations, which grows
the memtables on the heap. It also invokes flush(), which creates extra
memtables beyond those of the repaired table (for example, the
sstable_activity memtables in the heap dump). The total memtable size
exceeds the cleanup threshold, so flush is triggered; but flushing cannot
free memory fast enough, so it is called many times, and each flush in
turn adds more memtables.
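
If I understand the docs correctly, the memtable behaviour above is
governed by these cassandra.yaml settings (the numbers are only my reading
of the defaults for a 16 GB heap, not our exact config):

    # memtable_heap_space_in_mb: 4096     # default: 1/4 of the heap
    # memtable_offheap_space_in_mb: 4096  # default: 1/4 of the heap
    # memtable_cleanup_threshold: 0.2     # default is 1 / (memtable_flush_writers + 1)
    memtable_flush_writers: 4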

Is my conclusion correct? If yes, how can I fix this issue?

--

Thanks
Gb


Re: Cassandra node JVM hang during repair of a table with a materialized view

2020-04-15 Thread Ben G
Thanks a lot for sharing.

The node was added recently. Bootstrap failed because of too many
tombstones, so we brought the node up with bootstrap disabled. Some
SSTables were therefore never created during bootstrap, so the missing
data might be substantial. I have set the repair thread number to 1.
Should I also set the '-seq' flag? In fact, when I set the '-seq' and
'-pl' flags on a very small token range (only two tokens within one
vnode), the JVM issue was not reproduced, but there were still thousands
of flush writer pending tasks at the peak.
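
The small-range repair I ran was roughly of this shape (the tokens, hosts,
keyspace and table are placeholders; as far as I understand, '-pl' also
requires '-hosts' with exactly two endpoints):

    nodetool repair -seq -pl -hosts <this_node_ip>,<source_node_ip> \
        -st <start_token> -et <end_token> my_keyspace my_table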

I have scaled up the EC2 RAM to 64 GB in total, and the JVM heap is now
24 GB, because the documentation recommends using less than half of RAM.
The GC collector is G1. I repaired the node again after the scale-up and
the JVM issue reproduced. Can I increase the heap to 40 GB on a 64 GB VM?

Do you think the issue is related to the materialized view or to the big
partitions?

Thanks

Erick Ramirez wrote on Thu, Apr 16, 2020 at 12:51 PM:

> Is this the first time you've repaired your cluster? Because it sounds
> like it isn't coping. First thing you need to make sure of is to *not*
> run repairs in parallel. It can overload your cluster -- only kick off a
> repair one node at a time on small clusters. For larger clusters, you might
> be able to run it on multiple nodes but only on non-adjacent nodes (or
> nodes far enough around the ring from each other) where you absolutely know
> they don't have overlapping token ranges. If this doesn't make sense or is
> too complicated then just repair one node at a time.
>
> You should also consider running a partitioner-range repair (with the -pr
> flag) so you're only repairing ranges once. This is the quickest and most
> efficient way to repair since it doesn't repair overlapping token ranges
> multiple times. If you're interested, Jeremiah Jordan wrote a nice blog
> post explaining this in detail [1].
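>
> For example, something like this, one node at a time (the keyspace name
> is a placeholder):
>
>     nodetool repair -pr my_keyspace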
>
> Third thing to consider is bumping up the heap on the nodes to 20GB. See
> how it goes. If you need to, maybe go as high as 24GB but understand the
> tradeoffs -- larger heaps mean that GC pauses are longer since there is
> more space to clean up. I also try to reserve 8GB of RAM for the operating
> system so on a 32GB system, 24GB is the most I would personally allocate to
> the heap (my opinion, YMMV).
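>
> As a rough sketch, depending on your version that is either MAX_HEAP_SIZE
> in cassandra-env.sh or the -Xm* lines in conf/jvm.options (sizes here are
> just illustrative):
>
>     MAX_HEAP_SIZE="20G"      # cassandra-env.sh
>     # or equivalently in jvm.options:
>     -Xms20G
>     -Xmx20G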
>
> CMS also doesn't cope well with large heap sizes so depending on your use
> case/data model/access patterns/etc, you might need to switch to G1 GC if
> you really need to go upwards of 20GB. To be clear -- I'm not recommending
> that you switch to G1. I'm just saying that in my experience, CMS isn't
> great with large heap sizes. ;)
>
> Finally, 4 flush writers may be doing your nodes more harm than good since
> your nodes are on EBS, likely just a single volume. More is not always
> better so there's a word of warning for you. Again, YMMV. Cheers!
>
> [1] https://www.datastax.com/blog/2014/07/repair-cassandra
>
> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax
> have answers! Share your expertise on https://community.datastax.com/.
>

-- 

Thanks
Guo Bin


Re: Cassandra node JVM hang during repair of a table with a materialized view

2020-04-16 Thread Ben G
Thanks a lot. We are working on removing the views and controlling the
partition size. I hope these improvements help us.

Best regards

Gb

Erick Ramirez wrote on Thu, Apr 16, 2020 at 2:08 PM:

>> The GC collector is G1. I repaired the node again after the scale-up and
>> the JVM issue reproduced. Can I increase the heap to 40 GB on a 64 GB VM?
>>
>
> I wouldn't recommend going beyond 31GB on G1. It will be diminishing
> returns as I mentioned before.
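>
> In jvm.options terms that is something like the following (sizes are just
> an example; staying at or below ~31GB also keeps compressed oops enabled):
>
>     -XX:+UseG1GC
>     -Xms31G
>     -Xmx31G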
>
>> Do you think the issue is related to the materialized view or to the big
>> partitions?
>>
>
> Yes, materialised views are problematic and I don't recommend them for
> production since they're still experimental. But if I were to guess, I'd
> say your problem is more an issue with large partitions and too many
> tombstones both putting pressure on the heap.
>
> The thing is, if you can't bootstrap because you're running into
> TombstoneOverwhelmingException (I'm guessing), I can't see how you wouldn't
> run into it with repairs. In any case, try running repairs on the smaller
> tables first and work through the remaining tables one by one. But
> bootstrapping a node with repairs is a much more expensive exercise than
> just a plain old bootstrap. I get that you're in a tough spot right now so
> good luck!
>


-- 

Thanks
Guo Bin