I don't think the root cause is related to the Cassandra config, because the nodes are homogeneous and the config is the same for all of them (16 GB heap with the default GC). The mutation and Native Transport counters are also the same on all of the nodes, yet only these 3 nodes show 100% CPU usage (the others stay below 20%).
I even decommissioned these 3 nodes from the cluster and re-added them, but the result is the same. The cluster is fine without these 3 nodes (i.e. while they are decommissioned).

Sent using Zoho Mail

============ Forwarded message ============
From: Chris Lohfink <clohf...@apple.com>
To: <user@cassandra.apache.org>
Date: Sat, 20 Oct 2018 23:24:03 +0330
Subject: Re: High CPU usage on some of the nodes due to message coalesce
============ Forwarded message ============

1-second young GCs are horrible and are likely the cause of some of your bad metrics.
How large are your mutations/query results, and what GC/heap settings are you using? You can use https://github.com/aragozin/jvm-tools to see the threads generating allocation pressure and using the CPU (ttop) and what garbage is being created (hh --dead-young).
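For example (a rough sketch; sjk.jar is the jar built from that repo and <pid> is the Cassandra process id):

    # show the threads using CPU and generating allocation pressure
    java -jar sjk.jar ttop -p <pid>
    # class histogram of recently died young-gen objects (what garbage is being created)
    java -jar sjk.jar hh -p <pid> --dead-young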
Just a shot in the dark, but I would guess you have rather large mutations putting pressure on the commitlog and heap. G1 with a larger heap might help in that scenario: it reduces fragmentation and adjusts its eden and survivor regions to the allocation rate better (but give it a bigger reserve space). However, there are limits to what can help if you can't change your workload.
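Something along these lines in conf/jvm.options, purely as a sketch (the heap size and percentages are illustrative, not a recommendation for your hardware):

    ## comment out the default CMS settings and switch to G1
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500
    ## "bigger reserve space": raise the G1 reserve above its default
    -XX:G1ReservePercent=20
    ## a heap larger than the current 16 GB, e.g.
    -Xms24G
    -Xmx24G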
Without more info on the schema etc. it's hard to tell, but maybe that can give you some ideas on places to look. It could just as likely be repair coordination, wide-partition reads, or compactions, so you need to look more at what within the app is causing the pressure to know whether it is possible to improve things with settings, or whether the load your application is producing exceeds what your cluster can handle (and it needs more nodes).

Chris

On Oct 20, 2018, at 5:18 AM, onmstester onmstester <onmstes...@zoho.com.INVALID> wrote:

3 nodes in my cluster have 100% CPU usage, and most of it is used by org.apache.cassandra.util.coalesceInternal and SepWorker.run. The most active threads are the messaging-service-incoming ones.
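(For context, the "message coalesce" in the subject refers to the outbound-message coalescing controlled in cassandra.yaml; a minimal sketch of those settings, with illustrative values rather than my actual ones:)

    # cassandra.yaml -- outbound message coalescing (Cassandra 3.11.x)
    # strategies include DISABLED, FIXED, MOVINGAVERAGE and TIMEHORIZON
    otc_coalescing_strategy: DISABLED
    # coalescing window in microseconds when a strategy is enabled
    otc_coalescing_window_us: 200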
The other nodes are normal. The cluster has 30 nodes and uses a rack-aware strategy with 10 racks of 3 nodes each; the problematic nodes are all configured for one rack. Under normal write load, system.log reports a lot of dropped hint messages (cross node). There are also a lot of ParNew GCs of about 700-1000 ms, and the dedicated commitlog disk is utilized at about 80-90%. On startup of these 3 nodes there are a lot of "updating topology" log lines (thousands of them pending). Using iperf, I'm sure the network is OK, and checking NTP and mutations on each node shows the load is balanced among the nodes. I am using Apache Cassandra 3.11.2.
I cannot figure out the root cause of the problem, although there are some obvious symptoms.

Best Regards

Sent using Zoho Mail
