What takes the most CPU, system or user time? (A quick way to check is sketched below.)
Did you try removing a problematic node and installing a brand-new one in its
place (instead of re-adding the same node)?
When you decommissioned these nodes, did the high CPU "move" to other nodes
(which would point to data model/query issues), or was it completely gone
(which would point to server issues)?
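
For the first question, something like this (a sketch assuming Linux with
sysstat installed; adjust the pgrep pattern to however you start Cassandra)
splits the two:

    # %usr vs %system for the Cassandra JVM, sampled every 5 seconds
    pidstat -u -p $(pgrep -f CassandraDaemon) 5

High %system tends to point at the kernel side (network, syscalls, interrupt
handling), while high %usr points at the JVM itself (GC, compaction, request
processing).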


On Sun, Oct 21, 2018 at 3:52 PM onmstester onmstester
<onmstes...@zoho.com.invalid> wrote:

> I don't think the root cause is related to the Cassandra config, because
> the nodes are homogeneous and the config is the same for all of them (16GB
> heap with default GC); the mutation counter and the Native Transport
> counter are also the same on all nodes, yet only these 3 nodes experience
> 100% CPU usage (the others are below 20%).
> I even decommissioned these 3 nodes from the cluster and re-added them, but
> the result is the same.
> The cluster is OK without these 3 nodes (i.e., while they are
> decommissioned).
>
> ============ Forwarded message ============
> From : Chris Lohfink <clohf...@apple.com>
> To : <user@cassandra.apache.org>
> Date : Sat, 20 Oct 2018 23:24:03 +0330
> Subject : Re: High CPU usage on some of the nodes due to message coalesce
> ============ Forwarded message ============
>
> 1s young GCs are horrible and likely the cause of *some* of your bad
> metrics. How large are your mutations/query results, and what GC/heap
> settings are you using?
>
> You can use https://github.com/aragozin/jvm-tools to see the threads
> generating allocation pressure and using the CPU (ttop), and what garbage
> is being created (hh --dead-young).
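> For example (a sketch; double-check the flags against sjk's own help
> output, and substitute the Cassandra process id for <pid>):
>
>     # top threads ordered by CPU usage
>     java -jar sjk.jar ttop -p <pid> -o CPU -n 30
>     # histogram of objects that are dying young
>     java -jar sjk.jar hh -p <pid> --dead-young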
>
> Just a shot in the dark, but I would *guess* you have rather large
> mutations putting pressure on the commitlog and heap. G1 with a larger heap
> might help in that scenario: it reduces fragmentation and adjusts its eden
> and survivor regions to the allocation rate (but give it a bigger reserve
> space). Still, there are limits to what GC tuning can do if you can't
> change your workload. Without more info on the schema etc. it's hard to
> tell, but maybe that gives you some ideas on places to look. It could just
> as likely be repair coordination, wide-partition reads, or compactions, so
> you need to look more at what within the app is causing the pressure to
> know whether settings can improve things or whether the load your
> application produces exceeds what your cluster can handle (i.e., it needs
> more nodes).
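>
> If you try G1, a rough starting point (assuming conf/jvm.options in 3.11;
> the sizes below are guesses you would need to tune against your own
> allocation rate and available RAM):
>
>     # comment out the CMS flags first, then:
>     -Xms24G
>     -Xmx24G
>     -XX:+UseG1GC
>     -XX:MaxGCPauseMillis=300
>     -XX:G1ReservePercent=20
>     -XX:InitiatingHeapOccupancyPercent=70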
>
> Chris
>
> On Oct 20, 2018, at 5:18 AM, onmstester onmstester <
> onmstes...@zoho.com.INVALID> wrote:
>
> 3 nodes in my cluster have 100% CPU usage, and most of it is spent in
> org.apache.cassandra.util.coalesceInternal and SepWorker.run; the most
> active threads are MessagingService-Incoming.
> The other nodes are normal. The cluster has 30 nodes with a rack-aware
> strategy: 10 racks, each with 3 nodes. The 3 problematic nodes are all
> configured in one rack. Under normal write load, system.log reports too
> many dropped hint messages (cross node). There are also a lot of ParNew GCs
> taking about 700-1000 ms, and the isolated commitlog disk is utilized at
> about 80-90%. On startup of these 3 nodes there are a lot of "updating
> topology" log lines (1000s of them pending).
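> Since the hot frames are in the message-coalescing path, one experiment
> (assuming the otc_coalescing_strategy knob in 3.11's cassandra.yaml; check
> what the affected nodes currently have it set to first) would be:
>
>     # cassandra.yaml - switch off outbound message coalescing on hot nodes
>     otc_coalescing_strategy: DISABLED
>
> nodetool tpstats on the affected nodes should also show which thread pools
> are backing up and which message types are being dropped.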
> Using iperf, I'm confident the network is OK.
> Checking NTP and mutation counts on each node, load is balanced among the
> nodes.
> I'm using Apache Cassandra 3.11.2.
> I cannot figure out the root cause of the problem, although there are some
> obvious symptoms.
>
> Best Regards
>
