What takes the most CPU: system or user? Did you try removing a problematic node and bootstrapping a brand-new node in its place (instead of re-adding the same one)? When you decommissioned these nodes, did the high CPU "move" to other nodes (which would suggest data model/query issues), or was it completely gone (which would suggest server issues)?
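For the system-vs-user split, something like the following should show it (a minimal sketch; pidstat comes from the sysstat package, and the pgrep pattern assumes the stock CassandraDaemon main class):

    # Overall host breakdown: watch the %us (user) vs %sy (system) columns.
    top -b -n 1 | head -n 5

    # Per-process breakdown for the Cassandra JVM, sampled every 5s, 3 times.
    CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
    pidstat -u -p "$CASSANDRA_PID" 5 3

Roughly: high %system points at the kernel side (network and disk syscalls, which would fit the messaging/coalescing threads), while high %user points at work inside the JVM (GC, compaction, request processing).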
On Sun, Oct 21, 2018 at 3:52 PM onmstester onmstester <onmstes...@zoho.com.invalid> wrote:

> I don't think the root cause is related to the Cassandra config, because the
> nodes are homogeneous and the config for all of them is the same (16GB heap
> with the default GC). The mutation counter and Native Transport counter are
> also the same on all of the nodes, but only these 3 nodes are experiencing
> 100% CPU usage (the others are under 20%).
> I even decommissioned these 3 nodes from the cluster and re-added them, but
> it is still the same.
> The cluster is OK without these 3 nodes (i.e. while they are decommissioned).
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
> ============ Forwarded message ============
> From: Chris Lohfink <clohf...@apple.com>
> To: <user@cassandra.apache.org>
> Date: Sat, 20 Oct 2018 23:24:03 +0330
> Subject: Re: High CPU usage on some of the nodes due to message coalesce
> ============ Forwarded message ============
>
> 1s young GCs are horrible and are likely the cause of *some* of your bad
> metrics. How large are your mutations/query results, and what GC/heap
> settings are you using?
>
> You can use https://github.com/aragozin/jvm-tools to see the threads
> generating allocation pressure and using the CPU (ttop), and what garbage
> is being created (hh --dead-young).
>
> Just a shot in the dark, but I would *guess* you have rather large mutations
> putting pressure on the commitlog and heap. G1 with a larger heap might help
> in that scenario, reducing fragmentation and adjusting its eden and survivor
> regions to match the allocation rate (but give it a bigger reserve space).
> Still, there are limits to what settings can do if you can't change your
> workload. Without more info on the schema etc. it's hard to tell, but maybe
> that can give you some ideas on places to look. It could just as likely be
> repair coordination, wide-partition reads, or compactions, so you need to
> look more at what within the app is causing the pressure to know whether it
> can be improved with settings or whether the load your application produces
> exceeds what your cluster can handle (i.e. it needs more nodes).
>
> Chris
>
> On Oct 20, 2018, at 5:18 AM, onmstester onmstester <onmstes...@zoho.com.INVALID> wrote:
>
> 3 nodes in my cluster have 100% CPU usage, and most of it is spent in
> org.apache.cassandra.util.coalesceInternal and SepWorker.run. The most
> active threads are messaging-service-incoming.
> The other nodes are normal. The cluster has 30 nodes and uses a rack-aware
> strategy with 10 racks, each holding 3 nodes. The 3 problematic nodes are
> all configured for the same rack. Under normal write load, system.log
> reports too many dropped hint messages (cross node); there are also a lot
> of ParNew GCs of about 700-1000 ms, and the commitlog's isolated disk is
> 80-90% utilized. On startup of these 3 nodes there are a lot of "updating
> topology" log entries (1000s of them pending).
> Using iperf, I'm sure the network is OK.
> Checking NTP and mutations on each node, load is balanced among the nodes.
> Using Apache Cassandra 3.11.2.
> I cannot figure out the root cause of the problem, although there are some
> obvious symptoms.
>
> Best Regards
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
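To expand on Chris's jvm-tools pointer above, the invocations look roughly like this (sjk is the jar the project builds; flags are from memory, so check the project README):

    # Top threads inside the Cassandra JVM, ordered by CPU.
    java -jar sjk.jar ttop -p <cassandra-pid> -o CPU -n 20

    # Class histogram of objects dying young, i.e. what is creating the
    # allocation pressure behind those long ParNew pauses.
    java -jar sjk.jar hh -p <cassandra-pid> --dead-young

If coalesceInternal/SepWorker really dominate ttop's output on the three hot nodes but not elsewhere, that narrows it to inter-node messaging rather than client load.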
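If you end up trying G1 as Chris suggests, a starting point in conf/jvm.options could look like this (values are illustrative only, not tuned for your workload; the heap size assumes you have RAM to spare):

    ## Comment out the CMS/ParNew section first, then:
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500
    ## The "bigger reserve space" Chris mentions:
    -XX:G1ReservePercent=20
    -XX:InitiatingHeapOccupancyPercent=70
    ## G1 copes better than CMS with heaps beyond 16GB, e.g.:
    -Xms20G
    -Xmx20G

Note that jvm.options takes one JVM flag per line, so keep comments on their own lines.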
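Finally, since you're seeing cross-node hint drops and a backlog of "updating topology" work, it's worth comparing a hot node against a normal one with the standard tools:

    # Pending/blocked tasks per stage, plus dropped-message counters at the end.
    nodetool tpstats

    # Cross-node drop accounting depends on clock sync; compare peer offsets.
    ntpq -p

If tpstats shows the mutation or messaging stages backing up only on the three nodes in that one rack, I'd look hard at whatever makes that rack physically different (NIC settings, switch, kernel version), since a data model problem should hit all racks roughly equally.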