Thanks for the responses, guys. I also suspected GC, and it looks like that could be it: during the spikes the logs on the affected nodes are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for 1 collections, 1986282520 used; max is 8375238656", often right before messages about dropped queries. The other, unaffected nodes only have messages like "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is 8375238656".
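For anyone who wants to check their own logs the same way, here is a rough sketch of how the long pauses could be pulled out of system.log. The regex is based only on the two messages quoted above, the function names (parse_gc_line, long_pauses) and the 1000 ms threshold are my own choices for illustration, and whatever prefix the real log lines carry (level, thread, timestamp) is simply ignored.

```python
import re

# Pattern derived from the two GC messages quoted above. Anything before
# "GC for" on the log line is ignored by using search() instead of match().
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def parse_gc_line(line):
    """Return a dict describing one GC message, or None if the line is not one."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    return {
        "collector": m.group("collector"),
        "pause_ms": int(m.group("ms")),
        "collections": int(m.group("count")),
        "used_bytes": int(m.group("used")),
        "max_bytes": int(m.group("max")),
    }


def long_pauses(lines, threshold_ms=1000):
    """Collect GC events whose reported time is at least threshold_ms.

    threshold_ms is an arbitrary cut-off chosen for this sketch, not a
    recommended value.
    """
    events = []
    for line in lines:
        event = parse_gc_line(line)
        if event is not None and event["pause_ms"] >= threshold_ms:
            events.append(event)
    return events


if __name__ == "__main__":
    # The two sample messages from this thread, plus an unrelated line.
    sample = [
        "GC for ConcurrentMarkSweep: 5908 ms for 1 collections, "
        "1986282520 used; max is 8375238656",
        "GC for ParNew: 230 ms for 1 collections, "
        "4418571760 used; max is 8375238656",
        "some unrelated log line",
    ]
    for event in long_pauses(sample):
        print(event["collector"], event["pause_ms"], "ms")
```

To run it against a real log, pass an open file instead of the sample list, e.g. long_pauses(open(path)), where path is wherever system.log lives on your nodes. Comparing the output between an affected and an unaffected node should show whether the long ConcurrentMarkSweep collections line up with the spikes.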
Is my best shot, then, to play with the JVM settings and try to tune garbage collection?

On Thu, Sep 10, 2015 at 6:52 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr> wrote:

> Hi Roman,
> If it affects only a subset of nodes and it's always the same ones, it
> could be a "problem" with your data model: maybe some (too) wide rows on
> these nodes.
> If one of your rows is too wide, the deserialisation of the column index
> of this row can take a lot of resources (disk, RAM, and CPU).
> If you are using leveled compaction strategy and you see abnormally big
> sstables on those nodes, it could be a clue.
> Regards,
> Samuel
>
> Robert Wille <rwi...@fold3.com> wrote on 10/09/2015 15:27:41:
>
> > From: Robert Wille <rwi...@fold3.com>
> > To: "user@cassandra.apache.org" <user@cassandra.apache.org>,
> > Date: 10/09/2015 15:30
> > Subject: Re: High CPU usage on some of nodes
> >
> > It sounds like it's probably GC. Grep for GC in system.log to verify.
> > If it is GC, there are a myriad of issues that could cause it, but
> > at least you've narrowed it down.
> >
> > On Sep 9, 2015, at 11:05 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:
> >
> > > Hey guys,
> > >
> > > We've been having issues in the past couple of days with CPU usage
> > > / load average suddenly skyrocketing on some nodes of the cluster,
> > > affecting performance significantly so that the majority of requests
> > > start timing out. It can go on for several hours, with CPU spiking
> > > through the roof, then coming back down to normal, and so on. Weirdly,
> > > it affects only a subset of nodes and it's always the same ones. The
> > > boxes Cassandra is running on are pretty beefy, 24 cores, and these
> > > CPU spikes go up to >1000%.
> > >
> > > What is the best way to debug this kind of issue and find out
> > > what Cassandra is doing during spikes like this? It doesn't seem to be
> > > compaction related, as sometimes during these spikes "nodetool
> > > compactionstats" says no compactions are running.
> > >
> > > Thanks!