Thanks a lot for taking the time to check the log. We just switched from 400M to 1600M NEW size in cassandra-env.sh. It reduced our latency and the ParNew GC time per second significantly (described here: http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads ).
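For reference, the relevant part of our cassandra-env.sh now looks roughly like this (a sketch, not a verbatim copy of our file; the 8G max heap is inferred from the "max is 8422162432" figure in the GC log line you quoted):

    # cassandra-env.sh (sketch of the settings described above)
    MAX_HEAP_SIZE="8G"      # ~8422162432 bytes, matching the GC log line
    HEAP_NEWSIZE="1600M"    # raised from the previous 400M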
Even when we had 400M, the restart was behaving this way. We stop the node using:

nodetool disablegossip && nodetool disablethrift && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 && service cassandra stop

2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jlacefi...@datastax.com>:

> There are several long ParNew pauses that were recorded during startup.
> The young gen size looks large too, if I am reading that line correctly.
> Did you happen to overwrite the default settings for MAX_HEAP and/or NEW
> size in cassandra-env.sh? The large young gen size, set via the env.sh
> file, could be causing longer than typical pauses, which could make your
> node appear to be unresponsive and show high CPU (CPU from the ParNew GC
> event).
>
> Check out this one - INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
> collections, 1256307568 used; max is 8422162432
> That is a 2-second GC pause, which is very high for ParNew. We typically
> want a lot of tiny ParNew events as opposed to large, less frequent
> ParNew events.
>
> One other thing we noticed was that the node had a lot of log segment
> replays during startup. You could avoid, or at least minimize, these by
> performing a flush or drain before stopping and starting Cassandra. This
> will flush memtables and clear your log segments.
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
> <http://www.linkedin.com/in/jlacefield>
> <http://www.datastax.com/cassandrasummit14>
>
> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> A simple restart of a node, with no changes, gives this result.
>>
>> Logs output: https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>
>> Here are some screenshots:
>>
>> - htop from a node immediately after restarting
>> - opscenter ring view (shows CPU load on all nodes)
>> - opscenter dashboard showing the impact of a restart on latency (it
>> can affect writes or reads, depending; the reaction seems to be quite
>> random)
>>
>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jlacefi...@datastax.com>:
>>
>>> Hello
>>>
>>> Have you checked the log file to see what's happening during startup?
>>> What caused the rolling restart? Did you perform an upgrade or change
>>> a config?
>>>
>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>> >
>>> > Hi guys,
>>> >
>>> > Using 1.2.11, when I try to rolling-restart the cluster, any node I
>>> restart makes the whole cluster's CPU load increase, reaching a "red"
>>> state in opscenter (load goes from 3-4 to 20+). This happens once the
>>> node is back online.
>>> >
>>> > The restarted node uses 100% CPU for 5-10 min and sometimes drops
>>> mutations.
>>> >
>>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>>> doesn't seem to help that much.
>>> >
>>> > Disks are not the bottleneck. ParNew GC increases a bit, but nothing
>>> problematic, I think.
>>> >
>>> > Basically, what could be happening on node restart? What is taking
>>> that much CPU on every machine? There is no steal or iowait.
>>> >
>>> > What can I try to tune?
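PS: For anyone reading this thread later, here is the same stop sequence written out as a small commented script. The commands are exactly the ones from the one-liner above; the comments on what each step does are my understanding (drain's flush/clear behavior per Jonathan's note), so treat it as a sketch:

    #!/bin/sh
    # Graceful Cassandra stop - same commands as the one-liner above
    nodetool disablegossip    # stop gossiping with the rest of the ring
    nodetool disablethrift    # stop accepting Thrift client connections
    nodetool disablebinary    # stop accepting native-protocol (CQL) clients
    sleep 10                  # give in-flight requests time to finish
    nodetool drain            # flush memtables and clear commitlog segments
    sleep 30                  # let the drain settle before stopping
    service cassandra stop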