This last command sequence was considered a best practice a few years ago; I hope that is still the case. I just added the more recent "nodetool disablebinary" part...
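For reference, here it is roughly as a script. This is just a sketch of the same sequence; the service name and the sleep durations are whatever fits your install:

    #!/bin/sh
    # Stop announcing/accepting traffic before shutting down:
    nodetool disablegossip    # stop gossiping, so other nodes mark this one down
    nodetool disablethrift    # stop the Thrift RPC server
    nodetool disablebinary    # stop the native protocol (CQL) server
    sleep 10
    # Flush memtables so the next startup does not replay commit log segments:
    nodetool drain
    sleep 30
    service cassandra stop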
2014-06-18 14:36 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Thanks a lot for taking the time to check the log.
>
> We just switched from 400M to 1600M NEW size in cassandra-env.sh. It
> reduced our latency and the ParNew GC time per second significantly
> (described here:
> http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
> )
>
> Even when we had 400M, the restart behaved this way.
>
> We stop the node using: nodetool disablegossip && nodetool disablethrift
> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
> service cassandra stop
>
> 2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jlacefi...@datastax.com>:
>
>> There are several long ParNew pauses recorded during startup. The young
>> gen size looks large too, if I am reading that line correctly. Did you
>> happen to override the default settings for MAX_HEAP and/or NEW size in
>> cassandra-env.sh? A large young gen size, set via the env.sh file, could
>> be causing longer-than-typical pauses, which could make your node appear
>> unresponsive and show high CPU (the CPU being spent in the ParNew GC
>> events).
>>
>> Check out this one: INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
>> collections, 1256307568 used; max is 8422162432
>> That is a 2-second GC pause, which is very high for ParNew. We typically
>> want lots of tiny ParNew events rather than large, less frequent ones.
>>
>> One other thing that was noticed is that the node had a lot of log
>> segment replays during startup. You could avoid, or minimize, these by
>> performing a flush or drain before stopping and starting Cassandra. This
>> will flush memtables and clear your log segments.
>>
>> Jonathan Lacefield
>> Solutions Architect, DataStax
>>
>> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>> wrote:
>>
>>> A simple restart of a node, with no changes, gives this result.
>>>
>>> Logs output: https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>>
>>> Here are some screenshots:
>>>
>>> - htop from a node immediately after restarting
>>> - OpsCenter ring view (shows the CPU load on all nodes)
>>> - OpsCenter dashboard showing the impact of a restart on latency (it
>>>   can affect writes or reads, it depends; the reaction seems to be
>>>   quite random)
>>>
>>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jlacefi...@datastax.com>:
>>>
>>>> Hello,
>>>>
>>>> Have you checked the log file to see what's happening during startup?
>>>> What caused the rolling restart? Did you perform an upgrade or change
>>>> a config?
>>>>
>>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > Hi guys,
>>>> >
>>>> > Using 1.2.11, when I do a rolling restart of the cluster, any node I
>>>> > restart makes the whole cluster's CPU load increase, reaching a
>>>> > "red" state in OpsCenter (load going from 3-4 to 20+). This happens
>>>> > once the node is back online.
>>>> >
>>>> > The restarted node uses 100% CPU for 5-10 minutes and sometimes
>>>> > drops mutations.
>>>> >
>>>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>>>> > doesn't seem to help that much.
>>>> >
>>>> > Disks are not the bottleneck. ParNew GC time increases a bit, but
>>>> > nothing problematic, I think.
>>>> >
>>>> > Basically, what could be happening on node restart? What is taking
>>>> > that much CPU on every machine? There is no steal or iowait.
>>>> >
>>>> > What can I try to tune?
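In case it helps anyone reading the archive later: the two knobs discussed above live in the following places. The values shown are just the ones mentioned in this thread, not recommendations, and the max heap is only a guess from the "max is 8422162432" log line; also, my understanding is that if you set one of the heap values in cassandra-env.sh you should set both.

In cassandra-env.sh:

    MAX_HEAP_SIZE="8G"       # guessed from the GC log line above
    HEAP_NEWSIZE="1600M"     # the young gen ("NEW") size discussed above

In cassandra.yaml (hinted handoff throttle, default 1024), if I have the name right:

    hinted_handoff_throttle_in_kb: 256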