Thanks a lot for taking the time to check the log.

We just switched the NEW size from 400M to 1600M in cassandra-env.sh. It
reduced our latency and the ParNew GC time per second significantly (the
approach is described here:
http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
).
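
For reference, the change boils down to these two lines in cassandra-env.sh
(a sketch; the MAX_HEAP_SIZE value is illustrative, not necessarily our
exact setting):

    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="1600M"   # previously "400M"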

Even when we had 400M, restarts behaved this way.

We stop the node using:

    nodetool disablegossip && nodetool disablethrift \
      && nodetool disablebinary && sleep 10 \
      && nodetool drain && sleep 30 && service cassandra stop
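
Step by step, the intent is (a commented sketch of the same sequence; the
sleep durations are simply the values we settled on):

    # stop participating in gossip so peers mark the node down cleanly
    nodetool disablegossip
    # stop accepting client connections (Thrift and the native protocol)
    nodetool disablethrift
    nodetool disablebinary
    sleep 10
    # flush memtables and stop accepting writes, so there should be
    # no commit log replay on the next start
    nodetool drain
    sleep 30
    service cassandra stop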


2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jlacefi...@datastax.com>:

> There are several long ParNew pauses that were recorded during startup.
>  The young gen size looks large too, if I am reading that line correctly.
>  Did you happen to override the default settings for MAX_HEAP and/or NEW
> size in cassandra-env.sh?  A large young gen size, set via the env.sh
> file, could be causing longer-than-typical pauses, which could make your
> node appear unresponsive and show high CPU (the CPU is consumed by the
> ParNew GC event).
>
> Check out this one - INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
> collections, 1256307568 used; max is 8422162432
> That is a 2-second GC pause, which is very high for ParNew.  We typically
> want many tiny ParNew events rather than large, less frequent ones.
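>
> A quick way to spot these pauses in the log (the path shown is the
> default location; adjust for your install):
>
>     grep "GC for ParNew" /var/log/cassandra/system.log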
>
> One other thing that was noticed was that the node had a lot of commit
> log segment replays during startup.  You could avoid these, or minimize
> them, by performing a flush or drain before stopping and starting
> Cassandra.  This will flush memtables and clear your log segments.
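>
> For example, either of these before stopping the service (standard
> nodetool subcommands, run against the local node):
>
>     nodetool flush   # write memtables to SSTables so segments can be recycled
>     nodetool drain   # flush and stop accepting writes; safe to stop afterwards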
>
>
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
>  <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> A simple restart of a node, with no changes, gives this result.
>>
>> logs output : https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>
>> Here are some screenshots:
>>
>> - htop from a node immediately after restarting
>> - OpsCenter ring view (shows CPU load on all nodes)
>> - OpsCenter dashboard showing the impact of a restart on latency (it can
>> affect writes or reads; which one seems to be fairly random)
>>
>>
>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jlacefi...@datastax.com>:
>>
>>> Hello
>>>
>>>   Have you checked the log file to see what's happening during startup?
>>> What caused the rolling restart?  Did you perform an upgrade or change
>>> a config?
>>>
>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>> >
>>> > Hi guys
>>> >
>>> > Using 1.2.11, when I try a rolling restart of the cluster, any node I
>>> restart makes the whole cluster's CPU load increase, reaching a "red"
>>> state in OpsCenter (load goes from 3-4 to 20+). This happens once the
>>> node is back online.
>>> >
>>> > The restarted node uses 100% CPU for 5-10 min and sometimes drops
>>> mutations.
>>> >
>>> > I have tried throttling handoff to 256 (instead of 1024), yet it
>>> doesn't seem to help that much.
>>> >
>>> > Disks are not the bottleneck. ParNew GC increases a bit, but nothing
>>> problematic, I think.
>>> >
>>> > Basically, what could be happening on node restart? What is taking
>>> that much CPU on every machine? There is no steal or iowait.
>>> >
>>> > What can I try to tune ?
>>> >
>>>
>>
>>
>
