I faced something similar in the past, and the reason for nodes becoming 
unresponsive intermittently was long GC pauses. That's why I wanted to bring 
this to your attention in case a GC pause is a potential cause.

Sent from my iPhone

> On Jul 22, 2015, at 4:32 PM, Bryan Cheng <br...@blockcypher.com> wrote:
> 
> Aiman,
> 
> Your post made me look back at our data a bit. The most recent occurrence of 
> this incident was not preceded by any abnormal GC activity; however, the 
> previous occurrence (which took place a few days ago) did correspond to a 
> massive, order-of-magnitude increase in both ParNew and CMS collection times 
> which lasted ~17 hours.
> 
> Was there something in particular that links GC to these stalls? So far we 
> have not been able to identify a cause for either that GC spike or the 
> subsequent apparent compaction stall, although the episode did not seem to 
> have any effect on our usage of the cluster.
> 
>> On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>> Hi Aiman,
>> 
>> We previously had issues with GC, but since upgrading to 2.1.7 things seem a 
>> lot healthier.
>> 
>> We collect GC statistics through collectd via the garbage collector mbean. 
>> ParNew GCs report sub-500ms collection times on average (accumulated per 
>> minute, I believe), and CMS peaks at about 300ms collection time when it 
>> runs.
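>> 
>> In case it's useful, those numbers come from the standard 
>> GarbageCollectorMXBean API; collectd presumably reads the same beans over 
>> Cassandra's JMX port. A minimal sketch of a poller that reports per-interval 
>> deltas the way our accumulated-per-minute graphs do (an illustration, not 
>> our actual collectd config, and it reads its own JVM's beans rather than 
>> attaching to Cassandra):
>> 
>>     import java.lang.management.GarbageCollectorMXBean;
>>     import java.lang.management.ManagementFactory;
>>     import java.util.List;
>> 
>>     public class GcPoller {
>>         public static void main(String[] args) throws InterruptedException {
>>             List<GarbageCollectorMXBean> gcs =
>>                     ManagementFactory.getGarbageCollectorMXBeans();
>>             long[] prevTime = new long[gcs.size()];
>>             long[] prevCount = new long[gcs.size()];
>>             while (true) {
>>                 for (int i = 0; i < gcs.size(); i++) {
>>                     GarbageCollectorMXBean gc = gcs.get(i);
>>                     long time = gc.getCollectionTime();   // cumulative ms since JVM start
>>                     long count = gc.getCollectionCount(); // cumulative collection count
>>                     System.out.printf("%s: %d collections, %d ms in the last interval%n",
>>                             gc.getName(), count - prevCount[i], time - prevTime[i]);
>>                     prevTime[i] = time;
>>                     prevCount[i] = count;
>>                 }
>>                 Thread.sleep(60_000); // one-minute reporting interval
>>             }
>>         }
>>     }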
>> 
>>> On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <ai...@flipagram.com> wrote:
>>> Hi Bryan,
>>> How's GC behaving on these boxes?
>>> 
>>>> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>>>> Hi there,
>>>> 
>>>> Within our Cassandra cluster, we occasionally observe one or two nodes at 
>>>> a time becoming partially unresponsive.
>>>> 
>>>> We're running 2.1.7 across the entire cluster.
>>>> 
>>>> nodetool still reports the node as healthy, and it does respond to some 
>>>> local queries; however, the CPU is pegged at 100%. One common thread (heh) 
>>>> each time this happens is that there always seems to be one or more 
>>>> compaction threads running (per nodetool tpstats), and some appear to be 
>>>> stuck (the active count doesn't change and the pending count doesn't 
>>>> decrease). A request for compactionstats hangs with no response.
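>>>> 
>>>> (Side note: when nodetool hangs, reading the same thread-pool counters 
>>>> directly over JMX may still work. A rough sketch follows; the MBean names 
>>>> are an assumption about what 2.1's metrics layer exposes and are worth 
>>>> verifying in jconsole, and localhost:7199 assumes the default JMX port.)
>>>> 
>>>>     import javax.management.MBeanServerConnection;
>>>>     import javax.management.ObjectName;
>>>>     import javax.management.remote.JMXConnector;
>>>>     import javax.management.remote.JMXConnectorFactory;
>>>>     import javax.management.remote.JMXServiceURL;
>>>> 
>>>>     public class CompactionPoolProbe {
>>>>         public static void main(String[] args) throws Exception {
>>>>             // Host/port of the stuck node; 7199 is Cassandra's default JMX port.
>>>>             JMXServiceURL url = new JMXServiceURL(
>>>>                     "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
>>>>             try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
>>>>                 MBeanServerConnection mbs = connector.getMBeanServerConnection();
>>>>                 // MBean names are assumed, not verified -- double-check in jconsole.
>>>>                 ObjectName active = new ObjectName(
>>>>                         "org.apache.cassandra.metrics:type=ThreadPools,path=internal,"
>>>>                         + "scope=CompactionExecutor,name=ActiveTasks");
>>>>                 ObjectName pending = new ObjectName(
>>>>                         "org.apache.cassandra.metrics:type=ThreadPools,path=internal,"
>>>>                         + "scope=CompactionExecutor,name=PendingTasks");
>>>>                 System.out.println("CompactionExecutor active:  "
>>>>                         + mbs.getAttribute(active, "Value"));
>>>>                 System.out.println("CompactionExecutor pending: "
>>>>                         + mbs.getAttribute(pending, "Value"));
>>>>             }
>>>>         }
>>>>     }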
>>>> 
>>>> Each time we've seen this, the only thing that appears to resolve the 
>>>> issue is a restart of the Cassandra process; the restart does not appear 
>>>> to be clean, and requires one or more attempts (or a kill -9 on occasion).
>>>> 
>>>> There does not seem to be any pattern to which machines are affected; the 
>>>> nodes thus far have been different instances on different physical 
>>>> machines and on different racks.
>>>> 
>>>> Has anyone seen this before? Alternatively, when this happens again, what 
>>>> data can we collect that would help with the debugging process (in 
>>>> addition to tpstats)?
>>>> 
>>>> Thanks in advance,
>>>> 
>>>> Bryan
>>> 
>>> 
>>> 
>>> -- 
>>> Aiman Parvaiz
>>> Lead Systems Architect
>>> ai...@flipagram.com
>>> cell: 213-300-6377
>>> http://flipagram.com/apz
> 
