I faced something similar in the past, and the reason for nodes becoming intermittently unresponsive was long GC pauses. That's why I wanted to bring this to your attention, in case a GC pause is a potential cause here as well.
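If it helps to cross-check against the collectd numbers Bryan mentioned, here is a rough sketch of reading the same ParNew/CMS collection times straight from the GarbageCollector MBeans over JMX (this assumes Cassandra's default JMX port 7199 on localhost with no authentication; adjust the host/port for your setup):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPoll {
    public static void main(String[] args) throws Exception {
        // Assumes the default Cassandra JMX endpoint; no authentication configured.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();

            // One MBean per collector, e.g. ParNew and ConcurrentMarkSweep on 2.1.x defaults.
            Set<ObjectName> gcNames = conn.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null);

            for (ObjectName name : gcNames) {
                GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                        conn, name.getCanonicalName(), GarbageCollectorMXBean.class);
                // Counters are cumulative since JVM start.
                System.out.printf("%-25s collections=%d totalTime=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }
}

Since the counters are cumulative, sampling twice and taking the delta gives collection time per interval, which should be roughly comparable to what collectd is accumulating per minute.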
Sent from my iPhone

> On Jul 22, 2015, at 4:32 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>
> Aiman,
>
> Your post made me look back at our data a bit. The most recent occurrence of
> this incident was not preceded by any abnormal GC activity; however, the
> previous occurrence (which took place a few days ago) did correspond to a
> massive, order-of-magnitude increase in both ParNew and CMS collection times,
> which lasted ~17 hours.
>
> Was there something in particular that links GC to these stalls? At this
> point in time, we cannot identify any particular reason for either that GC
> spike or the subsequent apparent compaction stall, although it did not seem
> to have any effect on our usage of the cluster.
>
>> On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>> Hi Aiman,
>>
>> We previously had issues with GC, but since upgrading to 2.1.7 things seem a
>> lot healthier.
>>
>> We collect GC statistics through collectd via the garbage collector MBean;
>> ParNew GCs report sub-500ms collection time on average (I believe
>> accumulated per minute?) and CMS peaks at about 300ms collection time when
>> it runs.
>>
>>> On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <ai...@flipagram.com> wrote:
>>> Hi Bryan,
>>> How's GC behaving on these boxes?
>>>
>>>> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>>>> Hi there,
>>>>
>>>> Within our Cassandra cluster, we're observing, on occasion, one or two
>>>> nodes at a time becoming partially unresponsive.
>>>>
>>>> We're running 2.1.7 across the entire cluster.
>>>>
>>>> nodetool still reports the node as being healthy, and it does respond to
>>>> some local queries; however, the CPU is pegged at 100%. One common thread
>>>> (heh) each time this happens is that there always seems to be one or more
>>>> compaction threads running (via nodetool tpstats), and some appear to be
>>>> stuck (active count doesn't change, pending count doesn't decrease). A
>>>> request for compactionstats hangs with no response.
>>>>
>>>> Each time we've seen this, the only thing that appears to resolve the
>>>> issue is a restart of the Cassandra process; the restart does not appear
>>>> to be clean, and requires one or more attempts (or a -9 on occasion).
>>>>
>>>> There does not seem to be any pattern to which machines are affected; the
>>>> nodes thus far have been different instances on different physical
>>>> machines and on different racks.
>>>>
>>>> Has anyone seen this before? Alternatively, when this happens again, what
>>>> data can we collect that would help with the debugging process (in
>>>> addition to tpstats)?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Bryan
>>>
>>>
>>> --
>>> Aiman Parvaiz
>>> Lead Systems Architect
>>> ai...@flipagram.com
>>> cell: 213-300-6377
>>> http://flipagram.com/apz
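Regarding Bryan's question above about what else to collect next time this happens: since compactionstats hangs and the CPU is pegged, a per-thread CPU and stack snapshot (jstack, or the JMX equivalent below) can show which CompactionExecutor threads are spinning. A rough sketch along the same lines, with the same assumptions about JMX host/port and no authentication:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BusyThreads {
    public static void main(String[] args) throws Exception {
        // Assumes Cassandra's default JMX port 7199 on localhost, no authentication.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // Dump name, state, CPU time and the top few frames for every live thread,
            // so busy or stuck compaction threads stand out even when
            // `nodetool compactionstats` does not respond.
            for (long id : threads.getAllThreadIds()) {
                ThreadInfo info = threads.getThreadInfo(id, 5);
                if (info == null) continue; // thread exited between calls
                long cpuMs = threads.getThreadCpuTime(id) / 1_000_000L; // -1 if CPU timing disabled
                System.out.printf("%-50s state=%-13s cpu=%dms%n",
                        info.getThreadName(), info.getThreadState(), cpuMs);
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }
}

Running it a couple of times a minute or so apart and diffing the CPU numbers usually makes the offending threads obvious, and the captured stacks would be useful to attach if this turns into a JIRA ticket.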