Aiman,

Your post made me look back at our data a bit. The most recent occurrence of this incident was not preceded by any abnormal GC activity; however, the previous occurrence (which took place a few days ago) did coincide with a massive, order-of-magnitude increase in both ParNew and CMS collection times that lasted ~17 hours.
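For anyone who wants to pull the same numbers: the ParNew and CMS collection times we track come from the standard java.lang:type=GarbageCollector MBeans, which is what collectd reads on our side. A minimal sketch of polling them directly over JMX, assuming Cassandra's default JMX port 7199 on localhost and no JMX authentication (adjust for your setup), would look roughly like this:

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class GcPoller {
        public static void main(String[] args) throws Exception {
            // Assumes Cassandra's default JMX endpoint; host/port are illustrative.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // One MBean per collector, e.g. "ParNew" and "ConcurrentMarkSweep".
                Set<ObjectName> collectors = mbs.queryNames(
                        new ObjectName("java.lang:type=GarbageCollector,name=*"), null);
                for (ObjectName gc : collectors) {
                    long count = (Long) mbs.getAttribute(gc, "CollectionCount");
                    long timeMs = (Long) mbs.getAttribute(gc, "CollectionTime");
                    System.out.printf("%s: collections=%d, total time=%dms%n",
                            gc.getKeyProperty("name"), count, timeMs);
                }
            }
        }
    }

Note that CollectionTime is cumulative since JVM start, so per-interval figures like the ones quoted below are just the difference between successive samples.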
Was there something in particular that links GC to these stalls? At this point we cannot identify any particular reason for either that GC spike or the subsequent apparent compaction stall, although it did not seem to have any effect on our use of the cluster.

On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi Aiman,
>
> We previously had issues with GC, but since upgrading to 2.1.7 things seem
> a lot healthier.
>
> We collect GC statistics through collectd via the garbage collector MBean;
> ParNew GCs report sub-500ms collection times on average (I believe
> accumulated per minute?) and CMS peaks at about 300ms collection time when
> it runs.
>
> On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <ai...@flipagram.com>
> wrote:
>
>> Hi Bryan,
>> How's GC behaving on these boxes?
>>
>> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com>
>> wrote:
>>
>>> Hi there,
>>>
>>> Within our Cassandra cluster, we're observing, on occasion, one or two
>>> nodes at a time becoming partially unresponsive.
>>>
>>> We're running 2.1.7 across the entire cluster.
>>>
>>> nodetool still reports the node as healthy, and it does respond to
>>> some local queries; however, the CPU is pegged at 100%. One common thread
>>> (heh) each time this happens is that there always seems to be one or more
>>> compaction threads running (via nodetool tpstats), and some appear to be
>>> stuck (the active count doesn't change, the pending count doesn't decrease).
>>> A request for compactionstats hangs with no response.
>>>
>>> Each time we've seen this, the only thing that appears to resolve the
>>> issue is a restart of the Cassandra process; the restart does not appear to
>>> be clean, and requires one or more attempts (or a -9 on occasion).
>>>
>>> There does not seem to be any pattern to which machines are affected; the
>>> nodes thus far have been different instances on different physical machines
>>> and on different racks.
>>>
>>> Has anyone seen this before? Alternatively, when this happens again,
>>> what data can we collect that would help with the debugging process (in
>>> addition to tpstats)?
>>>
>>> Thanks in advance,
>>>
>>> Bryan
>>>
>>
>>
>> --
>> *Aiman Parvaiz*
>> Lead Systems Architect
>> ai...@flipagram.com
>> cell: 213-300-6377
>> http://flipagram.com/apz
>>
>
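PS: on the question buried in the quoted thread about what to collect next time (beyond tpstats): one thing we haven't wired up yet, but which should be cheap, is grabbing a thread dump over the same JMX endpoint while the node is wedged, since compaction threads pegging a core should show up clearly in their stacks. A sketch under the same assumptions as above (default port 7199, no auth); the "CompactionExecutor" filter matches how Cassandra names its compaction threads:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class StuckThreadDump {
        public static void main(String[] args) throws Exception {
            // Same assumption as the GC sketch: default Cassandra JMX port 7199, no auth.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                        mbs, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                // Dump all threads with lock/monitor info, then keep only the
                // compaction pool ("CompactionExecutor:<n>" in Cassandra).
                for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                    if (info.getThreadName().startsWith("CompactionExecutor")) {
                        System.out.println(info); // thread state plus a partial stack trace
                    }
                }
            }
        }
    }

jstack against the Cassandra PID gets you the same information, of course; the JMX route is just easier to script from a box that already has monitoring access.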