Hi Bryan,

How's GC behaving on these boxes?

On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi there,
>
> Within our Cassandra cluster, we're observing, on occasion, one or two
> nodes at a time becoming partially unresponsive.
>
> We're running 2.1.7 across the entire cluster.
>
> nodetool still reports the node as being healthy, and it does respond to
> some local queries; however, the CPU is pegged at 100%. One common thread
> (heh) each time this happens is that there always seems to be one or more
> compaction threads running (via nodetool tpstats), and some appear to be
> stuck (active count doesn't change, pending count doesn't decrease). A
> request for compactionstats hangs with no response.
>
> Each time we've seen this, the only thing that appears to resolve the
> issue is a restart of the Cassandra process; the restart does not appear to
> be clean, and requires one or more attempts (or a -9 on occasion).
>
> There does not seem to be any pattern to which machines are affected; the
> nodes thus far have been different instances on different physical machines
> and on different racks.
>
> Has anyone seen this before? Alternatively, when this happens again, what
> data can we collect that would help with the debugging process (in addition
> to tpstats)?
>
> Thanks in advance,
>
> Bryan
>

--
*Aiman Parvaiz*
Lead Systems Architect
ai...@flipagram.com
cell: 213-300-6377
http://flipagram.com/apz