Hi there,

Within our Cassandra cluster, we're occasionally seeing one or two nodes at a time become partially unresponsive.
We're running 2.1.7 across the entire cluster. nodetool still reports the affected node as healthy, and it does respond to some local queries; however, the CPU is pegged at 100%.

One common thread (heh) each time this happens is that there always seems to be one or more compaction threads running (via nodetool tpstats), and some appear to be stuck: the active count doesn't change and the pending count doesn't decrease. A request for compactionstats hangs with no response.

Each time we've seen this, the only thing that resolves the issue is a restart of the Cassandra process; the restart does not appear to be clean, and requires one or more attempts (or a kill -9 on occasion).

There does not seem to be any pattern to which machines are affected; the nodes thus far have been different instances on different physical machines and on different racks.

Has anyone seen this before? Alternatively, when this happens again, what data can we collect that would help with the debugging process (in addition to tpstats)?

Thanks in advance,
Bryan
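For what it's worth, here's a rough snapshot script we've been thinking of running the next time a node wedges, before restarting it. It's only a sketch: the output directory, the choice of nodetool subcommands, and the CASSANDRA_PID lookup are our assumptions, and each collector is capped with a timeout so a hung nodetool (like the hanging compactionstats we saw) doesn't stall the whole script.

```shell
#!/bin/sh
# Sketch: grab a diagnostic snapshot from a wedged Cassandra node.
# Assumes nodetool, jstack, pgrep, and coreutils `timeout` are on PATH;
# all names and paths here are illustrative, not a blessed procedure.
OUT="cassandra-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

snap() {
  name="$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    # Cap each collector at 30s so one hung command doesn't hang the script.
    timeout 30 "$@" > "$OUT/$name.txt" 2>&1 \
      || echo "FAILED OR TIMED OUT" >> "$OUT/$name.txt"
  else
    echo "$1 not found on this host" > "$OUT/$name.txt"
  fi
}

snap tpstats         nodetool tpstats
snap compactionstats nodetool compactionstats
snap netstats        nodetool netstats

# A JVM thread dump should show exactly what the stuck compaction
# threads are doing (blocked, spinning, waiting on a lock, etc.).
CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
[ -n "$CASSANDRA_PID" ] && snap jstack jstack "$CASSANDRA_PID"

echo "collected into $OUT"
```

The idea is that each command's output (or its failure) lands in its own file under a timestamped directory, so there's something to attach to a bug report even if half the collectors hang.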