Hi there,

Within our Cassandra cluster, we're occasionally seeing one or two nodes at a time become partially unresponsive.
We're running 2.1.7 across the entire cluster. nodetool still reports the affected node as healthy, and it does respond to some local queries; however, the CPU is pegged at 100%.

One common thread (heh) each time this happens is that there always seems to be one or more compaction threads running (via nodetool tpstats), and some appear to be stuck: the active count doesn't change and the pending count doesn't decrease. A request for compactionstats hangs with no response.

Each time we've seen this, the only thing that resolves the issue is a restart of the Cassandra process; the restart does not appear to be clean, and requires one or more attempts (or a kill -9 on occasion).

There does not seem to be any pattern to which machines are affected; the nodes thus far have been different instances on different physical machines and on different racks.

Has anyone seen this before? Alternatively, when this happens again, what data can we collect that would help with the debugging process (in addition to tpstats)?

Thanks in advance,
Bryan
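For what it's worth, here's a rough snapshot script we've been thinking of running the next time a node wedges, before restarting it. It's only a sketch: the output directory, the choice of nodetool subcommands, and the CASSANDRA_PID lookup are our assumptions, and each collector is capped with a timeout so a hung nodetool (like the hanging compactionstats we saw) doesn't stall the whole script.

```shell
#!/bin/sh
# Sketch: grab a diagnostic snapshot from a wedged Cassandra node.
# Assumes nodetool, jstack, pgrep, and coreutils `timeout` are on PATH;
# all names and paths here are illustrative, not a blessed procedure.
OUT="cassandra-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

snap() {
  name="$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    # Cap each collector at 30s so one hung command doesn't hang the script.
    timeout 30 "$@" > "$OUT/$name.txt" 2>&1 \
      || echo "FAILED OR TIMED OUT" >> "$OUT/$name.txt"
  else
    echo "$1 not found on this host" > "$OUT/$name.txt"
  fi
}

snap tpstats         nodetool tpstats
snap compactionstats nodetool compactionstats
snap netstats        nodetool netstats

# A JVM thread dump should show exactly what the stuck compaction
# threads are doing (blocked, spinning, waiting on a lock, etc.).
CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
[ -n "$CASSANDRA_PID" ] && snap jstack jstack "$CASSANDRA_PID"

echo "collected into $OUT"
```

The idea is that each command's output (or its failure) lands in its own file under a timestamped directory, so there's something to attach to a bug report even if half the collectors hang.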