I am currently running a 4 node cluster on Cassandra beta 2. Yesterday I ran into a number of problems and one of my nodes went down for a few hours. I ran a nodetool repair and, at least at the data level, everything seems to be consistent and correct. The problem is that the node is still chewing up 100% of its available CPU, 20 hours after I started the repair. Load averages are 8-9, which is crazy given that it is a single-core EC2 m1.small.
Besides sitting at 100% CPU, the node on which I ran the repair seems to be fine. The Cassandra logs appear normal. Based on bandwidth patterns between nodes, they do not seem to be transferring any repair-related data (as they did initially), and no pending tasks are shown for any of the services when inspecting via JMX. I have a reasonable amount of data in the cluster (~6 GB x 2 replication factor) but nothing crazy.

The last repair-related entry in the logs is as follows:

INFO [Thread-145] 2010-10-22 00:24:10,561 AntiEntropyService.java (line 828) #<TreeRequest manual-repair-23dacf4b-4076-4460-abd5-a713bfd090e2, /10.192.227.6, (kikmetrics,PacketEventsByPacket)> completed successfully: 14 outstanding.

Any idea what is going on? Could the CPU usage STILL be related to the repair? Is there any way to check? I hesitate to simply kill the node, given the "14 outstanding" log message and because doing so has caused me problems in the past when using beta versions.

Dan Hendry
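P.S. In case it helps, here is a rough sketch of the kind of JMX check I mean: it connects to the node, dumps any PendingTasks/ActiveCount attributes exposed by the Cassandra MBeans, and prints per-thread CPU times so you can see whether the anti-entropy/compaction threads are the ones burning CPU. The host, port, class name, and attribute names here are only illustrative assumptions and may need adjusting for your node and version.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NodeBusyCheck {
    public static void main(String[] args) throws Exception {
        // Assumption: remote JMX is reachable on this host/port; adjust for your config.
        String host = args.length > 0 ? args[0] : "10.192.227.6";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Dump PendingTasks/ActiveCount from every Cassandra MBean that exposes them,
            // rather than hard-coding stage names (these vary between versions).
            Set<ObjectName> names =
                    mbs.queryNames(new ObjectName("org.apache.cassandra.*:*"), null);
            for (ObjectName name : names) {
                try {
                    Object pending = mbs.getAttribute(name, "PendingTasks");
                    Object active = mbs.getAttribute(name, "ActiveCount");
                    System.out.printf("%-70s pending=%s active=%s%n", name, pending, active);
                } catch (Exception e) {
                    // This MBean does not expose those attributes; skip it.
                }
            }

            // Per-thread CPU time: if anti-entropy/compaction threads dominate,
            // the load is probably still repair related; otherwise something else is spinning.
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbs, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            if (threads.isThreadCpuTimeSupported()) {
                for (long id : threads.getAllThreadIds()) {
                    ThreadInfo info = threads.getThreadInfo(id);
                    long cpuNanos = threads.getThreadCpuTime(id);
                    if (info != null && cpuNanos > 0) {
                        System.out.printf("%-50s cpu=%.1fs%n",
                                info.getThreadName(), cpuNanos / 1e9);
                    }
                }
            }
        } finally {
            connector.close();
        }
    }
}

Running this a couple of times a few minutes apart against the busy node should show whether the repair/validation threads are the ones whose CPU time keeps climbing while everything else stays idle.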