I am currently running a 4 node cluster on Cassandra beta 2. Yesterday I ran into a number of problems and one of my nodes went down for a few hours. I ran a nodetool repair and, at least at the data level, everything seems to be consistent and correct. The problem is that the node is still chewing up 100% of its available CPU, 20 hours after I started the repair. Load averages are 8-9, which is crazy given that it is a single-core EC2 m1.small.
Besides sitting at 100% CPU, the node on which I ran the repair seems to be fine. The Cassandra logs appear normal. Based on bandwidth patterns between nodes, they do not seem to be transferring any repair-related data (as they did initially), and no pending tasks are shown for any of the services when inspecting via JMX. I have a reasonable amount of data in the cluster (~6 GB x 2 replication factor) but nothing crazy.

The last repair-related entry in the logs is as follows:

INFO [Thread-145] 2010-10-22 00:24:10,561 AntiEntropyService.java (line 828) #<TreeRequest manual-repair-23dacf4b-4076-4460-abd5-a713bfd090e2, /10.192.227.6, (kikmetrics,PacketEventsByPacket)> completed successfully: 14 outstanding.

Any idea what is going on? Could the CPU usage STILL be related to the repair? Is there any way to check? I hesitate to simply kill the node, given the "14 outstanding" log message and because doing so has caused me problems in the past when using beta versions.

Dan Hendry
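P.S. In case it helps, here is a rough sketch of the kind of JMX check I mean: it connects to the node, dumps any PendingTasks/ActiveCount attributes exposed by the Cassandra MBeans, and prints per-thread CPU times so you can see whether the anti-entropy/compaction threads are the ones burning CPU. The host, port, class name, and attribute names here are only illustrative assumptions and may need adjusting for your node and version.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NodeBusyCheck {
    public static void main(String[] args) throws Exception {
        // Assumption: remote JMX is reachable on this host/port; adjust for your config.
        String host = args.length > 0 ? args[0] : "10.192.227.6";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Dump PendingTasks/ActiveCount from every Cassandra MBean that exposes them,
            // rather than hard-coding stage names (these vary between versions).
            Set<ObjectName> names =
                    mbs.queryNames(new ObjectName("org.apache.cassandra.*:*"), null);
            for (ObjectName name : names) {
                try {
                    Object pending = mbs.getAttribute(name, "PendingTasks");
                    Object active = mbs.getAttribute(name, "ActiveCount");
                    System.out.printf("%-70s pending=%s active=%s%n", name, pending, active);
                } catch (Exception e) {
                    // This MBean does not expose those attributes; skip it.
                }
            }

            // Per-thread CPU time: if anti-entropy/compaction threads dominate,
            // the load is probably still repair related; otherwise something else is spinning.
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbs, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            if (threads.isThreadCpuTimeSupported()) {
                for (long id : threads.getAllThreadIds()) {
                    ThreadInfo info = threads.getThreadInfo(id);
                    long cpuNanos = threads.getThreadCpuTime(id);
                    if (info != null && cpuNanos > 0) {
                        System.out.printf("%-50s cpu=%.1fs%n",
                                info.getThreadName(), cpuNanos / 1e9);
                    }
                }
            }
        } finally {
            connector.close();
        }
    }
}

Running this a couple of times a few minutes apart against the busy node should show whether the repair/validation threads are the ones whose CPU time keeps climbing while everything else stays idle.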