That's a thread waiting for other threads / activities to complete. Nothing unusual there.
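To illustrate why that parked thread is normal on its own, here is a minimal sketch of the same wait pattern. It is not Cassandra's code: the class LatchDemo and the method awaitResponses are made up for illustration. The only thing taken from the trace is that the repair thread sits in CountDownLatch.await. The sketch adds a timeout so it can show the "a response never arrives" case without hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchDemo {
    /**
     * Waits for `expected` responses while only `arriving` of them are delivered.
     * Returns true if every expected response arrived before the timeout.
     * (Hypothetical helper, for illustration only.)
     */
    static boolean awaitResponses(int expected, int arriving, long timeoutMillis) {
        CountDownLatch latch = new CountDownLatch(expected);
        for (int i = 0; i < arriving; i++) {
            // Each responder stands in for another node answering a request.
            Thread responder = new Thread(latch::countDown, "responder-" + i);
            responder.setDaemon(true);
            responder.start();
        }
        try {
            // A plain await() with no timeout would stay parked (WAITING)
            // for as long as any one response is missing.
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitResponses(2, 2, 1000)); // all responses arrive
        System.out.println(awaitResponses(2, 1, 200));  // one response is missing
    }
}
```

In the second call the waiter only gives up because of the timeout; without one it would look exactly like the parked thread in the dump, so the interesting question is which response is missing.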
Work out how far the repair gets. Is there a validation compaction listed in nodetool compactionstats? Are there any streams running in nodetool netstats?

Look through the logs on the machine you start the repair on, and follow the messages from the AntiEntropyService. They will say when they send messages to other nodes to build the merkle tree and when they get the response back. You can then check if the other nodes respond.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/08/2011, at 7:02 PM, Boris Yen wrote:

> We tried to dump the stack trace of threads, we noticed that
>
> "manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10 tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00007f5d4acf0f38> (a java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
>         at java.util.concurrent.CountDownLatch.await(Unknown Source)
>         at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)
>
> This seems to be the thread which causes the repair to hang.
>
> We also noticed another odd thing, sometimes we can see lots of [WRITE-/...]
> threads.
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
> Thread [WRITE-/10.2.0.87] (Running)
>
> On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <yulin...@gmail.com> wrote:
> Would CASSANDRA-2433 cause this?
>
> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <yulin...@gmail.com> wrote:
> Hi,
>
> In our testing environment, we have two nodes with RF=2 running 0.8.4. We
> tried to test the repair functions of Cassandra; however, every once in a
> while, "nodetool repair" never returns. We have checked the system.log and
> nothing seems to be out of the ordinary: no errors, no exceptions. The data
> is only 50 MB, and it is consistently updated.
>
> Shutting down one node during the repair process can cause a similar symptom.
> So our original thought was that maybe one of the TreeRequests is not sent to
> the other node correctly, which might cause the repair to run forever.
> However, I did not see any relevant log message to support that. I am kind of
> running out of ideas about this... Does anyone else have this problem?
>
> Regards
> Boris