We dumped the stack traces of the threads and noticed the following:

"manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10 tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f5d4acf0f38> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
        at java.util.concurrent.CountDownLatch.await(Unknown Source)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)

This seems to be the thread that causes the repair to hang: it is parked in CountDownLatch.await() and never wakes up.

We also noticed another odd thing: sometimes we can see lots of [WRITE-/...] threads.

Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)
Thread [WRITE-/10.2.0.87] (Running)

On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <yulin...@gmail.com> wrote:

> Would Cassandra-2433 cause this?
>
>
> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <yulin...@gmail.com> wrote:
>
>> Hi,
>>
>> In our testing environment, we have two nodes with RF=2 running 0.8.4. We
>> tried to test the repair functions of Cassandra; however, every once in a
>> while, "nodetool repair" never returns. We have checked the system.log and
>> nothing seems to be out of the ordinary: no errors, no exceptions. The data
>> is only 50 MB, and it is consistently updated.
>>
>> Shutting down one node during the repair process can cause a similar
>> symptom. So our original thought is that maybe one of the TreeRequests is
>> not sent to the other node correctly, which might cause the repair to run
>> forever. However, I did not see any relevant log message to support that.
>> I am kind of running out of ideas about this... Does anyone else have this
>> problem?
>>
>> Regards
>> Boris
>>
>
>
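
To illustrate the kind of hang we suspect, here is a minimal Java sketch. It is not Cassandra code: the class name, the method awaitResponses, and its parameters are all hypothetical, and we do not know what the latch at AntiEntropyService.java:665 actually waits for. The sketch only shows that a waiter on a CountDownLatch never wakes up if one expected count-down is lost, and that the timed variant of await() lets the caller detect this instead of blocking forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {

    /**
     * Waits for `expected` responses, of which only `delivered` ever arrive.
     * Returns true if every response arrived before the timeout, false otherwise.
     * (Hypothetical helper for illustration only, not part of Cassandra.)
     */
    static boolean awaitResponses(int expected, int delivered, long timeoutMillis)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(expected);

        // Each responder thread stands in for one remote reply being received.
        for (int i = 0; i < delivered; i++) {
            Thread responder = new Thread(latch::countDown, "responder-" + i);
            responder.setDaemon(true);
            responder.start();
        }

        // A plain latch.await() would park here forever when delivered < expected,
        // which matches the WAITING (parking) state in the thread dump.
        // The timed variant returns false once the timeout elapses.
        return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("all responses arrived: " + awaitResponses(2, 2, 1000));
        System.out.println("one response lost:     " + awaitResponses(2, 1, 200));
    }
}
```

In the second call only one of the two expected count-downs happens, so the latch never reaches zero and the timed await() gives up after 200 ms. With an untimed await() the calling thread would stay parked, which is what the manual-repair thread in the dump looks like.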