We dumped the stack traces of the threads and noticed the following:

"manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10
tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]

   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f5d4acf0f38> (a
java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
Source)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown
Source)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown
Source)
        at java.util.concurrent.CountDownLatch.await(Unknown Source)

at
org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)


This appears to be the thread that is causing the repair to hang: it is parked on a CountDownLatch inside RepairSession.run.
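For what it's worth, here is a minimal, self-contained Java sketch (not Cassandra code; the class and variable names are made up) of the pattern the trace shows: the session thread awaits a latch that is only counted down when remote responses arrive, so a single lost response parks it forever in exactly this WAITING (parking) state. A bounded await would at least make the stall visible:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal sketch, not Cassandra code: the latch is counted down once per
// remote response; if one response is lost, a plain await() parks forever.
public class LatchHangSketch {
    public static void main(String[] args) throws InterruptedException {
        int expectedResponses = 2;                  // e.g. one tree per replica
        final CountDownLatch completed = new CountDownLatch(expectedResponses);

        // Simulate only one of the two responses ever arriving.
        new Thread(new Runnable() {
            public void run() { completed.countDown(); }
        }, "response-1").start();

        // completed.await();  // the no-timeout wait seen in the trace above
        // A bounded wait lets the caller detect the stall instead:
        if (!completed.await(5, TimeUnit.SECONDS)) {
            System.err.println("repair stalled: " + completed.getCount()
                    + " response(s) still outstanding");
        }
    }
}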

We also noticed another odd thing: sometimes we can see lots of [WRITE-/...]
threads.

Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)     
Thread [WRITE-/10.2.0.87] (Running)
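In case it helps anyone quantify this, here is a small hypothetical diagnostic (not part of Cassandra) that could be run inside the JVM to count how many WRITE- threads have piled up for the same peer:

// Hypothetical diagnostic snippet, not Cassandra code: counts live threads
// whose names start with "WRITE-" (the outbound connection threads above).
public class WriteThreadCounter {
    public static void main(String[] args) {
        int count = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith("WRITE-")) {
                count++;
            }
        }
        System.out.println("WRITE-* threads: " + count);
    }
}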


On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <yulin...@gmail.com> wrote:

> Would CASSANDRA-2433 cause this?
>
>
> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <yulin...@gmail.com> wrote:
>
>> Hi,
>>
>> In our testing environment, we have two nodes with RF=2 running 0.8.4. We
>> tried to test the repair function of Cassandra; however, every once in a
>> while, "nodetool repair" never returns. We have checked the system.log, and
>> nothing seems to be out of the ordinary: no errors, no exceptions. The data
>> is only 50 MB, and it is constantly updated.
>>
>> Shutting down one node during the repair process can cause a similar
>> symptom. So, our original thought was that maybe one of the TreeRequests is
>> not delivered to the other node correctly, which might cause the repair to
>> run forever. However, I did not see any related log messages to support
>> that. I am kind of running out of ideas about this... Does anyone else have
>> this problem?
>>
>> Regards
>> Boris
>>
>
>
