Re: nodetool repair does not return...

aaron morton Thu, 25 Aug 2011 15:08:07 -0700

That's a thread waiting for other threads / activities to complete. Nothing 
unusual there.
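
As an illustration only (this is not Cassandra's code, and the class and method names are made up for the example), here is a minimal sketch of why a thread parks like that. The waiter blocks on a CountDownLatch until every expected response has counted it down. The trace shows the untimed await(), which never returns if a response goes missing; the sketch uses the timed variant so the demo can finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Cassandra.
public class LatchWaitSketch {
    // Returns true if all expected responses arrived before the timeout,
    // false if the wait timed out or was interrupted.
    static boolean awaitResponses(int expected, int delivered, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(expected);
        for (int i = 0; i < delivered; i++) {
            // Each "responder" stands in for a node answering a request.
            Thread t = new Thread(latch::countDown, "responder-" + i);
            t.setDaemon(true);
            t.start();
        }
        try {
            // An untimed latch.await() here would park indefinitely
            // whenever delivered < expected.
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("all responded: " + awaitResponses(2, 2, 1000));
        System.out.println("one missing:   " + awaitResponses(2, 1, 200));
    }
}
```

So the parked thread by itself is expected; the question is which response it is still waiting for.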


Work out how far the repair gets. Is there a validation compaction listed in 
nodetool compactionstats? Are there any streams running in nodetool netstats? 


Look through the logs on the machine you started the repair on and follow the 
messages from the AntiEntropyService. They will say when requests are sent to 
other nodes to build the merkle tree and when the responses come back. You 
can then check whether the other nodes responded. 
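
If the log is large, something like the following can pull out just those lines. It is a rough sketch that assumes the relevant log lines contain the text "AntiEntropyService" (the class name from the stack trace); the class name RepairLogFilter and the command-line argument are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Cassandra or nodetool.
public class RepairLogFilter {
    // Keep only the lines that mention the marker, in their original order.
    static List<String> linesMentioning(List<String> lines, String marker) {
        return lines.stream()
                .filter(line -> line.contains(marker))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RepairLogFilter <path-to-system.log>");
            return;
        }
        // args[0]: the system.log on the node where the repair was started
        List<String> all = Files.readAllLines(Paths.get(args[0]));
        linesMentioning(all, "AntiEntropyService").forEach(System.out::println);
    }
}
```

Run it against the log on each node and compare the timestamps of the requests and responses.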

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/08/2011, at 7:02 PM, Boris Yen wrote:

> We tried to dump the stack trace of threads, we noticed that
> 
> "manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10 tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00007f5d4acf0f38> (a java.util.concurrent.CountDownLatch$Sync)
>       at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
>       at java.util.concurrent.CountDownLatch.await(Unknown Source)
>       at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)
> 
> This seems to be the thread which causes the repair to hang.
> 
> We also noticed another odd thing: sometimes we can see lots of [WRITE-/...] 
> threads.
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)   
> Thread [WRITE-/10.2.0.87] (Running)
> 
> On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <[email protected]> wrote:
> Would Cassandra-2433 cause this?
> 
> 
> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <[email protected]> wrote:
> Hi,
> 
> In our testing environment, we have two nodes with RF=2 running 0.8.4. We 
> tried to test the repair functions of Cassandra; however, every once in a 
> while, "nodetool repair" never returns. We have checked the system.log and 
> nothing seems out of the ordinary: no errors, no exceptions. The data is only 
> 50 MB, and it is consistently updated.
> 
> Shutting down one node during the repair process can cause a similar symptom. 
> So our original thought was that maybe one of the TreeRequests is not sent to 
> the other node correctly, which might cause the repair to run forever. 
> However, I did not see any relevant log message to support that. I am kind of 
> running out of ideas about this... Does anyone else have this problem?
> 
> Regards
> Boris
> 
>
