What version are you using? Is it a multi-DC setup? Are you seeing a lot of dropped mutations/messages? Are the nodes going up and down repeatedly while the repair is running?
Regards,
</VJ>

On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w...@gmail.com> wrote:

> There are no error messages in my log.
>
> I ended up restarting all the nodes in my cluster. After that I was able
> to run repair successfully on one of the nodes. It took about 40 minutes.
> Feeling lucky, I ran repair on another node and it is stuck again.
>
> tpstats shows 1 active and 1 pending AntiEntropySessions. netstats and
> compactionstats show no activity. I took a close look at the log file; it
> shows that the node requested Merkle trees from 4 nodes (including itself).
> It actually received 3 of those Merkle trees. It looks like it is stuck
> waiting for the last one. I checked the node that the request was sent
> to, and there isn't anything in its log about repair. So it looks like the
> Merkle tree request has gotten lost somehow. It has been 8 hours since the
> repair was issued and it is still stuck. I am going to let it run a bit
> longer to see if it will eventually finish.
>
> I have observed that if I restart all the nodes, I am able to run
> repair successfully on a single node. I have done that twice already. But
> after that all repairs hang. Since we are supposed to run repair
> periodically, having to restart all nodes before running repair on each
> node isn't really viable for us.
>
> Bill
>
>
> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> When you look in the logs please let me know if you see this error…
>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>
>> I look at nodetool compactionstats (for the Merkle tree phase), nodetool
>> netstats for the streaming, and this to check for streaming progress:
>>
>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done
>>
>> Or use Data Stax Ops Centre where possible
>> http://www.datastax.com/products/opscenter
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>
>> Check the log files for warnings or errors. They may indicate why your
>> repair failed.
>>
>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>>
>>> I restarted the nodes and then restarted the repair. It is still
>>> hanging like before. Do I keep repeating until the repair actually finish?
>>>
>>> Bill
>>>
>>>
>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>>
>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>> > I know repair may take a long time to run. I am running repair on a node
>>>> > with about 15 GB of data and it is taking more than 24 hours. Is that
>>>> > normal? Is there any way to get status of the repair? tpstats does show 2
>>>> > active and 2 pending AntiEntropySessions. But netstats and compactionstats
>>>> > show no activity.
>>>>
>>>> As indicated by various recent threads to this effect, many versions
>>>> of cassandra (including current 1.0.x release) contain bugs which
>>>> sometimes prevent repair from completing. The other threads suggest
>>>> that some of these bugs result in the state you are in now, where you
>>>> do not see anything that looks like appropriate activity.
>>>>
>>>> Unfortunately the only solution offered on these other threads is the
>>>> one I will now offer, which is to restart the participating nodes and
>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>> want to file one. :)
>>>>
>>>> =Rob
>>>>
>>>> --
>>>> =Robert Coli
>>>> AIM&GTALK - rc...@palominodb.com
>>>> YAHOO - rcoli.palominob
>>>> SKYPE - rcoli_palominodb
>>>>
>>
>> --
>> Ben Coverston
>> DataStax -- The Apache Cassandra Company
>>
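
The status check used throughout this thread (reading the AntiEntropySessions row of tpstats) could be scripted roughly as below. This is only a sketch under some assumptions: the function names repair_sessions and watch_repair, the default host, and the 60-second interval are made up for illustration, and the parsing assumes the tpstats row has the pool name followed by the Active and Pending columns. It only reports session counts; it cannot tell a slow repair from a hung one.

```shell
# Read `nodetool tpstats` text on stdin and print "<active> <pending>" for
# the AntiEntropySessions pool. Prints "0 0" when the row is absent
# (assumption: an absent row means there is no repair session; note that
# empty input, e.g. from a failed nodetool call, also yields "0 0").
repair_sessions() {
  awk '$1 == "AntiEntropySessions" { print $2, $3; found = 1; exit }
       END { if (!found) print "0 0" }'
}

# Poll tpstats on a node until no repair sessions are active or pending.
# Both arguments are hypothetical defaults: host (localhost) and polling
# interval in seconds (60).
watch_repair() {
  host=${1:-localhost}
  interval=${2:-60}
  while true; do
    stats=$(nodetool -h "$host" tpstats | repair_sessions)
    active=${stats% *}    # text before the space
    pending=${stats#* }   # text after the space
    echo "$(date) active=$active pending=$pending"
    if [ "$active" -eq 0 ] && [ "$pending" -eq 0 ]; then
      break
    fi
    sleep "$interval"
  done
}
```

After sourcing the functions, something like `watch_repair localhost 60` would print one timestamped line per minute. If the counts stay at 1 active and 1 pending for hours while netstats and compactionstats show nothing, that matches the stuck state described above.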