I am running 1.0.8. There are two data centers with 8 machines in each DC. All nodes stay up while the repair is running. No dropped Mutations/Messages. I do see HintedHandoff messages.
Bill

On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2...@gmail.com> wrote:

> What is the version you are using? Is it a multi-DC setup? Are you seeing a
> lot of dropped Mutations/Messages? Are the nodes going up and down all the
> time while the repair is running?
>
> Regards,
> </VJ>
>
> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w...@gmail.com> wrote:
>
>> There are no error messages in my log.
>>
>> I ended up restarting all the nodes in my cluster. After that I was able
>> to run repair successfully on one of the nodes. It took about 40 minutes.
>> Feeling lucky, I ran repair on another node and it is stuck again.
>>
>> tpstats shows 1 active and 1 pending AntiEntropySessions. netstats and
>> compactionstats show no activity. I took a close look at the log file; it
>> shows that the node requested Merkle trees from 4 nodes (including itself).
>> It actually received 3 of those Merkle trees. It looks like it is stuck
>> waiting for the last one. I checked the node that the request was sent
>> to, and there isn't anything in its log about repair. So it looks like the
>> Merkle tree request got lost somehow. It has been 8 hours since the
>> repair was issued and it is still stuck. I am going to let it run a bit
>> longer to see if it will eventually finish.
>>
>> I have observed that if I restart all the nodes, I am able to run
>> repair successfully on a single node. I have done that twice already. But
>> after that all repairs hang. Since we are supposed to run repair
>> periodically, having to restart all nodes before running repair on each
>> node isn't really viable for us.
>>
>> Bill
>>
>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> When you look in the logs please let me know if you see this error…
>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>
>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>> nodetool netstats for the streaming, and this to check for streaming
>>> progress:
>>>
>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done
>>>
>>> Or use DataStax OpsCenter where possible
>>> http://www.datastax.com/products/opscenter
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>
>>> Check the log files for warnings or errors. They may indicate why your
>>> repair failed.
>>>
>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>
>>>> I restarted the nodes and then restarted the repair. It is still
>>>> hanging like before. Do I keep repeating until the repair actually finishes?
>>>>
>>>> Bill
>>>>
>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>>>
>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>>> > I know repair may take a long time to run. I am running repair on a node
>>>>> > with about 15 GB of data and it is taking more than 24 hours. Is that
>>>>> > normal? Is there any way to get status of the repair? tpstats does show 2
>>>>> > active and 2 pending AntiEntropySessions. But netstats and compactionstats
>>>>> > show no activity.
>>>>>
>>>>> As indicated by various recent threads to this effect, many versions
>>>>> of cassandra (including current 1.0.x release) contain bugs which
>>>>> sometimes prevent repair from completing.
>>>>> The other threads suggest
>>>>> that some of these bugs result in the state you are in now, where you
>>>>> do not see anything that looks like appropriate activity.
>>>>> Unfortunately the only solution offered on these other threads is the
>>>>> one I will now offer, which is to restart the participating nodes and
>>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>>> want to file one. :)
>>>>>
>>>>> =Rob
>>>>>
>>>>> --
>>>>> =Robert Coli
>>>>> AIM&GTALK - rc...@palominodb.com
>>>>> YAHOO - rcoli.palominob
>>>>> SKYPE - rcoli_palominodb
>>>
>>> --
>>> Ben Coverston
>>> DataStax -- The Apache Cassandra Company
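
For anyone following this thread who wants to watch the AntiEntropySessions counts mentioned above without rereading the full tpstats output, here is a minimal sketch. It is not from any of the posters. The function name repair_sessions is hypothetical, and it assumes that tpstats prints the pool name in the first column, Active in the second, and Pending in the third, which may differ between Cassandra versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nodetool tpstats` output on stdin and prints
# "<active> <pending>" for the AntiEntropySessions pool.
# Assumption: column 1 is the pool name, column 2 is Active, column 3 is Pending.
repair_sessions() {
    awk '$1 == "AntiEntropySessions" { print $2, $3; found = 1 }
         END { if (!found) print "0 0" }'   # pool not listed: report zeros
}

# Example polling loop, left commented out so that sourcing this file
# does not start it:
# while true; do
#     date
#     nodetool -h localhost tpstats | repair_sessions
#     sleep 60
# done
```

If the two numbers stay non-zero for hours while netstats and compactionstats show nothing, that matches the stuck state described in this thread. The script only reports the counts; it does not diagnose or fix the hang.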