Re: getting status of long running repair

Bill Au Tue, 08 May 2012 14:05:42 -0700

There are no error messages in my log.

I ended up restarting all the nodes in my cluster.  After that I was able
to run repair successfully on one of the nodes.  It took about 40 minutes.
Feeling lucky, I ran repair on another node, and it is stuck again.


tpstats shows 1 active and 1 pending AntiEntropySessions.  netstats and
compactionstats show no activity.  I took a close look at the log file; it
shows that the node requested merkle trees from 4 nodes (including itself).
It actually received 3 of those merkle trees.  It looks like it is stuck
waiting for the last one.  I checked the node that the request was sent
to, and there isn't anything in its log about repair.  So it looks like the
merkle tree request has gotten lost somehow.  It has been 8 hours since the
repair was issued and it is still stuck.  I am going to let it run a bit
longer to see if it will eventually finish.
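
For anyone who wants to script the tpstats check described above, here is a
minimal sketch.  It assumes that the pool name is the first column of the
`nodetool tpstats` output and that the Active and Pending counts follow it.
That matches what I see, but the layout may differ between versions.  The
function names pool_counts and watch_repair are made up for illustration.

```shell
# Print "<active> <pending>" for one thread pool, reading `nodetool tpstats`
# output on stdin. Assumes the pool name is the first column, followed by
# the Active and Pending counts. Prints nothing if the pool is not listed.
pool_counts() {
    awk -v pool="$1" '$1 == pool { print $2, $3 }'
}

# Poll a node until its repair sessions drain (hypothetical helper).
# $1 = host (default localhost), $2 = seconds between polls (default 60).
watch_repair() {
    host=${1:-localhost}
    interval=${2:-60}
    while true; do
        counts=$(nodetool -h "$host" tpstats | pool_counts AntiEntropySessions)
        if [ -z "$counts" ]; then
            # No matching row: nodetool failed or the pool is not listed.
            echo "$(date) no AntiEntropySessions row in tpstats output" >&2
            return 1
        fi
        active=${counts% *}
        pending=${counts#* }
        echo "$(date) AntiEntropySessions active=$active pending=$pending"
        if [ "$active" -eq 0 ] && [ "$pending" -eq 0 ]; then
            return 0
        fi
        sleep "$interval"
    done
}
```

Running `watch_repair localhost 60` would print one timestamped line per
minute.  Note that this only reports the counts; it cannot tell a stuck
session from a slow one, so netstats, compactionstats and the logs still
need to be checked by hand.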

I have observed that if I restart all the nodes, I am able to run
repair successfully on a single node.  I have done that twice already.  But
after that, every subsequent repair hangs.  Since we are supposed to run
repair periodically, having to restart all nodes before running repair on
each node isn't really viable for us.

Bill

On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:

> When you look in the logs please let me know if you see this error…
> https://issues.apache.org/jira/browse/CASSANDRA-4223
>
> I look at nodetool compactionstats (for the Merkle tree phase),  nodetool
> netstats for the streaming, and this to check for streaming progress:
>
> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 &&
> nodetool -h localhost netstats); done
>
> Or use DataStax OpsCenter where possible
> http://www.datastax.com/products/opscenter
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>
> Check the log files for warnings or errors. They may indicate why your
> repair failed.
>
> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>
>> I restarted the nodes and then restarted the repair.  It is still hanging
>> like before.  Do I keep repeating until the repair actually finishes?
>>
>> Bill
>>
>>
>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>
>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>> > I know repair may take a long time to run.  I am running repair on a node
>>> > with about 15 GB of data and it is taking more than 24 hours.  Is that
>>> > normal?  Is there any way to get status of the repair?  tpstats does show 2
>>> > active and 2 pending AntiEntropySessions.  But netstats and compactionstats
>>> > show no activity.
>>>
>>> As indicated by various recent threads to this effect, many versions
>>> of cassandra (including current 1.0.x release) contain bugs which
>>> sometimes prevent repair from completing. The other threads suggest
>>> that some of these bugs result in the state you are in now, where you
>>> do not see anything that looks like appropriate activity.
>>> Unfortunately the only solution offered on these other threads is the
>>> one I will now offer, which is to restart the participating nodes and
>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>> bugs (which doesn't mean they don't exist, of course) so you might
>>> want to file one. :)
>>>
>>> =Rob
>>>
>>> --
>>> =Robert Coli
>>> AIM&GTALK - rc...@palominodb.com
>>> YAHOO - rcoli.palominob
>>> SKYPE - rcoli_palominodb
>>>
>>
>>
>
>
> --
> Ben Coverston
> DataStax -- The Apache Cassandra Company
>
>
>
