Re: getting status of long running repair

Vijay Tue, 08 May 2012 20:16:17 -0700

Which version are you using? Is it a multi-DC setup? Are you seeing a
lot of dropped mutations/messages? Are the nodes going up and down all the
time while the repair is running?


Regards,
</VJ>



On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w...@gmail.com> wrote:

> There are no error messages in my log.
>
> I ended up restarting all the nodes in my cluster.  After that I was able
> to run repair successfully on one of the nodes.  It took about 40 minutes.
> Feeling lucky, I ran repair on another node and it is stuck again.
>
> tpstats shows 1 active and 1 pending AntiEntropySessions.  netstats and
> compactionstats show no activity.  I took a close look at the log file; it
> shows that the node requested Merkle trees from 4 nodes (including itself).
> It actually received 3 of those Merkle trees.  It looks like it is stuck
> waiting for the last one.  I checked the node that the request was sent
> to; there isn't anything in its log about repair.  So it looks like the
> Merkle tree request got lost somehow.  It has been 8 hours since the
> repair was issued and it is still stuck.  I am going to let it run a bit
> longer to see if it will eventually finish.
>
> I have observed that if I restart all the nodes, I would be able to run
> repair successfully on a single node.  I have done that twice already.  But
> after that all repairs will hang.  Since we are supposed to run repair
> periodically, having to restart all nodes before running repair on each
> node isn't really viable for us.
>
> Bill
>
>
> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> When you look in the logs please let me know if you see this error…
>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>
>> I look at nodetool compactionstats (for the Merkle tree phase),  nodetool
>> netstats for the streaming, and this to check for streaming progress:
>>
>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 &&
>> nodetool -h localhost netstats); done
>>
>> Or use DataStax OpsCenter where possible
>> http://www.datastax.com/products/opscenter
>>
>> Cheers
>>
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>
>> Check the log files for warnings or errors. They may indicate why your
>> repair failed.
>>
>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>>
>>> I restarted the nodes and then restarted the repair.  It is still
>>> hanging like before.  Do I keep repeating until the repair actually finishes?
>>>
>>> Bill
>>>
>>>
>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>>
>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>> > I know repair may take a long time to run.  I am running repair on a
>>>> > node with about 15 GB of data and it is taking more than 24 hours.  Is
>>>> > that normal?  Is there any way to get status of the repair?  tpstats
>>>> > does show 2 active and 2 pending AntiEntropySessions.  But netstats
>>>> > and compactionstats show no activity.
>>>>
>>>> As indicated by various recent threads to this effect, many versions
>>>> of Cassandra (including the current 1.0.x release) contain bugs which
>>>> sometimes prevent repair from completing. The other threads suggest
>>>> that some of these bugs result in the state you are in now, where you
>>>> do not see anything that looks like appropriate activity.
>>>> Unfortunately the only solution offered on these other threads is the
>>>> one I will now offer, which is to restart the participating nodes and
>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>> want to file one. :)
>>>>
>>>> =Rob
>>>>
>>>> --
>>>> =Robert Coli
>>>> AIM&GTALK - rc...@palominodb.com
>>>> YAHOO - rcoli.palominob
>>>> SKYPE - rcoli_palominodb
>>>>
>>>
>>>
>>
>>
>> --
>> Ben Coverston
>> DataStax -- The Apache Cassandra Company
>>
>>
>>
>
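
The status checks discussed in this thread (watching the AntiEntropySessions
row of nodetool tpstats over time) could be scripted roughly as below. This
is only a sketch under stated assumptions: the helper names repair_sessions
and watch_repair are made up for illustration, and the parsing assumes the
tpstats row has the pool name in the first column followed by the Active and
Pending counts, as in the output quoted above. Check the column layout on
your own version before relying on it.

```shell
#!/bin/sh
# Sketch only: repair_sessions and watch_repair are hypothetical helper
# names, not part of nodetool or of any message in this thread.

# Read `nodetool tpstats` output on stdin and print "<active> <pending>"
# for the AntiEntropySessions pool, or "0 0" if that pool is not listed.
# Assumes columns are: pool name, Active, Pending, ...
repair_sessions() {
  awk '$1 == "AntiEntropySessions" { print $2, $3; found = 1 }
       END { if (!found) print "0 0" }'
}

# Log the session counts with a timestamp until none are active or pending.
# Arguments: host (default localhost), polling interval in seconds (default 60).
watch_repair() {
  host="${1:-localhost}"
  interval="${2:-60}"
  while true; do
    # Stop if nodetool itself fails, rather than misreporting "0 0".
    out=$(nodetool -h "$host" tpstats) || { echo "nodetool failed" >&2; return 1; }
    counts=$(printf '%s\n' "$out" | repair_sessions)
    active=${counts% *}
    pending=${counts#* }
    echo "$(date) active=$active pending=$pending"
    if [ "$active" -eq 0 ] && [ "$pending" -eq 0 ]; then
      break
    fi
    sleep "$interval"
  done
}
```

Calling "watch_repair localhost 300" would print one timestamped line every
five minutes. Counts that never change for hours, with no netstats or
compactionstats activity, match the stuck state described above. Counts
dropping to zero only mean no session is registered any more, not that the
repair succeeded.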
