Re: getting status of long running repair

Bill Au Wed, 09 May 2012 05:50:15 -0700

I am running 1.0.8.  Two data centers with 8 machines in each DC.  Nodes are
all up while the repair is running.  No dropped Mutations/Messages.  I do
see HintedHandoff messages.


Bill

On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2...@gmail.com> wrote:

> What version are you using? Is it a multi-DC setup? Are you seeing a
> lot of dropped Mutations/Messages? Are the nodes going up and down all the
> time while the repair is running?
>
> Regards,
> </VJ>
>
>
>
>
> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w...@gmail.com> wrote:
>
>> There are no error messages in my log.
>>
>> I ended up restarting all the nodes in my cluster.  After that I was able
>> to run repair successfully on one of the nodes.  It took about 40 minutes.
>> Feeling lucky, I ran repair on another node, and it is stuck again.
>>
>> tpstats shows 1 active and 1 pending AntiEntropySessions.  netstats and
>> compactionstats show no activity.  I took a close look at the log file; it
>> shows that the node requested Merkle trees from 4 nodes (including itself)
>> and actually received 3 of them.  It looks like it is stuck waiting for
>> that last one.  I checked the node the request was sent to, and there
>> isn't anything in its log about repair.  So it looks like the Merkle tree
>> request got lost somehow.  It has been 8 hours since the repair was
>> issued and it is still stuck.  I am going to let it run a bit longer to
>> see if it will eventually finish.
>>
>> I have observed that if I restart all the nodes, I am able to run
>> repair successfully on a single node.  I have done that twice already.  But
>> after that, all repairs hang.  Since we are supposed to run repair
>> periodically, having to restart all nodes before running repair on each
>> node isn't really viable for us.
>>
>> Bill
>>
>>
>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> When you look in the logs please let me know if you see this error…
>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>
>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>> nodetool netstats (for the streaming phase), and this loop to check for
>>> streaming progress:
>>>
>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done
>>>
>>> Or use DataStax OpsCenter where possible:
>>> http://www.datastax.com/products/opscenter
>>>
>>> Cheers
>>>
>>>
>>>   -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>
>>> Check the log files for warnings or errors. They may indicate why your
>>> repair failed.
>>>
>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>
>>>> I restarted the nodes and then restarted the repair.  It is still
>>>> hanging like before.  Do I keep repeating this until the repair actually
>>>> finishes?
>>>>
>>>> Bill
>>>>
>>>>
>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>>>
>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>>> > I know repair may take a long time to run.  I am running repair on a node
>>>>> > with about 15 GB of data and it is taking more than 24 hours.  Is that
>>>>> > normal?  Is there any way to get status of the repair?  tpstats does show 2
>>>>> > active and 2 pending AntiEntropySessions.  But netstats and compactionstats
>>>>> > show no activity.
>>>>>
>>>>> As indicated by various recent threads to this effect, many versions
>>>>> of Cassandra (including the current 1.0.x release) contain bugs which
>>>>> sometimes prevent repair from completing. The other threads suggest
>>>>> that some of these bugs result in the state you are in now, where you
>>>>> do not see anything that looks like appropriate activity.
>>>>> Unfortunately the only solution offered on these other threads is the
>>>>> one I will now offer, which is to restart the participating nodes and
>>>>> restart the repair. I am unaware of any JIRA tickets tracking these
>>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>>> want to file one. :)
>>>>>
>>>>> =Rob
>>>>>
>>>>> --
>>>>> =Robert Coli
>>>>> AIM&GTALK - rc...@palominodb.com
>>>>> YAHOO - rcoli.palominob
>>>>> SKYPE - rcoli_palominodb
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ben Coverston
>>> DataStax -- The Apache Cassandra Company
>>>
>>>
>>>
>>
>

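The checks discussed in this thread (tpstats for AntiEntropySessions, compactionstats for the Merkle tree phase, netstats for streaming) can be combined into one polling loop. What follows is only a minimal sketch, not something posted by anyone in the thread. It assumes nodetool is on the PATH and that tpstats prints the pool name in the first column followed by the Active and Pending counts. The names HOST, INTERVAL, repair_sessions and watch_repair are made up for illustration.

```shell
#!/bin/sh
# Sketch only: HOST, INTERVAL and both function names are illustrative.
HOST="${HOST:-localhost}"
INTERVAL="${INTERVAL:-60}"

# Read tpstats output on stdin and print "active pending" for the
# AntiEntropySessions pool. Assumes column 1 is the pool name and
# columns 2 and 3 are the Active and Pending counts. Prints "0 0"
# when the pool is not listed.
repair_sessions() {
    awk '$1 == "AntiEntropySessions" { print $2, $3; found = 1 }
         END { if (!found) print "0 0" }'
}

# Poll until no repair session is active or pending, printing a
# timestamp plus compaction and streaming status on every pass.
watch_repair() {
    while true; do
        counts=$(nodetool -h "$HOST" tpstats | repair_sessions)
        active=${counts% *}
        pending=${counts#* }
        echo "$(date) AntiEntropySessions active=$active pending=$pending"
        if [ "$active" -eq 0 ] && [ "$pending" -eq 0 ]; then
            break
        fi
        nodetool -h "$HOST" compactionstats   # Merkle tree phase
        nodetool -h "$HOST" netstats          # streaming phase
        sleep "$INTERVAL"
    done
}
```

After sourcing the file, run watch_repair on the node where the repair was started. The loop cannot tell a slow repair from a hung one; it only timestamps each check, so a session count that stays nonzero for hours while compactionstats and netstats stay idle matches the stuck state described above.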