Re: Repair Process Taking too long

Sylvain Lebresne Thu, 12 Apr 2012 07:59:49 -0700

On Thu, Apr 12, 2012 at 4:06 PM, Frank Ng <buzzt...@gmail.com> wrote:
> I also noticed that if I use the -pr option, the repair process went down
> from 30 hours to 9 hours.  Is the -pr option safe to use if I want to run
> repair processes in parallel on nodes that are not replication peers?


There are pretty much two use cases for repair:
1) to rebuild a node: if, say, a node has lost some data due to a hard
drive corruption or the like and you want to rebuild what's missing
2) the periodic repairs to avoid problems with deleted data coming back
from the dead (basically:
http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair)

In case 1) you want to run 'nodetool repair' (without -pr) against the
node to rebuild.
In case 2) (which I suspect is the case you're talking about now), you
*want* to use 'nodetool repair -pr' on *every* node of the cluster, i.e.
that's the most efficient way to do it. The only reason not to use -pr
in this case would be that it's not available because you're using an
old version of Cassandra. And yes, it is safe to run with -pr in
parallel on nodes that are not replication peers.
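
For case 2), here is a minimal sketch of what that could look like as a
shell loop. The host names (node1, node2, node3) and the HOSTS and NODETOOL
variables are placeholders made up for illustration, not anything from your
cluster. By default the script only prints the commands it would run; set
NODETOOL=nodetool to actually execute them. It walks the nodes one at a
time, which is the simplest choice; running nodes that are not replication
peers in parallel would need extra bookkeeping that is not shown here.

```shell
#!/bin/sh
# Sketch: run a primary-range repair against every node, one after another.
# HOSTS is a hypothetical space-separated node list; replace it with your own.
# NODETOOL defaults to "echo nodetool" (dry run: commands are only printed).
HOSTS="${HOSTS:-node1 node2 node3}"
NODETOOL="${NODETOOL:-echo nodetool}"

repair_primary_ranges() {
    for host in $HOSTS; do
        # -pr repairs only the range this node is primary for, so every
        # node has to be visited for the whole ring to be covered.
        $NODETOOL -h "$host" repair -pr || return 1
    done
}

repair_primary_ranges
```

The loop stops at the first failing repair so that a problem on one node is
noticed before moving on to the next.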

--
Sylvain


>
> thanks
>
>
> On Thu, Apr 12, 2012 at 12:06 AM, Frank Ng <berryt...@gmail.com> wrote:
>>
>> Thank you for confirming that the per node data size is most likely
>> causing the long repair process.  I have tried a repair on smaller column
>> families and it was significantly faster.
>>
>> On Wed, Apr 11, 2012 at 9:55 PM, aaron morton <aa...@thelastpickle.com>
>> wrote:
>>>
>>> If you have 1TB of data it will take a long time to repair. Every bit of
>>> data has to be read and a hash generated. This is one of the reasons we
>>> often suggest that around 300 to 400 GB per node is a good load in the
>>> general case.
>>>
>>> Look at nodetool compactionstats. Is there a validation compaction
>>> running? If so, it is still building the Merkle hash tree.
>>>
>>> Look at nodetool netstats. Is it streaming data? If so, all hash trees
>>> have been calculated.
>>>
>>> Cheers
>>>
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 12/04/2012, at 2:16 AM, Frank Ng wrote:
>>>
>>> Can you expand further on your issue? Were you using Random Partitioner?
>>>
>>> thanks
>>>
>>> On Tue, Apr 10, 2012 at 5:35 PM, David Leimbach <leim...@gmail.com>
>>> wrote:
>>>>
>>>> I had this happen when I had really poorly generated tokens for the
>>>> ring.  Cassandra seems to accept numbers that are too big.  You get hot
>>>> spots when you think you should be balanced and repair never ends (I think
>>>> there is a 48 hour timeout).
>>>>
>>>>
>>>> On Tuesday, April 10, 2012, Frank Ng wrote:
>>>>>
>>>>> I am not using size-tiered compaction.
>>>>>
>>>>>
>>>>> On Tue, Apr 10, 2012 at 12:56 PM, Jonathan Rhone <rh...@tinyco.com>
>>>>> wrote:
>>>>>>
>>>>>> Data size, number of nodes, RF?
>>>>>>
>>>>>> Are you using size-tiered compaction on any of the column families
>>>>>> that hold a lot of your data?
>>>>>>
>>>>>> Do your cassandra logs say you are streaming a lot of ranges?
>>>>>> zgrep -E "(Performing streaming repair|out of sync)"
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 10, 2012 at 9:45 AM, Igor <i...@4friends.od.ua> wrote:
>>>>>>>
>>>>>>> On 04/10/2012 07:16 PM, Frank Ng wrote:
>>>>>>>
>>>>>>> Short answer - yes.
>>>>>>> But you are asking the wrong question.
>>>>>>>
>>>>>>>
>>>>>>> I think both processes are taking a while.  When it starts up,
>>>>>>> netstats and compactionstats show nothing.  Is anyone out there
>>>>>>> successfully using ext3 with repair processes faster than this?
>>>>>>>
>>>>>>> On Tue, Apr 10, 2012 at 10:42 AM, Igor <i...@4friends.od.ua> wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> You can check with nodetool which part of the repair process is
>>>>>>>> slow: network streaming or validation compactions. Use nodetool
>>>>>>>> netstats or nodetool compactionstats.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/10/2012 05:16 PM, Frank Ng wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am on Cassandra 1.0.7.  My repair processes are taking over 30
>>>>>>>>> hours to complete.  Is it normal for the repair process to take this
>>>>>>>>> long?  I wonder if it's because I am using the ext3 file system.
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jonathan Rhone
>>>>>> Software Engineer
>>>>>>
>>>>>> TinyCo
>>>>>> 800 Market St., Fl 6
>>>>>> San Francisco, CA 94102
>>>>>> www.tinyco.com
>>>>>>
>>>>>
>>>
>>>
>>
>
