The log message matches what I would expect to see for nodetool repair -pr. Not using -pr means repairing all the ranges the node is a replica for. If you have RF == number of nodes, then it will repair all the data.
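For reference, roughly (the keyspace name below is just a placeholder):

nodetool -h localhost repair -pr MyKeyspace    # primary range of this node only
nodetool -h localhost repair MyKeyspace        # every range this node is a replica for

With RF equal to the number of nodes in each DC, the second form ends up walking over all of the data on the node.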
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 6/12/2012, at 9:42 PM, Andras Szerdahelyi <andras.szerdahe...@ignitionone.com> wrote:

> Thanks!
>
> i'm also thinking a repair run without -pr could have caused this maybe ?
>
>
> Andras Szerdahelyi
> Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A
> M: +32 493 05 50 88 | Skype: sandrew84
>
>
> On 06 Dec 2012, at 04:05, aaron morton <aa...@thelastpickle.com> wrote:
>
>>> - how do i stop repair before i run out of storage? ( can't let this finish )
>>
>> To stop the validation part of the repair…
>>
>> nodetool -h localhost stop VALIDATION
>>
>> The only way I know to stop streaming is to restart the node; there may be a better way though.
>>
>>> INFO [AntiEntropySessions:3] 2012-12-05 02:15:02,301 AntiEntropyService.java (line 666) [repair #7c7665c0-3eab-11e2-0000-dae6667065ff] new session: will sync /X.X.1.113, /X.X.0.71 on range (85070591730234615865843651857942052964,0] for ( .. )
>>
>> I'm assuming this was run on the first node in DC west with -pr, as you said. The log message is saying this is going to repair the primary range for the node. The repair is then actually performed one CF at a time.
>>
>> You should also see log messages ending with "range(s) out of sync" which will say how out of sync the data is.
>>
>>> - how do i clean up my sstables ( grew from 6k to 20k since this started, while i shut writes off completely )
>>
>> Sounds like repair is streaming a lot of differences.
>> If you have the space I would give Levelled compaction time to take care of it.
>>
>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 6/12/2012, at 1:32 AM, Andras Szerdahelyi <andras.szerdahe...@ignitionone.com> wrote:
>>
>>> hi list,
>>>
>>> AntiEntropyService started syncing ranges of entire nodes ( ?! ) across my data centers and i'd like to understand why.
>>>
>>> I see log lines like this on all my nodes in my two ( east/west ) data centres...
>>>
>>> INFO [AntiEntropySessions:3] 2012-12-05 02:15:02,301 AntiEntropyService.java (line 666) [repair #7c7665c0-3eab-11e2-0000-dae6667065ff] new session: will sync /X.X.1.113, /X.X.0.71 on range (85070591730234615865843651857942052964,0] for ( .. )
>>>
>>> ( this is around 80-100 GB of data for a single node. )
>>>
>>> - i did not observe any network failures or nodes falling off the ring
>>> - good distribution of data ( load is equal on all nodes )
>>> - hinted handoff is on
>>> - read repair chance is 0.1 on the CF
>>> - 2 replicas in each data centre ( which is also the number of nodes in each ) with NetworkTopologyStrategy
>>> - repair -pr is scheduled to run off-peak hours, daily
>>> - leveled compaction with sstable max size 256mb ( i have found this to trigger compaction in acceptable intervals while still keeping the sstable count down )
>>> - i am on 1.1.6
>>> - java heap 10G
>>> - max memtables 2G
>>> - 1G row cache
>>> - 256M key cache
>>>
>>> my nodes' ranges are:
>>>
>>> DC west
>>> 0
>>> 85070591730234615865843651857942052864
>>>
>>> DC east
>>> 100
>>> 85070591730234615865843651857942052964
>>>
>>> symptoms are:
>>> - logs show sstables being streamed over to other nodes
>>> - 140k files in data dir of CF on all nodes
>>> - cfstats reports 20k sstables, up from 6k on all nodes
>>> - compaction continuously running with no results whatsoever ( number of sstables growing )
>>>
>>> i tried the following:
>>> - offline scrub ( has gone OOM, i noticed the script in the debian package specifies 256MB heap? )
>>> - online scrub ( no effect )
>>> - repair ( no effect )
>>> - cleanup ( no effect )
>>>
>>> my questions are:
>>> - how do i stop repair before i run out of storage? ( can't let this finish )
>>> - how do i clean up my sstables ( grew from 6k to 20k since this started, while i shut writes off completely )
>>>
>>> thanks,
>>> Andras
>>>
>>> Andras Szerdahelyi
>>> Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A
>>> M: +32 493 05 50 88 | Skype: sandrew84
>>>
>>