> - how do I stop repair before I run out of storage? (can't let this finish)

To stop the validation part of the repair:

    nodetool -h localhost stop VALIDATION

The only way I know to stop the streaming is to restart the node; there may be a better way, though.
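If you want to check what is actually still running before restarting, something like the following should show it (a rough sketch; assumes nodetool is on the path and JMX is on the default port, so adjust -h as needed):

    # pending and active compactions, including the VALIDATION
    # compactions that repair runs
    nodetool -h localhost compactionstats

    # active streams to and from other nodes
    nodetool -h localhost netstats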
> INFO [AntiEntropySessions:3] 2012-12-05 02:15:02,301 AntiEntropyService.java
> (line 666) [repair #7c7665c0-3eab-11e2-0000-dae6667065ff] new session: will
> sync /X.X.1.113, /X.X.0.71 on range
> (85070591730234615865843651857942052964,0] for ( .. )

I am assuming this was run on the first node in DC west with -pr, as you said. The log message is saying it is going to repair the primary range for that node. The repair is then actually performed one CF at a time. You should also see log messages ending with "range(s) out of sync" which will say how out of sync the data is.

> - how do I clean up my sstables? (grew from 6k to 20k since this started,
> while I shut writes off completely)

Sounds like repair is streaming a lot of differences. If you have the space, I would give levelled compaction time to take care of it.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 6/12/2012, at 1:32 AM, Andras Szerdahelyi <andras.szerdahe...@ignitionone.com> wrote:

> hi list,
>
> AntiEntropyService started syncing ranges of entire nodes (?!) across my
> data centers and I'd like to understand why.
>
> I see log lines like this on all my nodes in my two (east/west) data
> centres...
>
> INFO [AntiEntropySessions:3] 2012-12-05 02:15:02,301 AntiEntropyService.java
> (line 666) [repair #7c7665c0-3eab-11e2-0000-dae6667065ff] new session: will
> sync /X.X.1.113, /X.X.0.71 on range
> (85070591730234615865843651857942052964,0] for ( .. )
>
> (this is around 80-100 GB of data for a single node)
>
> - I did not observe any network failures or nodes falling off the ring
> - good distribution of data (load is equal on all nodes)
> - hinted handoff is on
> - read repair chance is 0.1 on the CF
> - 2 replicas in each data centre (which is also the number of nodes in each)
> with NetworkTopologyStrategy
> - repair -pr is scheduled to run off-peak hours, daily
> - leveled compaction with sstable max size 256MB (I have found this to
> trigger compaction at acceptable intervals while still keeping the sstable
> count down)
> - I am on 1.1.6
> - Java heap 10G
> - max memtables 2G
> - 1G row cache
> - 256M key cache
>
> My nodes' ranges are:
>
> DC west
> 0
> 85070591730234615865843651857942052864
>
> DC east
> 100
> 85070591730234615865843651857942052964
>
> Symptoms are:
> - logs show sstables being streamed over to other nodes
> - 140k files in the data dir of the CF on all nodes
> - cfstats reports 20k sstables, up from 6k, on all nodes
> - compaction continuously running with no results whatsoever (number of
> sstables growing)
>
> I tried the following:
> - offline scrub (went OOM; I noticed the script in the Debian package
> specifies a 256MB heap?)
> - online scrub (no effect)
> - repair (no effect)
> - cleanup (no effect)
>
> My questions are:
> - how do I stop repair before I run out of storage? (can't let this finish)
> - how do I clean up my sstables? (grew from 6k to 20k since this started,
> while I shut writes off completely)
>
> thanks,
> Andras
>
> Andras Szerdahelyi
> Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A
> M: +32 493 05 50 88 | Skype: sandrew84
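P.S. A couple of quick ways to keep an eye on things while compaction settles, as a rough sketch (assumes the default Debian log location; adjust the path and host for your install):

    # see how far out of sync repair thinks the replicas are
    grep "out of sync" /var/log/cassandra/system.log

    # watch the SSTable count come down as levelled compaction catches up
    nodetool -h localhost cfstats | grep "SSTable count"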