Thanks! I'm also thinking a repair run without -pr could have caused this, maybe?
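For my own notes, the difference I mean is roughly this ( the keyspace name is just a placeholder ):

    nodetool -h localhost repair -pr my_keyspace   # repairs only this node's primary range
    nodetool -h localhost repair my_keyspace       # repairs every range this node holds a replica for

Without -pr each node repairs all of the ranges it replicates, so running it on every node repairs the same data several times over and can trigger a lot more streaming.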
Andras Szerdahelyi
Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A
M: +32 493 05 50 88 | Skype: sandrew84

On 06 Dec 2012, at 04:05, aaron morton <aa...@thelastpickle.com<mailto:aa...@thelastpickle.com>> wrote:

- how do I stop repair before I run out of storage? ( can't let this finish )

To stop the validation part of the repair…

nodetool -h localhost stop VALIDATION

The only way I know to stop streaming is restart the node, there may be a better way though.

INFO [AntiEntropySessions:3] 2012-12-05 02:15:02,301 AntiEntropyService.java (line 666) [repair #7c7665c0-3eab-11e2-0000-dae6667065ff] new session: will sync /X.X.1.113, /X.X.0.71 on range (85070591730234615865843651857942052964,0] for ( .. )

Am assuming this was run on the first node in DC west with -pr, as you said. The log message is saying this is going to repair the primary range for the node. The repair is then actually performed one CF at a time. You should also see log messages ending with "range(s) out of sync" which will say how out of sync the data is.

- how do I clean up my sstables ( grew from 6k to 20k since this started, while I shut writes off completely )

Sounds like repair is streaming a lot of differences. If you have the space I would give Levelled compaction time to take care of it.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>
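A quick way to gauge how far out of sync the replicas are and whether streaming is still running ( the log path assumes the Debian package defaults, adjust for your install ):

    grep "out of sync" /var/log/cassandra/system.log
    nodetool -h localhost netstats           # lists active and pending streams
    nodetool -h localhost compactionstats    # lists pending compactions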
On 6/12/2012, at 1:32 AM, Andras Szerdahelyi <andras.szerdahe...@ignitionone.com<mailto:andras.szerdahe...@ignitionone.com>> wrote:

hi list,

AntiEntropyService started syncing ranges of entire nodes ( ?! ) across my data centers and I'd like to understand why. I see log lines like this on all my nodes in my two ( east/west ) data centres...

INFO [AntiEntropySessions:3] 2012-12-05 02:15:02,301 AntiEntropyService.java (line 666) [repair #7c7665c0-3eab-11e2-0000-dae6667065ff] new session: will sync /X.X.1.113, /X.X.0.71 on range (85070591730234615865843651857942052964,0] for ( .. )

( this is around 80-100 GB of data for a single node. )

- I did not observe any network failures or nodes falling off the ring
- good distribution of data ( load is equal on all nodes )
- hinted handoff is on
- read repair chance is 0.1 on the CF
- 2 replicas in each data centre ( which is also the number of nodes in each ) with NetworkTopologyStrategy
- repair -pr is scheduled to run off-peak hours, daily ( cron example at the end of this message )
- leveled compaction with sstable max size 256mb ( I have found this to trigger compaction in acceptable intervals while still keeping the sstable count down )
- I am on 1.1.6
- java heap 10G
- max memtables 2G
- 1G row cache
- 256M key cache

my nodes' ranges are:

DC west: 0, 85070591730234615865843651857942052864
DC east: 100, 85070591730234615865843651857942052964

symptoms are:
- logs show sstables being streamed over to other nodes
- 140k files in data dir of CF on all nodes
- cfstats reports 20k sstables, up from 6k on all nodes
- compaction continuously running with no results whatsoever ( number of sstables growing )

I tried the following:
- offline scrub ( has gone OOM, I noticed the script in the debian package specifies 256MB heap? )
- online scrub ( no effect )
- repair ( no effect )
- cleanup ( no effect )

my questions are:
- how do I stop repair before I run out of storage? ( can't let this finish )
- how do I clean up my sstables ( grew from 6k to 20k since this started, while I shut writes off completely )

thanks,
Andras

Andras Szerdahelyi
Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A
M: +32 493 05 50 88 | Skype: sandrew84
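The off-peak repair schedule mentioned above is just a daily cron entry, something along these lines ( the time, user and keyspace name are illustrative ):

    # /etc/cron.d/cassandra-repair -- primary-range repair, daily at 03:00
    0 3 * * * cassandra nodetool -h localhost repair -pr my_keyspace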