@Michal: all true, a cleanup would certainly remove a lot of useless data there, and I also advise Evan to do it. However, Evan may want to keep repair as a routine operation on his cluster, and there is no reason an RF change should lead to this kind of issue.
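For the record, the cleanup itself is a single nodetool call per node. I would run it one node at a time, and only on nodes with some disk headroom, since cleanup rewrites SSTables. Something like (host and keyspace names are placeholders, adjust to your setup):

    # drop the data this node is no longer a replica for
    nodetool -h <node_address> cleanup <keyspace_name>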
@Evan: With this amount of data, and not being on C* 1.2, you could try tuning your bloom filters to use less memory. For example, disable them while you recover from this issue (bloom_filter_fp_chance = 1.0), then upgrade sstables and retry the repair (see the rough commands at the bottom of this mail). This depends a lot on your needs and your context, but it might work if you can afford it.

By the way, C* prior to 1.2 should not exceed 300-500 GB per node. I read once that C* 1.2 aims to reach 3-5 TB per node. Yet horizontal scaling, using peer-to-peer, is one of the main points of Cassandra. You might want to be careful and scale when needed, so you never reach that much data per node.

As always, experts/committers, please correct me if I am wrong.

Alain

2013/7/4 Michał Michalski <mich...@opera.com>

> I don't think you need to run repair if you decrease RF. At least I
> wouldn't do it.
>
> In case of *decreasing* RF you have 3 nodes containing some data, but
> only 2 of them should store them from now on, so you should rather run
> cleanup instead of repair, to get rid of the data on the 3rd replica. And
> I guess it should work (in terms of disk space and memory), if you've
> been able to perform compaction.
>
> Repair makes sense if you *increase* RF, so the data are streamed to the
> new replicas.
>
> M.
>
> On 04.07.2013 12:20, Evan Dandrea wrote:
>
>> Hi,
>>
>> We've made the mistake of letting our nodes get too large, now holding
>> about 3 TB each. We ran out of enough free space to have a successful
>> compaction, and because we're on 1.0.7, enabling compression to get
>> out of the mess wasn't feasible. We tried adding another node, but we
>> think this may have put too much pressure on the existing ones it was
>> replicating from, so we backed out.
>>
>> So we decided to drop RF down to 2 from 3 to relieve the disk pressure
>> and started building a secondary cluster with lots of 1 TB nodes. We
>> ran repair -pr on each node, but it's failing with a JVM OOM on one
>> node while another node is streaming from it for the final repair.
>>
>> Does anyone know what we can tune to get the cluster stable enough to
>> put it in a multi-DC setup with the secondary cluster? Do we actually
>> need to wait for these RF3->RF2 repairs to stabilize, or could we
>> point it at the secondary cluster without worry of data loss?
>>
>> We've set the heap on these two problematic nodes to 20 GB, up from the
>> equally too high 12 GB, but we're still hitting OOM. I had seen in
>> other threads that tuning down compaction might help, so we're trying
>> the following:
>>
>> in_memory_compaction_limit_in_mb 32 (down from 64)
>> compaction_throughput_mb_per_sec 8 (down from 16)
>> concurrent_compactors 2 (the nodes have 24 cores)
>> flush_largest_memtables_at 0.45 (down from 0.50)
>> stream_throughput_outbound_megabits_per_sec 300 (down from 400)
>> reduce_cache_sizes_at 0.5 (down from 0.6)
>> reduce_cache_capacity_to 0.35 (down from 0.4)
>>
>> -XX:CMSInitiatingOccupancyFraction=30
>>
>> Here's the log from the most recent repair failure:
>>
>> http://paste.ubuntu.com/5843017/
>>
>> The OOM starts at line 13401.
>>
>> Thanks for whatever insight you can provide.
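PS: For what it's worth, the bloom filter change I mention above would look roughly like the following on a pre-1.2 cluster. This is from memory, so please double check the exact syntax and whether your 1.0.7 build actually exposes the per-CF bloom_filter_fp_chance attribute before relying on it; <cf_name>, <keyspace_name> and <node_address> are placeholders.

    # in cassandra-cli, for each column family whose bloom filters you want to drop
    update column family <cf_name> with bloom_filter_fp_chance = 1.0;

    # rewrite the SSTables so the new setting is applied to them
    nodetool -h <node_address> upgradesstables <keyspace_name> <cf_name>

    # then retry the repair you were running
    nodetool -h <node_address> repair -pr

Keep in mind that putting the fp_chance back to a normal value later will need another upgradesstables pass.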