Something that bit us recently was the size of bloom filters: we have a column family which is mostly written to, and only read sequentially, so we were able to free a lot of memory and decrease GC pressure by increasing bloom_filter_fp_chance for that particular CF.
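A minimal sketch of that kind of change from cassandra-cli, assuming a placeholder keyspace/CF (Keyspace1 / WriteMostlyCF) and an illustrative fp chance of 0.1 - the right value depends on how rarely the CF is actually read:

    # Keyspace1, WriteMostlyCF and the 0.1 value below are placeholders, not recommendations
    cassandra-cli -h localhost <<'EOF'
    use Keyspace1;
    update column family WriteMostlyCF with bloom_filter_fp_chance = 0.1;
    EOF

If I recall correctly, bloom filters are per-SSTable, so the new fp chance only applies to SSTables written after the change; the memory win shows up gradually as old SSTables are compacted away (or after rebuilding them with scrub/upgradesstables).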
This was on 1.0.12.

/Janne

On 18 Nov 2012, at 21:38, aaron morton wrote:

>> 1. How many GCInspector warnings per hour are considered 'normal'?
> None.
> A couple during compaction or repair is not the end of the world. But if you
> have enough to think about "per hour", it's too many.
>
>> 2. What should be the next thing to check?
> Try to determine if the GC activity correlates to application workload,
> compaction or repair.
>
> Try to determine what the working set of the server is. Watch the GC activity
> (via gc logs or JMX) and see what the size of the tenured heap is after a
> CMS, or try to calculate it:
> http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
>
> Look at your data model and query patterns for places where very large
> queries are being made, or rows that are very long lived with a lot of
> deletes (probably not as much of an issue with LDB).
>
>> 3. What are the possible failure reasons and how to prevent those?
>
> As above.
> As a workaround, sometimes drastically slowing down compaction can help. For
> LDB try reducing in_memory_compaction_limit_in_mb and
> compaction_throughput_mb_per_sec.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/11/2012, at 7:07 PM, Иван Соболев <sobol...@gmail.com> wrote:
>
>> Dear Community,
>>
>> advice from you is needed.
>>
>> We have a cluster, 1/6 of the nodes of which died for various reasons (3 had
>> an OOM message). Nodes died in groups of 3, 1, 2. No adjacent nodes died,
>> though we use SimpleSnitch.
>>
>> Version: 1.1.6
>> Hardware: 12Gb RAM / 8 cores (virtual)
>> Data: 40Gb/node
>> Nodes: 36 nodes
>>
>> Keyspaces: 2 (RF=3, R=W=2) + 1 (OpsCenter)
>> CFs: 36, 2 indexes
>> Partitioner: Random
>> Compaction: Leveled (we don't want 2x space for housekeeping)
>> Caching: Keys only
>>
>> All is pretty much standard apart from the one CF receiving writes in 64K
>> chunks and having sstable_size_in_mb=100.
>> No JNA installed - this is to be fixed soon.
>>
>> Checking sysstat/sar I can see 80-90% CPU idle, no anomalies in IO, and the
>> only change is network activity spiking.
>> All the nodes, before dying, had the following in their logs:
>>> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher    1    4    0
>>> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter            1    3    0
>>> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff          1    6    0
>>> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager      5    9
>>
>> GCInspector warnings were there too; they went from ~0.8 to 3Gb of heap in
>> 5-10 mins.
>>
>> So, could you please give me a hint on:
>> 1. How many GCInspector warnings per hour are considered 'normal'?
>> 2. What should be the next thing to check?
>> 3. What are the possible failure reasons and how to prevent those?
>>
>> Thank you very much in advance,
>> Ivan
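For reference, a rough sketch of the compaction-throttling workaround mentioned above, plus one way to watch the tenured heap settle after CMS; the values, host and pid below are placeholders rather than recommendations:

    # cassandra.yaml (picked up at restart) - values are illustrative only:
    #   in_memory_compaction_limit_in_mb: 32
    #   compaction_throughput_mb_per_sec: 8

    # compaction throughput can also be lowered on a live node:
    nodetool -h <host> setcompactionthroughput 8

    # watch old-gen occupancy (the O column) drop after each CMS cycle;
    # what remains approximates the working set:
    jstat -gcutil <cassandra_pid> 5000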