Something that bit us recently was the size of bloom filters: we have a column family which is mostly written to, and only read sequentially, so we were able to free a lot of memory and decrease GC pressure by increasing bloom_filter_fp_chance for that particular CF.
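A minimal sketch of that kind of change from cassandra-cli, assuming a placeholder keyspace/CF (Keyspace1 / WriteMostlyCF) and an illustrative fp chance of 0.1 - the right value depends on how rarely the CF is actually read:

    # Keyspace1, WriteMostlyCF and the 0.1 value below are placeholders, not recommendations
    cassandra-cli -h localhost <<'EOF'
    use Keyspace1;
    update column family WriteMostlyCF with bloom_filter_fp_chance = 0.1;
    EOF

If I recall correctly, bloom filters are per-SSTable, so the new fp chance only applies to SSTables written after the change; the memory win shows up gradually as old SSTables are compacted away (or after rebuilding them with scrub/upgradesstables).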
This was on 1.0.12.

/Janne

On 18 Nov 2012, at 21:38, aaron morton wrote:

>> 1. How many GCInspector warnings per hour are considered 'normal'?
> None.
> A couple during compaction or repair is not the end of the world. But if you
> have enough to think about "per hour", it's too many.
>
>> 2. What should be the next thing to check?
> Try to determine if the GC activity correlates to application workload,
> compaction or repair.
>
> Try to determine what the working set of the server is. Watch the GC activity
> (via gc logs or JMX) and see what the size of the tenured heap is after a
> CMS, or try to calculate it:
> http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
>
> Look at your data model and query patterns for places where very large
> queries are being made, or rows that are very long lived with a lot of
> deletes (probably not as much of an issue with LDB).
>
>> 3. What are the possible failure reasons and how to prevent those?
>
> As above.
> As a workaround, sometimes drastically slowing down compaction can help. For
> LDB try reducing in_memory_compaction_limit_in_mb and
> compaction_throughput_mb_per_sec.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/11/2012, at 7:07 PM, Иван Соболев <sobol...@gmail.com> wrote:
>
>> Dear Community,
>>
>> advice from you is needed.
>>
>> We have a cluster, 1/6 of the nodes of which died for various reasons (3 had
>> an OOM message). Nodes died in groups of 3, 1, 2. No adjacent nodes died,
>> though we use SimpleSnitch.
>>
>> Version: 1.1.6
>> Hardware: 12Gb RAM / 8 cores (virtual)
>> Data: 40Gb/node
>> Nodes: 36 nodes
>>
>> Keyspaces: 2 (RF=3, R=W=2) + 1 (OpsCenter)
>> CFs: 36, 2 indexes
>> Partitioner: Random
>> Compaction: Leveled (we don't want 2x space for housekeeping)
>> Caching: Keys only
>>
>> All is pretty much standard apart from the one CF receiving writes in 64K
>> chunks and having sstable_size_in_mb=100.
>> No JNA installed - this is to be fixed soon.
>>
>> Checking sysstat/sar I can see 80-90% CPU idle, no anomalies in IO, and the
>> only change is network activity spiking.
>> All the nodes, before dying, had the following in their logs:
>>> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher    1    4    0
>>> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter            1    3    0
>>> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff          1    6    0
>>> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager      5    9
>>
>> GCInspector warnings were there too; they went from ~0.8 to 3Gb of heap in
>> 5-10 mins.
>>
>> So, could you please give me a hint on:
>> 1. How many GCInspector warnings per hour are considered 'normal'?
>> 2. What should be the next thing to check?
>> 3. What are the possible failure reasons and how to prevent those?
>>
>> Thank you very much in advance,
>> Ivan
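For reference, a rough sketch of the compaction-throttling workaround mentioned above, plus one way to watch the tenured heap settle after CMS; the values, host and pid below are placeholders rather than recommendations:

    # cassandra.yaml (picked up at restart) - values are illustrative only:
    #   in_memory_compaction_limit_in_mb: 32
    #   compaction_throughput_mb_per_sec: 8

    # compaction throughput can also be lowered on a live node:
    nodetool -h <host> setcompactionthroughput 8

    # watch old-gen occupancy (the O column) drop after each CMS cycle;
    # what remains approximates the working set:
    jstat -gcutil <cassandra_pid> 5000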