Hi all, thank you very much for the help. Aaron was right - we had a multiget_count query which, depending on the app input, would result in a calculation being performed for ~40k keys.
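
As an illustration only (a minimal sketch, not the fix that was actually released): one way to keep a multiget-style call from touching tens of thousands of keys in a single request is to split the key list into small batches. The names `chunked` and `batched_multiget_count` are hypothetical, and the `multiget_count` argument is a stand-in for whatever client call takes a list of row keys and returns a count per key.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batched_multiget_count(
    multiget_count: Callable[[List[str]], Dict[str, int]],
    keys: Iterable[str],
    batch_size: int = 100,  # assumed value; tune for your cluster
) -> Dict[str, int]:
    """Call `multiget_count` once per batch and merge the per-key counts.

    `multiget_count` is a hypothetical callable wrapping the real client
    call; each invocation only ever sees `batch_size` keys at most.
    """
    counts: Dict[str, int] = {}
    for batch in chunked(keys, batch_size):
        counts.update(multiget_count(batch))
    return counts
```

The batch size of 100 is an arbitrary assumption. The point is only that each request stays bounded no matter how many keys the app input produces.
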
We've released the fix, and GCInspector warnings went from ~100 per day per node to ~1 per day per 30 nodes :)

Thank you very much!
Ivan

2012/11/19 Viktor Jevdokimov <viktor.jevdoki...@adform.com>

> We've seen OOM in a situation when the OS was not properly prepared in
> production.
>
> http://www.datastax.com/docs/1.1/install/recommended_settings
>
> Best regards / Pagarbiai
> Viktor Jevdokimov
> Senior Developer
>
> From: some.unique.lo...@gmail.com [mailto:some.unique.lo...@gmail.com]
> On Behalf Of Ивaн Cобoлeв
> Sent: Saturday, November 17, 2012 08:08
> To: user@cassandra.apache.org
> Subject: Cassandra nodes failing with OOM
>
> Dear Community,
>
> your advice is needed.
>
> We have a cluster, 1/6 of whose nodes died for various reasons (3 had an
> OOM message).
>
> Nodes died in groups of 3, 1, 2. No adjacent nodes died, though we use
> SimpleSnitch.
>
> Version: 1.1.6
> Hardware: 12 GB RAM / 8 cores (virtual)
> Data: 40 GB/node
> Nodes: 36 nodes
>
> Keyspaces: 2 (RF=3, R=W=2) + 1 (OpsCenter)
> CFs: 36, 2 indexes
> Partitioner: Random
> Compaction: Leveled (we don't want 2x space for housekeeping)
> Caching: Keys only
>
> All is pretty much standard apart from the one CF receiving writes in 64K
> chunks and having sstable_size_in_mb=100.
>
> No JNA installed - this is to be fixed soon.
>
> Checking sysstat/sar I can see 80-90% CPU idle, no anomalies in I/O, and
> the only change is network activity spiking.
>
> All the nodes had the following in their logs before dying:
>
> > INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher 1 4 0
> > INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter 1 3 0
> > INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff 1 6 0
> > INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager 5 9
>
> GCInspector warnings were there too; heap usage went from ~0.8 GB to 3 GB
> in 5-10 minutes.
>
> So, could you please give me a hint on:
>
> 1. How many GCInspector warnings per hour are considered 'normal'?
> 2. What should be the next thing to check?
> 3. What are the possible failure reasons and how to prevent those?
>
> Thank you very much in advance,
> Ivan