Thanks for your answer, Alexander.
We're writing constantly to the table; we estimate it's something like 1.5k to 2k writes per second. Some of these requests update a bunch of fields, some update fields and append something to a set. We don't read from it constantly, but when we do it's a lot of reads, up to 20k reads per second sometimes. For this particular keyspace everything is using the size-tiered compaction strategy.

- Every node is a physical server with an 8-core CPU, 32GB of RAM and 3TB of SSD.

- The Java version is 1.8.0_101 on all nodes except one, which is using 1.8.0_111 (only for about a week I think, before that it used 1.8.0_101 too).

- We're using the G1 GC. I looked at the 19th and on that day we had:
  - 1505 GCs in total
  - 2 old gen GCs which took around 5s each
  - the rest were new gen GCs, with only one other taking about 1s. There were 15 to 20 GCs which took between 0.4s and 0.7s; the rest took roughly 250ms to 400ms. Sometimes there are 3, 4 or 5 GCs in a row within about 2 seconds, each taking 250ms to 400ms, but that's fairly rare from what I can see.

- Regarding GC logs, I have them enabled and I've got a bunch of gc.log.X files in /var/log/cassandra, but somehow I can't find any log files for certain periods. On one node which crashed this morning I lost about a week of GC logs, no idea what is happening there...

- I'll just put a couple of warnings here; there are around 9k just for today:

WARN [SharedPool-Worker-8] 2016-11-21 17:02:00,497 SliceQueryFilter.java:320 - Read 2001 live and 11129 tombstone cells in foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000 columns were requested, slices=[-]
WARN [SharedPool-Worker-1] 2016-11-21 17:02:02,559 SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000 columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-]
WARN [SharedPool-Worker-2] 2016-11-21 17:02:05,286 SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000 columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-]
WARN [SharedPool-Worker-11] 2016-11-21 17:02:08,860 SliceQueryFilter.java:320 - Read 2001 live and 19966 tombstone cells in foo.install_info for key: foo@IOS:10 (see tombstone_warn_threshold). 2000 columns were requested, slices=[-]

  We're guessing this is bad since it's warning us, but does it have a significant impact on the heap / GC? I don't really know.

- cfstats tells me this:

  Average live cells per slice (last five minutes): 1458.029594846951
  Maximum live cells per slice (last five minutes): 2001.0
  Average tombstones per slice (last five minutes): 1108.2466913854232
  Maximum tombstones per slice (last five minutes): 22602.0

- Regarding swap, it's not disabled anywhere; I must say we never really thought about it. Does disabling it provide a significant benefit?

Thanks for your help, really appreciated!
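To try to quantify the tombstone issue on my side, I hacked together the quick sketch below. It only uses the system.log warnings, cfstats and sstablemetadata; the keyspace/table name (foo.install_info) comes from the warnings above, but the data directory is just the default /var/lib/cassandra/data, so adjust if yours differs. The idea is to see which partition keys trigger the most tombstone warnings and how many droppable tombstones each SSTable of that table carries. Does that look like a sane way to check?

    # Which partition keys trigger the most tombstone warnings today?
    grep 'tombstone cells' /var/log/cassandra/system.log \
      | grep "$(date +%F)" \
      | sed -e 's/.*for key: //' -e 's/ (see tombstone_warn_threshold.*//' \
      | sort | uniq -c | sort -rn | head

    # Table-level tombstone counters (same numbers cfstats reports above)
    nodetool cfstats foo.install_info | grep -i tombstone

    # Droppable tombstone ratio per SSTable of that table
    # (assumes the default data directory and that sstablemetadata is on the PATH)
    for f in /var/lib/cassandra/data/foo/install_info-*/*-Data.db; do
      echo "== $f"
      sstablemetadata "$f" | grep -i 'droppable tombstones'
    done

If the counts line up with the keys from the warnings (foo@IOS:7, foo@IOS:10), at least we'd know which rows / access pattern generate them.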
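On the swap question: from what I've read, the usual recommendation is to run Cassandra nodes with swap disabled entirely, since letting the kernel page out parts of the JVM heap tends to produce exactly the long pauses and OOM-killer kills we're seeing rather than any benefit. This is roughly what I'd plan to run on each node (just a sketch; the sysctl.d file name is arbitrary). Does that look right to you?

    # Turn swap off right now
    sudo swapoff -a

    # Keep it off across reboots: comment out swap entries in /etc/fstab
    sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab

    # And tell the kernel not to favour swapping if it ever comes back
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-cassandra.conf
    sudo sysctl -p /etc/sysctl.d/99-cassandra.conf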
On Mon, Nov 21, 2016, at 04:13 PM, Alexander Dejanovski wrote:
> Vincent,
>
> only the 2.68GB partition is out of bounds here, all the others (<256MB) shouldn't be much of a problem.
> It could put pressure on your heap if it is often read and/or compacted.
> But to answer your question about the 1% harming the cluster, a few big partitions can definitely be a big problem depending on your access patterns.
> Which compaction strategy are you using on this table ?
>
> Could you provide/check the following things on a node that crashed recently :
> * Hardware specifications (how many cores ? how much RAM ? Bare metal or VMs ?)
> * Java version
> * GC pauses throughout a day (grep GCInspector /var/log/cassandra/system.log) : check if you have many pauses that take more than 1 second
> * GC logs at the time of a crash (if you don't produce any, you should activate them in cassandra-env.sh)
> * Tombstone warnings in the logs and a high number of tombstones read in cfstats
> * Make sure swap is disabled
>
> Cheers,
>
> On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>> @Vladimir
>>
>> We tried with 12Gb and 16Gb, the problem appeared eventually too.
>> In this particular cluster we have 143 tables across 2 keyspaces.
>>
>> @Alexander
>>
>> We have one table with a max partition of 2.68GB, one of 256MB, a bunch with sizes varying between 10MB and 100MB ~. Then there's the rest with the max lower than 10MB.
>>
>> On the biggest, the 99% is around 60MB, 98% around 25MB, 95% around 5.5MB.
>> On the one with a max of 256MB, the 99% is around 4.6MB, 98% around 2MB.
>>
>> Could the 1% here really have that much impact ? We do write a lot to the biggest table and read quite often too, however I have no way to know if that big partition is ever read.
>>
>> On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:
>>> Hi Vincent,
>>>
>>> one of the usual causes of OOMs is very large partitions.
>>> Could you check your nodetool cfstats output in search of large partitions ? If you find one (or more), run nodetool cfhistograms on those tables to get a view of the partition sizes distribution.
>>>
>>> Thanks
>>>
>>> On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com> wrote:
>>>> Did you try any value in the range 8-20 (e.g. 60-70% of physical memory)?
>>>> Also how many tables do you have across all keyspaces? Each table can consume a minimum of 1M of Java heap.
>>>>
>>>> Best regards, Vladimir Yudovin,
>>>> *Winguzone[1] - Hosted Cloud Cassandra. Launch your cluster in minutes.*
>>>>
>>>> ---- On Mon, 21 Nov 2016 05:13:12 -0500 *Vincent Rischmann <m...@vrischmann.me>* wrote ----
>>>>
>>>>> Hello,
>>>>>
>>>>> we have an 8 node Cassandra 2.1.15 cluster at work which is giving us a lot of trouble lately.
>>>>>
>>>>> The problem is simple: nodes regularly die because of an out of memory exception, or the Linux OOM killer decides to kill the process.
>>>>> For a couple of weeks now we have increased the heap to 20Gb hoping it would solve the out of memory errors, but in fact it didn't; instead of getting out of memory exceptions, the OOM killer killed the JVM.
>>>>>
>>>>> We reduced the heap on some nodes to 8Gb to see if it would work better, but some nodes crashed again with out of memory exceptions.
>>>>>
>>>>> I suspect some of our tables are badly modelled, which would cause Cassandra to allocate a lot of data, however I don't know how to prove that and/or find which table is bad, and which query is responsible.
>>>>>
>>>>> I tried looking at metrics in JMX, and tried profiling using mission control, but it didn't really help; it's possible I missed it because I have no idea what to look for exactly.
>>>>>
>>>>> Anyone have some advice for troubleshooting this ?
>>>>>
>>>>> Thanks.
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com[2]
>>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com[3]

Links:
1. https://winguzone.com?from=list
2. http://www.thelastpickle.com/
3. http://www.thelastpickle.com/