Hi Randal,

Please consider using jemalloc instead of glibc as the default memory allocator [1] to avoid memory fragmentation. As far as I know, at least two groups of users, running Flink on YARN and on Kubernetes respectively, have reported a similar problem of memory growing continuously after each restart [2]. In both cases the problem went away once they switched to jemalloc.
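For reference, a minimal sketch of how jemalloc could be preloaded in a Debian-based Flink image (the base image tag, package name and library path are assumptions and may differ for your distribution):

    FROM flink:1.12.1
    # Install jemalloc and preload it so every process in the container
    # uses it instead of the glibc allocator
    RUN apt-get update && \
        apt-get install -y --no-install-recommends libjemalloc2 && \
        rm -rf /var/lib/apt/lists/*
    ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2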
[1] https://issues.apache.org/jira/browse/FLINK-19125
[2] https://issues.apache.org/jira/browse/FLINK-18712

Best
Yun Tang

________________________________
From: Lasse Nedergaard <lassenedergaardfl...@gmail.com>
Sent: Wednesday, February 3, 2021 14:07
To: Xintong Song <tonysong...@gmail.com>
Cc: user <user@flink.apache.org>
Subject: Re: Memory usage increases on every job restart resulting in eventual OOMKill

Hi

We had something similar, and our problem was class loader leaks. We used a summary-log component to reduce logging, but it turned out that it used a static object that wasn't released when we got an OOM or a restart. Flink was reusing the task managers, so the only workaround was to stop the job, wait until the task managers were removed, and start again, until we fixed the underlying problem.

Med venlig hilsen / Best regards
Lasse Nedergaard

On 3 Feb 2021, at 02:54, Xintong Song <tonysong...@gmail.com> wrote:

How is the memory measured? I mean: which Flink or k8s metric is collected? I'm asking because, depending on which metric is used, the *container memory usage* can be defined differently, e.g. whether mmap memory is included.

Also, could you share the effective memory configuration for the taskmanagers? You should find something like the following at the beginning of the taskmanager logs.

INFO  [] - Final TaskExecutor Memory configuration:
INFO  [] -   Total Process Memory:          1.688gb (1811939328 bytes)
INFO  [] -     Total Flink Memory:          1.250gb (1342177280 bytes)
INFO  [] -       Total JVM Heap Memory:     512.000mb (536870902 bytes)
INFO  [] -         Framework:               128.000mb (134217728 bytes)
INFO  [] -         Task:                    384.000mb (402653174 bytes)
INFO  [] -       Total Off-heap Memory:     768.000mb (805306378 bytes)
INFO  [] -         Managed:                 512.000mb (536870920 bytes)
INFO  [] -         Total JVM Direct Memory: 256.000mb (268435458 bytes)
INFO  [] -           Framework:             128.000mb (134217728 bytes)
INFO  [] -           Task:                  0 bytes
INFO  [] -           Network:               128.000mb (134217730 bytes)
INFO  [] -     JVM Metaspace:               256.000mb (268435456 bytes)
INFO  [] -     JVM Overhead:                192.000mb (201326592 bytes)

Thank you~

Xintong Song

On Tue, Feb 2, 2021 at 8:59 PM Randal Pitt <randal.p...@foresite.com> wrote:

Hi Xintong Song,

Correct, we are using standalone k8s. Task managers are deployed as a statefulset, so they have consistent pod names. We tried using native k8s (in fact I'd prefer to) but got persistent "io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 242214695 (242413759)" errors, which resulted in jobs being restarted every 30-60 minutes.

We are using Prometheus Node Exporter to capture memory usage. The graph shows the metric:

sum(container_memory_usage_bytes{container_name="taskmanager",pod_name=~"$flink_task_manager"}) by (pod_name)

I've attached the original <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2869/Screenshot_2021-02-02_at_11.png> so Nabble doesn't shrink it.

Best regards,

Randal.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
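A note on the metric in the last message: container_memory_usage_bytes counts page cache and mmap-backed file pages, while container_memory_working_set_bytes excludes reclaimable (inactive file) cache and is closer to what OOM accounting acts on. Plotting both can show whether the growth is real resident memory or mostly cache. A sketch of the two queries, assuming the same cAdvisor label names as in the query above:

    # Total usage: includes page cache and mmap-backed file pages
    sum(container_memory_usage_bytes{container_name="taskmanager",pod_name=~"$flink_task_manager"}) by (pod_name)

    # Working set: usage minus reclaimable (inactive file) cache
    sum(container_memory_working_set_bytes{container_name="taskmanager",pod_name=~"$flink_task_manager"}) by (pod_name)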