Hi Prashant, I think Yu Chen has already given professional troubleshooting advice. One more thing I want to ask: do you use any user-defined functions that store objects? You could first dump the memory to get more details and check for memory leaks.
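To make that concrete, here is a minimal hypothetical sketch (the class and field names are invented, not taken from your job) of the kind of user-defined function I mean: one that caches objects in a plain member collection instead of Flink state, which would show up in a heap dump as a single ever-growing map.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;

// Hypothetical illustration only: a user-defined function that keeps objects
// in an instance field instead of Flink state. The map grows with every
// distinct value and is never cleared, so TaskManager memory rises steadily.
public class LeakyEnrichmentFunction extends RichMapFunction<String, String> {

    // Lives outside Flink's state backend, so it is neither checkpointed
    // nor bounded; it simply accumulates on the JVM heap.
    private final Map<String, String> seen = new HashMap<>();

    @Override
    public String map(String value) {
        // Every new value adds an entry that is never evicted.
        seen.putIfAbsent(value, value.toUpperCase());
        return seen.get(value);
    }
}
```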
--

Best!
Xuyang


On 2023-11-28 09:12:01, "Yu Chen" <yuchen.e...@gmail.com> wrote:

Hi Prashant,

An OOMKill is most often caused by the working-set memory exceeding the pod limit. We should first increase the JVM overhead memory via the following parameters and observe whether that solves the problem.

```
taskmanager.memory.jvm-overhead.max=1536m
taskmanager.memory.jvm-overhead.min=1536m
```

If the OOMKill still occurs, we need to suspect that the task has an off-heap memory leak. One of the most popular tools, jemalloc, is recommended. You have to install jemalloc in the image according to the document [1]. After that, you can enable jemalloc profiling by setting the following environment variable for the taskmanager:

```
containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
```

After the job has been running for a while, you can log into the Taskmanager and generate SVG files to inspect the off-heap memory distribution:

```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
```

Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit exceeded", you should dump the heap to find out which objects are taking up so much memory. Here is the command:

```
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
```

[1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes 202308.01 documentation

Best,
Yu Chen

From: prashant parbhane <parbhane....@gmail.com>
Sent: November 28, 2023 1:42
To: user@flink.apache.org <user@flink.apache.org>
Subject: oomkill issue

Hello,

We have been facing this OOMKill issue, where task managers are getting restarted with this error. I am seeing memory consumption increasing linearly; I have given as much memory and CPU as possible but am still facing the same issue.

We are using RocksDB for the state backend. Is there a way to find which operator is causing this issue, or which operator takes the most memory? Any good practices we can follow?

We are using broadcast state.

Thanks,
Prashant
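On the broadcast-state point: broadcast state is kept on the JVM heap even when RocksDB is the configured state backend, so broadcast entries that are added but never removed will grow TaskManager memory steadily. Below is a minimal hypothetical sketch (operator and descriptor names are invented, not from the job in question) of that pattern, purely to illustrate what is worth checking.

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical illustration only: the broadcast side keeps inserting rules
// and never removes stale ones, so the broadcast map grows without bound
// on every TaskManager, regardless of the configured state backend.
public class RuleApplier
        extends KeyedBroadcastProcessFunction<String, String, String, String> {

    // Descriptor for the broadcast map state (names are illustrative).
    private static final MapStateDescriptor<String, String> RULES =
            new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

    @Override
    public void processElement(String value, ReadOnlyContext ctx, Collector<String> out)
            throws Exception {
        // Read-only lookup against the broadcast state on the keyed side.
        String rule = ctx.getBroadcastState(RULES).get(value);
        out.collect(rule == null ? value : rule);
    }

    @Override
    public void processBroadcastElement(String rule, Context ctx, Collector<String> out)
            throws Exception {
        // Entries are only ever added; without a removal/expiry path the
        // broadcast state keeps growing on the heap.
        ctx.getBroadcastState(RULES).put(rule, rule);
    }
}
```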