Hi Yu,

Thanks for your reply.
When I run the script below:

```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
```

I get the following error:

```
Gathering CPU profile from http:///pprof/profile?seconds=30 for 30 seconds to
  /root/jeprof/java.1701718686.
Be patient...
Failed to get profile: curl -s --fail --max-time 90 'http:///pprof/profile?seconds=30' > /root/jeprof/.tmp.java.1701718686.: No such file or directory
```

Any input on this?

However, the OOMKill was resolved with the following RocksDB configuration:

"state.backend.rocksdb.memory.managed": "false",
"state.backend.rocksdb.block.cache-size": "10m",
"state.backend.rocksdb.writebuffer.size": "128m",
"state.backend.rocksdb.writebuffer.count": "134217728",
"state.backend.rocksdb.ttl.compaction.filter.enabled": "true"

Thanks,
Prashant

On Mon, Nov 27, 2023 at 7:11 PM Xuyang <xyzhong...@163.com> wrote:

> Hi, Prashant.
> I think Yu Chen has given professional troubleshooting ideas. Another
> thing I want to ask is whether you use any user-defined functions to
> store objects. You can first dump the memory and get more details to
> check for memory leaks.
>
> --
> Best!
> Xuyang
>
> On 2023-11-28 09:12:01, "Yu Chen" <yuchen.e...@gmail.com> wrote:
>
> Hi Prashant,
>
> An OOMKill is mostly caused by the working set memory exceeding the pod
> limit.
> First, increase the JVM overhead memory appropriately via the following
> parameters and observe whether the problem is resolved:
> ```
> taskmanager.memory.jvm-overhead.max=1536m
> taskmanager.memory.jvm-overhead.min=1536m
> ```
>
> If the OOMKill still occurs, we need to suspect an off-heap memory leak
> in the task.
> One of the most popular tools for this, jemalloc, is recommended. You
> have to install jemalloc in the image according to the document [1].
> After that, you can enable jemalloc profiling by setting the following
> environment variable for the taskmanager:
> ```
> containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
> ```
> After running for a while, you can log into the taskmanager and generate
> SVG files to troubleshoot the off-heap memory distribution:
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
> ```
>
> Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit
> exceeded", you should dump the heap memory to find out which objects are
> taking up so much of the memory.
> Here is the command for you:
> ```
> jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
> ```
>
> [1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes
> 202308.01 documentation <https://support.sentieon.com/appnotes/jemalloc/>
>
> Best,
> Yu Chen
> ------------------------------
> From: prashant parbhane <parbhane....@gmail.com>
> Sent: November 28, 2023 1:42
> To: user@flink.apache.org <user@flink.apache.org>
> Subject: oomkill issue
>
> Hello,
>
> We have been facing this OOMKill issue, where task managers are getting
> restarted with this error.
> I am seeing memory consumption increasing in a linear manner; I have
> given memory and CPU as high as possible but am still facing the same
> issue.
>
> We are using RocksDB for the state backend. Is there a way to find which
> operator is causing this issue, or which operator takes more memory? Are
> there any good practices that we can follow? We are using broadcast state.
>
> Thanks,
> Prashant
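P.S. On the jeprof error above: I have not found the root cause yet, but below is a rough sanity check of the preconditions the jeprof command relies on, run from inside the taskmanager container. This is only a sketch based on the settings from Yu's reply (the MALLOC_CONF value and the /tmp/jeprof.out prefix are the ones quoted above); paths may differ in other setups.

```
# Run inside the taskmanager container/pod.

# 1) The jemalloc profiling settings should be visible in the process environment.
echo "$MALLOC_CONF"
# expected: prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out

# 2) Heap profile files should accumulate under the configured prefix over time.
ls -lh /tmp/jeprof.out.*.heap

# 3) jeprof takes the java binary as its first argument, so `which java`
#    must resolve to a real path inside the container.
which java
```

If either the environment variable or the heap files are missing, the profiling setup itself (rather than the jeprof invocation) is probably the thing to fix first.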