Hi Yu,

Thanks for your reply.
When I run the script below:

```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
```

I get the following error:

```
Gathering CPU profile from http:///pprof/profile?seconds=30 for 30 seconds to
  /root/jeprof/java.1701718686.
Be patient...
Failed to get profile: curl -s --fail --max-time 90 'http:///pprof/profile?seconds=30' > /root/jeprof/.tmp.java.1701718686.: No such file or directory
```

Any input on this?

However, the OOMKill was resolved with the following RocksDB configuration:

"state.backend.rocksdb.memory.managed": "false",
"state.backend.rocksdb.block.cache-size": "10m",
"state.backend.rocksdb.writebuffer.size": "128m",
"state.backend.rocksdb.writebuffer.count": "134217728",
"state.backend.rocksdb.ttl.compaction.filter.enabled": "true"

Thanks,
Prashant

On Mon, Nov 27, 2023 at 7:11 PM Xuyang <xyzhong...@163.com> wrote:

> Hi, Prashant.
> I think Yu Chen has given professional troubleshooting ideas. Another
> thing I want to ask is whether you use any user-defined functions to
> store objects. You can first dump the memory and get more details to
> check for memory leaks.
>
> --
> Best!
> Xuyang
>
> On 2023-11-28 09:12:01, "Yu Chen" <yuchen.e...@gmail.com> wrote:
>
> Hi Prashant,
>
> An OOMKill is mostly caused by the working set memory exceeding the pod
> limit.
> First, increase the JVM overhead memory appropriately via the following
> parameters and observe whether the problem is resolved:
> ```
> taskmanager.memory.jvm-overhead.max=1536m
> taskmanager.memory.jvm-overhead.min=1536m
> ```
>
> If the OOMKill still occurs, we need to suspect an off-heap memory leak
> in the task.
> One of the most popular tools for this, jemalloc, is recommended. You
> have to install jemalloc in the image according to the document [1].
> After that, you can enable jemalloc profiling by setting the following
> environment variable for the taskmanager:
> ```
> containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
> ```
> After running for a while, you can log into the taskmanager and generate
> SVG files to troubleshoot the off-heap memory distribution:
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
> ```
>
> Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit
> exceeded", you should dump the heap memory to find out which objects are
> taking up so much of the memory.
> Here is the command for you:
> ```
> jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
> ```
>
> [1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes
> 202308.01 documentation <https://support.sentieon.com/appnotes/jemalloc/>
>
> Best,
> Yu Chen
> ------------------------------
> From: prashant parbhane <parbhane....@gmail.com>
> Sent: November 28, 2023 1:42
> To: user@flink.apache.org <user@flink.apache.org>
> Subject: oomkill issue
>
> Hello,
>
> We have been facing this OOMKill issue, where task managers are getting
> restarted with this error.
> I am seeing memory consumption increasing in a linear manner; I have
> given memory and CPU as high as possible but am still facing the same
> issue.
>
> We are using RocksDB for the state backend. Is there a way to find which
> operator is causing this issue, or which operator takes more memory? Are
> there any good practices that we can follow? We are using broadcast state.
>
> Thanks,
> Prashant
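P.S. On the jeprof error above: I have not found the root cause yet, but below is a rough sanity check of the preconditions the jeprof command relies on, run from inside the taskmanager container. This is only a sketch based on the settings from Yu's reply (the MALLOC_CONF value and the /tmp/jeprof.out prefix are the ones quoted above); paths may differ in other setups.

```
# Run inside the taskmanager container/pod.

# 1) The jemalloc profiling settings should be visible in the process environment.
echo "$MALLOC_CONF"
# expected: prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out

# 2) Heap profile files should accumulate under the configured prefix over time.
ls -lh /tmp/jeprof.out.*.heap

# 3) jeprof takes the java binary as its first argument, so `which java`
#    must resolve to a real path inside the container.
which java
```

If either the environment variable or the heap files are missing, the profiling setup itself (rather than the jeprof invocation) is probably the thing to fix first.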