Hi Marco,

When you say OOM, I assume you mean the TM pod is being OOMKilled, is that correct? If so, this usually means the TM is using more memory than is actually allocated to the pod. First I would check your memory configuration to figure out where the extra memory use is coming from. This is a non-trivial task, so I'll list some common situations I've seen in the past to get you started.
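As a quick sanity check on the configuration side, something like the rough sketch below compares the Flink process size with the container memory limit. The helper names and the example values are made up for illustration, it only handles whole-number quantities, and it assumes Flink reads its size suffixes as binary units (so `m` means MiB) while Kubernetes treats `Mi`/`Gi` as binary and `M`/`G` as decimal.

```python
# Rough sketch: compare taskmanager.memory.process.size with the pod memory limit.
# All function names here are hypothetical helpers, not part of Flink or Kubernetes.

# Assumption: Flink size suffixes are binary (1m = 1024 * 1024 bytes).
FLINK_UNITS = {"b": 1, "k": 1024, "kb": 1024, "m": 1024**2, "mb": 1024**2,
               "g": 1024**3, "gb": 1024**3}

# Kubernetes quantities: Ki/Mi/Gi are binary, k/M/G are decimal.
K8S_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
             "k": 1000, "M": 1000**2, "G": 1000**3}


def _split(text):
    """Split '1728m' into (1728, 'm'). Only whole numbers are supported."""
    text = text.strip()
    i = 0
    while i < len(text) and text[i].isdigit():
        i += 1
    return int(text[:i]), text[i:].strip()


def flink_bytes(size):
    """Convert a Flink-style memory size string to bytes."""
    number, unit = _split(size)
    return number * FLINK_UNITS[unit.lower() or "b"]


def k8s_bytes(quantity):
    """Convert a Kubernetes-style memory quantity string to bytes."""
    number, unit = _split(quantity)
    return number * (K8S_UNITS[unit] if unit else 1)


def headroom_bytes(process_size, container_limit):
    """Container limit minus Flink process size; negative means misconfigured."""
    return k8s_bytes(container_limit) - flink_bytes(process_size)


if __name__ == "__main__":
    # Example values only, not taken from your deployment.
    spare = headroom_bytes("1728m", "2Gi")
    print(spare // 1024**2, "MiB of headroom")  # 320 MiB of headroom
```

If the headroom comes out negative, the first situation below is the likely culprit.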
* Misconfigured process memory. The Flink setting `taskmanager.memory.process.size` sets the memory of the entire TM process, which Flink then breaks down into smaller buckets. If this is higher than the memory resource of the container, the pod will get OOMKilled.
* User code has a memory leak (e.g. it spins up too many threads). It would be useful to run your Flink job on a local cluster and monitor the memory use.
* The state backend (if you use RocksDB) is using too much memory.

You can also look at [1] and [2] for more information.

Regards,
Hong

[1] Talk on Flink memory utilisation https://www.youtube.com/watch?v=F5yKSznkls8
[2] Flink description of TM memory breakdown https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup_tm/

From: marco andreas <marcoandreas...@gmail.com>
Date: Wednesday, 25 January 2023 at 19:57
To: user <user@flink.apache.org>
Subject: [EXTERNAL] OOM taskmanager

Hello,

We are deploying a Flink application cluster in Kubernetes with 2 pods, one for the JM and the other for the TM.

The problem is that when we launch load tests, the task manager memory usage increases. After the tests are finished and Flink stops processing data, the memory usage never comes back down to where it was before. When we launch tests again and again, the memory of the TM continues to grow until it reaches the memory resource limit specified in the container templates, and it gets killed because of OOM.

Has anyone faced the same issue, and what is the best way to investigate this error in order to find the root cause of why the memory usage of the TM never comes down when Flink finishes processing?

Flink version is 1.16.0.

Thanks,