Hi,

We're running Flink 1.11.3 on Kubernetes. We have a job with a parallelism of 10 running on 10 task managers, each with 1 task slot. The job has 4 time windows over 2 different keys; 2 of the windows use reducers and 2 are processed by window functions. State is stored in RocksDB.
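For reference, the shape of the job is roughly the following (the tuple type, window sizes, source and checkpoint path are illustrative stand-ins, not our real code); each key is windowed twice, once with a reduce and once with a ProcessWindowFunction:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(10);
        // keyed state in RocksDB, incremental checkpoints enabled
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints", true));

        // stand-in source; the real job reads from an external system
        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));

        // window 1: incrementally reduced, so only one element per key/window is kept
        events.keyBy(t -> t.f0)
              .timeWindow(Time.minutes(5))
              .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
              .print();

        // window 2: ProcessWindowFunction, so the full window contents are buffered in RocksDB
        events.keyBy(t -> t.f0)
              .timeWindow(Time.hours(1))
              .process(new ProcessWindowFunction<Tuple2<String, Long>, Long, String, TimeWindow>() {
                  @Override
                  public void process(String key, Context ctx,
                                      Iterable<Tuple2<String, Long>> in, Collector<Long> out) {
                      long sum = 0;
                      for (Tuple2<String, Long> t : in) { sum += t.f1; }
                      out.collect(sum);
                  }
              })
              .print();

        // ...plus the same two windows again over the second key

        env.execute("window-job-sketch");
    }
}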
We've noticed that when a pod is restarted (say, because the node it was on is restarted), the job restarts and the memory usage of the remaining 9 pods increases by roughly 1GB over the next 1-2 hours, then stays at that level. If another pod restarts, the remaining 9 increase in memory usage again. Eventually one or more pods reach the 6GB limit and are OOMKilled, which restarts the job and pushes memory usage up yet again. If left unattended, this can reach the point where one OOMKill directly triggers another, and another after that; by then it requires manual intervention to resolve.

I think it's exceedingly likely the excess memory usage is in RocksDB rather than in Flink itself. My question is whether there's anything we can do about the increase in memory usage after a failure?

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2869/Screenshot_2021-02-02_at_11.png>

Best regards,
Randal.