Hi Randal,

The image is too blurred to read clearly. I have a few questions:

- IIUC, you are using the standalone K8s deployment [1], not the native K8s deployment [2]. Could you confirm that?
- How is the memory measured?
Thank you~

Xintong Song

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html

On Tue, Feb 2, 2021 at 7:24 PM Randal Pitt <randal.p...@foresite.com> wrote:

> Hi,
>
> We're running Flink 1.11.3 on Kubernetes. We have a job with a parallelism
> of 10 running on 10 task managers, each with 1 task slot. The job has 4
> time windows over 2 different keys; 2 windows have reducers and 2 are
> processed by window functions. State is stored in RocksDB.
>
> We've noticed that when a pod is restarted (say, if the node it was on is
> restarted) the job restarts and the memory usage of the remaining 9 pods
> increases by roughly 1GB over the next 1-2 hours, then stays at that level.
> If another pod restarts, the remaining 9 increase in memory usage again.
> Eventually one or more pods reach the 6GB limit and are OOMKilled, leading
> to the job restarting and memory usage increasing again.
>
> If left unchecked, one OOMKill can directly lead to another, and then
> another. At that point it requires manual intervention to resolve.
>
> I think it's exceedingly likely the excessive memory usage is in RocksDB
> rather than Flink. My question is whether there's anything we can do about
> the increase in memory usage after a failure?
>
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2869/Screenshot_2021-02-02_at_11.png>
>
> Best regards,
>
> Randal.
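For context, a minimal sketch of the Flink 1.11 settings that tie RocksDB's native memory (block cache and write buffers) to Flink's managed memory budget, written here via the Configuration API. The 6g and 0.4 values are placeholders for illustration, not a recommendation, and the class name is made up:

import org.apache.flink.configuration.Configuration;

public class RocksDbMemoryConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Total memory of the TaskManager process; this is the figure that
        // should line up with the Kubernetes container limit (6GB above).
        conf.setString("taskmanager.memory.process.size", "6g");

        // Fraction of Flink's total memory reserved as managed memory, which
        // RocksDB draws from when the option below is enabled (default 0.4).
        conf.setString("taskmanager.memory.managed.fraction", "0.4");

        // Default is true in 1.11: RocksDB allocates its block cache and
        // write buffers from Flink's managed memory instead of growing
        // independently of the configured budget.
        conf.setString("state.backend.rocksdb.memory.managed", "true");

        System.out.println(conf);
    }
}

In a standalone Kubernetes deployment these keys would normally live in flink-conf.yaml (in the image or a ConfigMap) rather than in code; the sketch only shows which settings define the cap that is expected to hold the process under the container limit.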