Hi Randal,
The image is too blurry to be read clearly.
I have a few questions.
- IIUC, you are using the standalone K8s deployment [1], not the native K8s
deployment [2]. Could you confirm that?
- How is the memory measured?
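
Also, as a data point that might help with the comparison: in 1.11 the RocksDB
native memory is normally bounded by Flink's managed memory, which is derived
from the TaskManager memory configuration. A minimal flink-conf.yaml sketch,
with placeholder values rather than your actual settings:

    taskmanager.memory.process.size: 6g        # placeholder: total memory of the TM container
    taskmanager.memory.managed.fraction: 0.4   # share of Flink's total memory reserved as managed memory
    state.backend.rocksdb.memory.managed: true # bound RocksDB block cache / write buffers by managed memory

Whether RocksDB actually stays within that budget is a separate question, but
these are the options that control it in 1.11.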

Thank you~

Xintong Song


[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html

[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html



On Tue, Feb 2, 2021 at 7:24 PM Randal Pitt <randal.p...@foresite.com> wrote:

> Hi,
>
> We're running Flink 1.11.3 on Kubernetes. We have a job with a parallelism of
> 10 running on 10 task managers, each with 1 task slot. The job has 4 time
> windows with 2 different keys: 2 windows use reducers and 2 are processed
> by window functions. State is stored in RocksDB.
>
> We've noticed that when a pod is restarted (say, if the node it was on is
> restarted) the job restarts and the memory usage of the remaining 9 pods
> increases by roughly 1GB over the next 1-2 hours, then stays at that level.
> If another pod restarts, the remaining 9 increase in memory usage again.
> Eventually one or more pods reach the 6GB limit and are OOMKilled, leading
> to the job restarting and memory usage increasing again.
>
> If left unchecked, this can lead to a situation where one OOMKill directly
> leads to another, which directly leads to another. At that point it requires
> manual intervention to resolve.
>
> I think it's exceedingly likely the excessive memory usage is in RocksDB
> rather than Flink itself; my question is whether there's anything we can do
> about the increase in memory usage after a failure.
>
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2869/Screenshot_2021-02-02_at_11.png>
>
>
> Best regards,
>
> Randal.
>
>
>
>
