Job Manager killed by Kubernetes during recovery

Bruno Aranda Sat, 18 Aug 2018 15:58:27 -0700

Hi,

I am experiencing an issue when a job manager tries to recover in an HA
setup. When the job manager restarts and tries to resume from the last
checkpoints, it gets killed (by Kubernetes, I guess), since I can see the
following in the logs while the jobs are being deployed:


INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

I am requesting what should be enough memory for the job manager, 3000Gi,
and the job manager itself is configured to use 2048Gb of memory. I have
tried increasing the max perm size, but did not see an improvement.

Any suggestions to help diagnose this?

I have the following:

Flink 1.6.0 (same with 1.5.1)
Azure AKS with Kubernetes 1.11
State management using RocksDB with checkpoints stored in Azure Data Lake

Thanks!

Bruno
