Hi, I am experiencing an issue when a job manager tries to recover in an HA setup. When the job manager starts again and tries to resume from the last checkpoints, it gets killed by Kubernetes (I guess), since I can see the following in the logs while the jobs are being deployed:
INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

I am requesting enough memory for it, 3000Gi, and it is configured to use 2048Gb of memory. I have tried to increase the max perm size, but did not see an improvement.

Any suggestions to help diagnose this?

I have the following:
- Flink 1.6.0 (same with 1.5.1)
- Azure AKS with Kubernetes 1.11
- State management using RocksDB with checkpoints stored in Azure Data Lake

Thanks!
Bruno