Re: Automatically resuming failed jobs in K8s

2020-06-12 Thread Averell
Thank you very much, Yang.

Re: Automatically resuming failed jobs in K8s

2020-06-10 Thread Yang Wang
Hi Averell, Thanks for trying the native K8s integration. All your issues are due to high availability not being configured. If you start an HA Flink cluster, like the following, then when the JobManager/TaskManager terminates abnormally, all the jobs can recover and restore from the latest checkpoint. Even

Automatically resuming failed jobs in K8s

2020-06-10 Thread Averell
Hi, I'm running some jobs using native Kubernetes. Sometimes, because of an unrelated issue with our K8s cluster (e.g., a K8s node crashing), my Flink pods are gone. The JM pod, as it is deployed using a Deployment, will be re-created automatically. However, all of my jobs are lost. What I have to do now ar