chenyuzhi created FLINK-35857:
---------------------------------

             Summary: Operator restart failed job without latest checkpoint
                 Key: FLINK-35857
                 URL: https://issues.apache.org/jira/browse/FLINK-35857
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.6.1
         Environment: flink kubernetes operator version: 1.6.1
flink version 1.15.2
flink job config: *execution.shutdown-on-application-finish=false*
            Reporter: chenyuzhi
         Attachments: image-2024-07-17-15-03-29-618.png, image-2024-07-17-15-04-32-913.png


Using the Flink Kubernetes Operator with the config:
{code:java}
kubernetes.operator.job.restart.failed=true
{code}
we get different restart results for a failed job in two cases.

Case 1: A job with periodic checkpointing enabled and an initial checkpoint path. When it fails, the operator automatically redeploys the deployment with the same job_id and the latest checkpoint path.

!image-2024-07-17-15-03-29-618.png!

Case 2: A job with periodic checkpointing enabled but no initial checkpoint path. When it fails, the operator automatically redeploys the deployment with a different job_id and no initial checkpoint path.

!image-2024-07-17-15-04-32-913.png!

I think the redeploy behaviour in Case 2 may cause data inconsistency. For example, the Kafka source connector may consume data from the earliest/latest offset instead of resuming from the offsets stored in the latest checkpoint.

So I think a job with periodic checkpointing enabled but no initial checkpoint should be restarted with the same job_id and the latest checkpoint path, just like Case 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)