[ https://issues.apache.org/jira/browse/FLINK-35857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-35857:
-----------------------------------
    Labels: pull-request-available  (was: )

> Operator restart failed job without latest checkpoint
> -----------------------------------------------------
>
>                 Key: FLINK-35857
>                 URL: https://issues.apache.org/jira/browse/FLINK-35857
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>         Environment: flink kubernetes operator version: 1.6.1
>                      flink version 1.15.2
>                      flink job config:
>                      *execution.shutdown-on-application-finish=false*
>            Reporter: chenyuzhi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-07-17-15-03-29-618.png,
>                      image-2024-07-17-15-04-32-913.png
>
>
> Using the Flink Kubernetes operator with this config:
> {code:java}
> kubernetes.operator.job.restart.failed=true
> {code}
> we got different failed-job restart results in two cases.
>
> Case 1:
> A job has periodic checkpointing enabled and an initial checkpoint path.
> When it failed (with latestCompletedCheckpointId=19434), the operator
> automatically redeployed the deployment with the same job_id and used the
> latest checkpoint path (CheckpointId=19434) as the initial checkpoint path.
>
> !image-2024-07-17-15-03-29-618.png|width=763,height=301!
>
> Case 2:
> A job has periodic checkpointing enabled but no initial checkpoint path.
> When it failed (with latestCompletedCheckpointId=30), the operator
> automatically redeployed the deployment with a different job_id and no
> initial checkpoint path.
>
> !image-2024-07-17-15-04-32-913.png|width=759,height=287!
>
> In case 2, the redeploy behaviour may cause data inconsistency. For example,
> the Kafka source connector may consume data from the earliest/latest offset
> instead of resuming from the checkpointed offsets.
>
> Thus I think a job with periodic checkpointing enabled but no initial
> checkpoint path should be restarted with the same job_id and the latest
> checkpoint path, just like in case 1.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)