[jira] [Updated] (FLINK-35857) Operator restart failed job without latest checkpoint

ASF GitHub Bot (Jira) Fri, 19 Jul 2024 22:34:23 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated FLINK-35857:
-----------------------------------
    Labels: pull-request-available  (was: )

> Operator restart failed job without latest checkpoint
> -----------------------------------------------------
>
>                 Key: FLINK-35857
>                 URL: https://issues.apache.org/jira/browse/FLINK-35857
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>         Environment:  flink kubernetes operator version: 1.6.1
> flink version 1.15.2
> flink job config:
> *execution.shutdown-on-application-finish=false*
>            Reporter: chenyuzhi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-07-17-15-03-29-618.png, 
> image-2024-07-17-15-04-32-913.png
>
>
> Using the Flink Kubernetes Operator with the following config:
> {code:java}
> kubernetes.operator.job.restart.failed=true {code}
> We got different restart results for a failed job in two cases.
> Case 1:
>  A job with periodic checkpointing enabled and an initial checkpoint path. When it
> failed (with latestCompletedCheckpointId=19434), the operator automatically
> redeployed the deployment with the same job_id and used the latest checkpoint
> path (CheckpointId=19434) as the initial checkpoint path.
>  
> !image-2024-07-17-15-03-29-618.png|width=763,height=301!
>  
> Case 2:
>  A job with periodic checkpointing enabled but no initial checkpoint. When it
> failed (with latestCompletedCheckpointId=30), the operator automatically redeployed
> the deployment with a different job_id and no initial checkpoint path.
> !image-2024-07-17-15-04-32-913.png|width=759,height=287!
>  
> In case 2, the redeploy behaviour may cause data inconsistency. For example,
> the Kafka source connector may consume data from the earliest/latest offset
> rather than from the offsets recorded in the latest checkpoint.
>  
> Thus I think a job with periodic checkpointing enabled but no initial checkpoint
> should be restarted with the same job_id and the latest checkpoint path, just like
> case 1.
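
The difference between the reported and the requested behaviour can be sketched as a small decision function. This is only an illustrative model of the two cases described above, not the operator's actual code: the names FailedJob, RestartPlan, observed_plan and proposed_plan are hypothetical, the checkpoint paths in the usage note are made up, and the fallback for a job with no completed checkpoint at all is an assumption (the report does not cover it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailedJob:
    """Hypothetical snapshot of a failed job (not an operator class)."""
    job_id: str
    initial_checkpoint_path: Optional[str]  # path the job was first started from
    latest_checkpoint_path: Optional[str]   # last completed periodic checkpoint


@dataclass(frozen=True)
class RestartPlan:
    """Hypothetical description of how the job is redeployed."""
    job_id: Optional[str]        # None: a new job_id is generated
    restore_path: Optional[str]  # None: the job starts with empty state


def observed_plan(job: FailedJob) -> RestartPlan:
    # Behaviour as reported: the latest checkpoint is only reused when the
    # job was originally started with an initial checkpoint path (case 1).
    if job.initial_checkpoint_path is not None:
        return RestartPlan(job.job_id, job.latest_checkpoint_path)
    # Case 2: different job_id and no checkpoint path.
    return RestartPlan(None, None)


def proposed_plan(job: FailedJob) -> RestartPlan:
    # Behaviour requested in this issue: any completed checkpoint is reused,
    # whether or not an initial checkpoint path was configured.
    if job.latest_checkpoint_path is not None:
        return RestartPlan(job.job_id, job.latest_checkpoint_path)
    # Assumption: with no completed checkpoint, keep the current behaviour.
    return observed_plan(job)
```

For a case 2 job such as FailedJob("job-b", None, "/checkpoints/chk-30"), observed_plan returns a plan with no job_id and no restore path, while proposed_plan keeps "job-b" and restores from "/checkpoints/chk-30". For a case 1 job both functions return the same plan.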



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
