[ https://issues.apache.org/jira/browse/FLINK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618264#comment-17618264 ]
Xin Hao commented on FLINK-29566:
---------------------------------

Will submit a PR for that.

> Reschedule the cleanup logic if cancel job failed
> -------------------------------------------------
>
>                 Key: FLINK-29566
>                 URL: https://issues.apache.org/jira/browse/FLINK-29566
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Xin Hao
>            Priority: Minor
>
> Currently, when a FlinkSessionJob object is removed, we always delete the
> object even if the Flink job was not canceled successfully.
>
> This is *not semantically consistent*: the FlinkSessionJob has been removed,
> but the Flink job is still running.
>
> One scenario is a FlinkDeployment deployed in HA mode. If we remove the
> FlinkSessionJob and change the FlinkDeployment at the same time, or if the
> TMs are restarting because of bugs such as OOM, the cancellation of the
> Flink job fails because the TMs are not available.
>
> We should *reschedule* the cleanup logic if the FlinkDeployment is present.
> We can also add a new ReconciliationState, DELETING, to indicate the
> FlinkSessionJob's status.
>
> The logic will be
> {code:java}
> if the FlinkDeployment is not present
>     delete the FlinkSessionJob object
> else
>     if the JM is not available
>         reschedule
>     else
>         if cancel job successfully
>             delete the FlinkSessionJob object
>         else
>             reschedule{code}
> When we cancel the Flink job, we need to verify that all jobs with the same
> name have been deleted, in case the job ID changed after the JM restarted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
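
For illustration only, the decision logic quoted above could be sketched in plain Java roughly as follows. This is not the operator's actual code: the class `SessionJobCleanupSketch`, the `Action` enum, and the `decide` method are hypothetical names. The `cancelAllJobsWithSameName` callback is assumed to cancel the job and return true only once no job with the same name is still running.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the cleanup decision described in the issue; not operator code. */
final class SessionJobCleanupSketch {

    /** Outcome of one cleanup attempt for a FlinkSessionJob that is being removed. */
    enum Action { DELETE_RESOURCE, RESCHEDULE }

    /**
     * Decides what to do with the FlinkSessionJob object.
     *
     * @param deploymentPresent         whether the FlinkDeployment still exists
     * @param jobManagerAvailable       whether the JM can currently be reached
     * @param cancelAllJobsWithSameName assumed callback: cancels the job and returns true only
     *                                  if no job with the same name remains afterwards
     */
    static Action decide(boolean deploymentPresent,
                         boolean jobManagerAvailable,
                         BooleanSupplier cancelAllJobsWithSameName) {
        if (!deploymentPresent) {
            // No cluster left, so there is nothing to cancel.
            return Action.DELETE_RESOURCE;
        }
        if (!jobManagerAvailable) {
            // Cancellation cannot be attempted yet; try again later.
            return Action.RESCHEDULE;
        }
        // Only delete the object once the cancellation is confirmed.
        return cancelAllJobsWithSameName.getAsBoolean()
                ? Action.DELETE_RESOURCE
                : Action.RESCHEDULE;
    }

    public static void main(String[] args) {
        // Deployment present, JM reachable, but the cancel attempt fails.
        System.out.println(decide(true, true, () -> false));
        // Deployment already gone: the object can be deleted directly.
        System.out.println(decide(false, false, () -> false));
    }
}
```

Passing the cancellation as a callback keeps the sketch independent of any real Flink client API, and it means no cancel is attempted while the JM is unreachable.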