Till Rohrmann created FLINK-21979:
-------------------------------------

             Summary: Job can be restarted from the beginning after it reached 
a terminal state
                 Key: FLINK-21979
                 URL: https://issues.apache.org/jira/browse/FLINK-21979
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.12.2, 1.11.3, 1.13.0
            Reporter: Till Rohrmann
             Fix For: 1.14.0


Currently, the {{JobMaster}} removes all checkpoints after a job reaches a 
globally terminal state. Then it notifies the {{Dispatcher}} about the 
termination of the job. The {{Dispatcher}} then removes the job from the 
{{SubmittedJobGraphStore}}. If the {{Dispatcher}} process fails before doing 
that it might get restarted. In this case, the {{Dispatcher}} would still find 
the job in the {{SubmittedJobGraphStore}} and recover it. Since the 
{{CompletedCheckpointStore}} is empty, it would start executing this job from 
the beginning.

I think we must not remove job state before the job has not been marked as done 
or made inaccessible for any restarted processes. Concretely, we should first 
remove the job from the {{SubmittedJobGraphStore}} and only then delete the 
checkpoints. Ideally all the job related cleanup operation happens atomically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to