Yu Li created FLINK-13593:
-----------------------------

             Summary: Prevent failing the wrong job in CheckpointFailureManager
                 Key: FLINK-13593
                 URL: https://issues.apache.org/jira/browse/FLINK-13593
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.9.0
            Reporter: Yu Li
             Fix For: 1.9.0


Because checkpoint decline messages are handled asynchronously in 
{{LegacyScheduler#declineCheckpoint}}, a message may be handled before the job 
status transition, so that {{receiveDeclineMessage}} grabs the lock in 
{{CheckpointCoordinator}} before {{pendingCheckpoints}} is cleared by 
{{stopCheckpointScheduler}} (as triggered by the job status listener 
{{CheckpointCoordinatorDeActivator}}). If the job/tasks then restart quickly 
enough, the {{FailJobCallback}} in {{CheckpointFailureManager}} might 
unexpectedly fail the job again, as observed in FLINK-13527.

To resolve the issue, we need to add a safeguard when failing the job: pass the 
{{ExecutionAttemptID}} through and check it against the current executions to 
make sure the to-be-failed one is still running, so we won't fail the newly 
restarted one by accident.
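
The proposed safeguard could look roughly like the sketch below. This is only 
an illustration under stated assumptions, not the actual patch: the class 
{{GuardedFailJobSketch}}, its nested {{Job}} and {{AttemptId}} types, and the 
methods {{startAttempt}}, {{restart}} and {{failJob}} are all hypothetical 
stand-ins for the job, {{ExecutionAttemptID}} and {{FailJobCallback}}.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; none of these types are Flink's real classes. */
class GuardedFailJobSketch {

    /** Stand-in for ExecutionAttemptID: one instance per execution attempt. */
    static final class AttemptId {
        // Identity equality is enough here: each attempt gets its own object.
    }

    /** Stand-in for the job and its set of currently running executions. */
    static final class Job {
        private final Set<AttemptId> currentExecutions = new HashSet<>();
        private int failCount = 0;

        /** Registers a new running execution attempt and returns its ID. */
        synchronized AttemptId startAttempt() {
            AttemptId id = new AttemptId();
            currentExecutions.add(id);
            return id;
        }

        /** Simulates a restart: old attempts are dropped, a new one starts. */
        synchronized AttemptId restart() {
            currentExecutions.clear();
            return startAttempt();
        }

        /**
         * Guarded counterpart of a fail-job callback. The job is failed only
         * if the attempt that declined the checkpoint is still one of the
         * current executions; a late decline from an old attempt is ignored.
         *
         * @return true if the job was failed, false if the request was stale
         */
        synchronized boolean failJob(AttemptId decliningAttempt, Throwable cause) {
            if (!currentExecutions.contains(decliningAttempt)) {
                return false; // stale decline: do not fail the restarted job
            }
            failCount++;
            return true;
        }

        synchronized int failCount() {
            return failCount;
        }
    }
}
```

With this guard, a decline message that carries the ID of an attempt from 
before the restart is dropped, while a decline from a currently running 
attempt still fails the job as before.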



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
