Yu Li created FLINK-13593:
-----------------------------
Summary: Prevent failing the wrong job in CheckpointFailureManager
Key: FLINK-13593
URL: https://issues.apache.org/jira/browse/FLINK-13593
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.9.0
Reporter: Yu Li
Fix For: 1.9.0
Because the checkpoint decline message is handled asynchronously in
{{LegacyScheduler#declineCheckpoint}}, the message may be handled before the
job status transition. In that case {{receiveDeclineMessage}} grabs the lock
in {{CheckpointCoordinator}} before {{pendingCheckpoints}} is cleared by
{{stopCheckpointScheduler}} (which is triggered by the job status listener
{{CheckpointCoordinatorDeActivator}}). If the job/tasks restart quickly
enough, the {{FailJobCallback}} in {{CheckpointFailureManager}} might then
unexpectedly fail the job again, as observed in FLINK-13527.
To resolve the issue, we need to add a safeguard when failing the job: pass
the {{ExecutionAttemptID}} through and check it against the current executions
to make sure the to-be-failed one is still running, so that we don't fail the
newly restarted one by accident.
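
The proposed safeguard could look roughly like the sketch below. It is not Flink code: the class names ({{GuardedFailJobSketch}}, {{Executions}}, {{GuardedFailJobCallback}}) and all method signatures are hypothetical, and a plain {{java.util.UUID}} stands in for {{ExecutionAttemptID}}. It only illustrates the idea of dropping a decline that refers to an execution attempt that is no longer current.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; none of these types exist in Flink.
public class GuardedFailJobSketch {

    enum State { RUNNING }

    /** Hypothetical registry of the current execution attempts (UUID stands in for ExecutionAttemptID). */
    static final class Executions {
        private final Map<UUID, State> current = new ConcurrentHashMap<>();

        void start(UUID attemptId) {
            current.put(attemptId, State.RUNNING);
        }

        /** A restart discards the old attempt and registers a fresh one. */
        void restart(UUID oldAttemptId, UUID newAttemptId) {
            current.remove(oldAttemptId);
            current.put(newAttemptId, State.RUNNING);
        }

        boolean isRunning(UUID attemptId) {
            return current.get(attemptId) == State.RUNNING;
        }
    }

    /** Hypothetical callback that fails the job only if the declining attempt is still running. */
    static final class GuardedFailJobCallback {
        private final Executions executions;
        private int failures;

        GuardedFailJobCallback(Executions executions) {
            this.executions = executions;
        }

        /** Returns true if the job was failed, false if the decline was stale and ignored. */
        boolean failJob(Throwable cause, UUID attemptId) {
            if (!executions.isRunning(attemptId)) {
                return false; // decline belongs to an attempt from before the restart
            }
            failures++;
            return true;
        }

        int failureCount() {
            return failures;
        }
    }

    public static void main(String[] args) {
        Executions executions = new Executions();
        GuardedFailJobCallback callback = new GuardedFailJobCallback(executions);

        UUID firstAttempt = UUID.randomUUID();
        UUID secondAttempt = UUID.randomUUID();
        executions.start(firstAttempt);
        executions.restart(firstAttempt, secondAttempt);

        // A late decline from the first attempt must not fail the restarted job.
        System.out.println(callback.failJob(new RuntimeException("declined"), firstAttempt));
        // A decline from the current attempt still fails the job.
        System.out.println(callback.failJob(new RuntimeException("declined"), secondAttempt));
    }
}
```

In this sketch the check and the failure are not performed atomically; a real fix would presumably need to run both under the same lock or in the same thread that drives the job status transitions.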