[ https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yun Gao reassigned FLINK-22088:
-------------------------------

    Assignee: Yun Gao

> CheckpointCoordinator might not be able to abort triggering checkpoint if
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22088
>                 URL: https://issues.apache.org/jira/browse/FLINK-22088
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Minor
>              Labels: auto-unassigned
>
> Currently, when a job fails over,
> CheckpointCoordinatorDeActivator#jobStatusChanges -> stopCheckpointScheduler
> tries to cancel all pending checkpoints and also sets periodicScheduling to
> false.
> If a checkpoint has just started triggering at this time, it might already
> have acquired all the executions to trigger before the failover, and it then
> goes on to trigger them. Ideally it should be aborted in
> createPendingCheckpoint -> preCheckGlobalState. However, since the check and
> the creation of the pending checkpoint happen in two different lock scopes,
> CheckpointCoordinator#stopCheckpointScheduler might run between the two lock
> scopes.
> We may optimize this check. However, since the executions would eventually
> fail to trigger the checkpoint, it should not affect the correctness of the
> job. Besides, even if we optimize it, there might still be cases where
> triggering on an execution fails due to a concurrent failover.
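
The interleaving described in the issue can be replayed with a small toy model. This is only a sketch under stated assumptions: ToyCoordinator and TwoLockScopeRace are hypothetical classes, not Flink code. The method names preCheckGlobalState, createPendingCheckpoint and stopCheckpointScheduler are borrowed from the issue text, but their bodies here are simplified guesses, and checkAndCreatePendingCheckpoint is an invented name for one possible way of merging the two lock scopes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the race; NOT Flink's actual CheckpointCoordinator.
class ToyCoordinator {
    private final Object lock = new Object();
    private boolean periodicScheduling = true;
    private final List<Long> pendingCheckpoints = new ArrayList<>();

    // Lock scope 1: check that checkpointing is still active.
    boolean preCheckGlobalState() {
        synchronized (lock) {
            return periodicScheduling;
        }
    }

    // Lock scope 2: register the pending checkpoint without re-checking.
    void createPendingCheckpoint(long id) {
        synchronized (lock) {
            pendingCheckpoints.add(id);
        }
    }

    // Assumed alternative: check and create inside a single lock scope.
    boolean checkAndCreatePendingCheckpoint(long id) {
        synchronized (lock) {
            if (!periodicScheduling) {
                return false;
            }
            pendingCheckpoints.add(id);
            return true;
        }
    }

    // Failover path: abort everything pending and stop scheduling.
    void stopCheckpointScheduler() {
        synchronized (lock) {
            periodicScheduling = false;
            pendingCheckpoints.clear();
        }
    }

    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }
}

class TwoLockScopeRace {
    // Replays the problematic ordering deterministically on one thread:
    // check passes, failover runs between the scopes, checkpoint is created.
    static int racyInterleaving() {
        ToyCoordinator c = new ToyCoordinator();
        boolean ok = c.preCheckGlobalState();
        c.stopCheckpointScheduler();
        if (ok) {
            c.createPendingCheckpoint(1L);
        }
        return c.pendingCount();
    }

    // With one lock scope the failover can only run before or after the
    // combined step; here it runs before, so nothing is created.
    static int singleScopeInterleaving() {
        ToyCoordinator c = new ToyCoordinator();
        c.stopCheckpointScheduler();
        c.checkAndCreatePendingCheckpoint(1L);
        return c.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("two lock scopes: " + racyInterleaving());
        System.out.println("single lock scope: " + singleScopeInterleaving());
    }
}
```

In this model the two-scope ordering leaves one pending checkpoint behind after the scheduler has been stopped, while the single-scope variant leaves none. As the issue notes, merging the scopes would not remove every failure mode, since triggering on an execution can still fail because of a concurrent failover.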