[ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848149#comment-16848149 ]
Yun Tang commented on FLINK-8871:
---------------------------------

Thanks for [~srichter]'s comment. I agree we should fix FLINK-11662 first; it would benefit the issues [~yanghua] planned to track (FLINK-10724, FLINK-12209), not to mention that FLINK-11662 was tagged as the only [known issue|https://flink.apache.org/news/2019/04/09/release-1.8.0.html#known-issues] when Flink 1.8 was released. [~StephanEwen]'s [comment|https://issues.apache.org/jira/browse/FLINK-10930?focusedCommentId=16712017&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16712017] under FLINK-10930 already provides valuable guidance for fixing FLINK-11662. IMO, [~yanghua]'s [PR-8322|https://github.com/apache/flink/pull/8322] would also benefit if we fix FLINK-11662 first.

Regarding our team's progress: we have already implemented a checkpoint cancellation mechanism based on a notify message, together with a cleanup mechanism on the JM side that periodically scans the file system to delete useless checkpoint files. From these issues, it looks like [~yanghua] is focusing on refactoring the checkpoint failure procedure, while we mainly focus on implementing checkpoint cancellation and the cleanup of useless checkpoints.

> Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / checkpoint coordinator to the tasks when it comes to failing a checkpoint. This means that running snapshots on the tasks are not stopped even if their owning checkpoint has already been cancelled. Two examples of cases where this applies are checkpoint timeouts and local checkpoint failures on a task, together with a configuration that does not fail tasks on checkpoint failure. Notice that those running snapshots no longer count towards the maximum number of parallel checkpoints, because their owning checkpoint is considered cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation where the next checkpoint has already started while the abandoned snapshot thread from a previous checkpoint is still running. This scenario can potentially cascade: many parallel checkpoints will slow down checkpointing and make timeouts even more likely.
> A possible solution is introducing a {{cancelCheckpoint}} method as counterpart to the {{triggerCheckpoint}} method in the task manager gateway, which is invoked by the checkpoint coordinator as part of cancelling the checkpoint.
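To make the proposed feedback path concrete, here is a minimal, hypothetical sketch of pairing a {{cancelCheckpoint}} RPC with {{triggerCheckpoint}}. The interface and class names, signatures, and threading below are purely illustrative; they are not Flink's actual {{TaskManagerGateway}} API, nor the mechanism our team implemented internally.

{code:java}
// Hypothetical, simplified sketch of the JM -> TM call surface discussed above.
// It only illustrates the idea of pairing a cancelCheckpoint call with
// triggerCheckpoint so the checkpoint coordinator can tell a task to stop the
// snapshot thread of a checkpoint that has already timed out or failed.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified stand-in for the task manager gateway (illustrative name only). */
interface CheckpointingGateway {

    /** Asks the task to start a snapshot for the given checkpoint id. */
    void triggerCheckpoint(long checkpointId, long timestamp);

    /**
     * Proposed counterpart: asks the task to abort any snapshot work still
     * running for the given checkpoint id, so abandoned snapshot threads do
     * not pile up next to newer checkpoints.
     */
    void cancelCheckpoint(long checkpointId);
}

/** Minimal task-side implementation that tracks running snapshots and can cancel them. */
class TaskCheckpointer implements CheckpointingGateway {

    // checkpointId -> thread currently writing the snapshot for that checkpoint
    private final Map<Long, Thread> runningSnapshots = new ConcurrentHashMap<>();

    @Override
    public void triggerCheckpoint(long checkpointId, long timestamp) {
        Thread snapshotThread = new Thread(() -> {
            try {
                // Placeholder for the real snapshot work (writing operator state).
                Thread.sleep(1_000);
            } catch (InterruptedException e) {
                // Interrupted by cancelCheckpoint: give up on this snapshot.
                Thread.currentThread().interrupt();
            } finally {
                runningSnapshots.remove(checkpointId);
            }
        }, "snapshot-" + checkpointId);
        runningSnapshots.put(checkpointId, snapshotThread);
        snapshotThread.start();
    }

    @Override
    public void cancelCheckpoint(long checkpointId) {
        // Interrupt the snapshot thread so it stops instead of lingering around
        // while newer checkpoints are already running.
        Thread snapshotThread = runningSnapshots.remove(checkpointId);
        if (snapshotThread != null) {
            snapshotThread.interrupt();
        }
    }
}
{code}

In this shape the coordinator would call {{cancelCheckpoint}} for every checkpoint it declares timed out or failed, which is in the same spirit as the notify-based cancellation mentioned above.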