[ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845797#comment-16845797 ]
vinoyang commented on FLINK-8871:
---------------------------------

[~carp84] Can the word "robbery" be used at will? What would you think if I said you had slandered me? Did I change the "assignee" from [~yunta] to myself? Did I say that he can't submit a PR? If a person comments that he could submit a PR, must he submit the PR? The "suggestion" I made means that I advised him to wait. When and where did I say that my suggestion was directed at his plan?

Unfortunately, if you understood the details, you would not say so. According to Stephan's suggestion, notifyCheckpointAbort must wait for FLINK-12477 to be implemented. FLINK-12477 includes the refactoring of notifyComplete, and the implementation of notifyCheckpointAbort is similar to that of notifyComplete. How did you come to the conclusion that they have no relationship?

You are free to express your opinion. Please give your own opinion on the solution instead of using the word "robbery" about me. If you use it, I can only ask you not to comment on me.

> Checkpoint cancellation is not propagated to stop checkpointing threads on
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>   Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager /
> checkpoint coordinator to the tasks when it comes to failing a checkpoint.
> This means that running snapshots on the tasks are not stopped even if
> their owning checkpoint is already cancelled. Two examples of cases where
> this applies are checkpoint timeouts and local checkpoint failures on a task
> together with a configuration that does not fail tasks on checkpoint failure.
> Notice that those running snapshots no longer count toward the maximum
> number of parallel checkpoints, because their owning checkpoint is considered
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation
> where the next checkpoints have already started while the abandoned snapshot
> thread from a previous checkpoint is still running. This
> scenario can potentially cascade: many parallel checkpoints will slow down
> checkpointing and make timeouts even more likely.
>
> A possible solution is introducing a {{cancelCheckpoint}} method as
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway,
> which is invoked by the checkpoint coordinator as part of cancelling the
> checkpoint.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
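
For illustration only, the solution proposed in the issue description could be sketched roughly as below. This is not Flink code: {{TaskGateway}}, {{Task}}, {{Coordinator}}, {{abortCheckpoint}} and {{runningSnapshots}} are hypothetical names invented for this sketch, the snapshot is simulated with a sleep, and only {{triggerCheckpoint}} / {{cancelCheckpoint}} come from the description. It assumes checkpoint ids are unique and that the snapshot work is interruptible.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

final class CancelCheckpointSketch {

    /** Hypothetical, simplified stand-in for the task manager gateway. */
    interface TaskGateway {
        void triggerCheckpoint(long checkpointId);

        /** The proposed counterpart: tells the task to stop a running snapshot. */
        void cancelCheckpoint(long checkpointId);
    }

    /** Hypothetical task that runs each snapshot on its own thread. */
    static final class Task implements TaskGateway {
        // Snapshots currently in flight, keyed by checkpoint id.
        private final Map<Long, Future<?>> running = new ConcurrentHashMap<>();

        // Daemon threads so a lingering snapshot never keeps the JVM alive.
        private final ExecutorService snapshotPool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "snapshot");
            t.setDaemon(true);
            return t;
        });

        @Override
        public void triggerCheckpoint(long checkpointId) {
            FutureTask<Void> snapshot = new FutureTask<>(() -> {
                try {
                    Thread.sleep(60_000); // simulated slow, interruptible state write
                } finally {
                    running.remove(checkpointId); // deregister on completion or interrupt
                }
                return null;
            });
            // Register before starting so a fast cancel can always find the snapshot.
            running.put(checkpointId, snapshot);
            snapshotPool.execute(snapshot);
        }

        @Override
        public void cancelCheckpoint(long checkpointId) {
            Future<?> snapshot = running.remove(checkpointId);
            if (snapshot != null) {
                snapshot.cancel(true); // interrupts the snapshot thread if it has started
            }
        }

        int runningSnapshots() {
            return running.size();
        }
    }

    /** Hypothetical coordinator that propagates an abort to every task. */
    static final class Coordinator {
        private final List<TaskGateway> tasks;

        Coordinator(List<TaskGateway> tasks) {
            this.tasks = tasks;
        }

        void triggerCheckpoint(long checkpointId) {
            tasks.forEach(t -> t.triggerCheckpoint(checkpointId));
        }

        // Would be called on timeout or failure of the owning checkpoint.
        void abortCheckpoint(long checkpointId) {
            tasks.forEach(t -> t.cancelCheckpoint(checkpointId));
        }
    }

    public static void main(String[] args) {
        Task task = new Task();
        Coordinator coordinator = new Coordinator(List.of(task));
        coordinator.triggerCheckpoint(1L);
        System.out.println("running after trigger: " + task.runningSnapshots());
        coordinator.abortCheckpoint(1L);
        System.out.println("running after abort: " + task.runningSnapshots());
    }
}
```

The point of the sketch is that the abort path mirrors the trigger path, so an abandoned snapshot thread is interrupted instead of competing with the next checkpoint. How a real snapshot reacts to interruption, and how this relates to notifyCheckpointAbort and FLINK-12477, is outside what this sketch covers.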